Flexible scale out: from file system to distributed file system
2022-06-25 20:33:00 【Blog viewpoint】
We use file systems all the time: when developing software, when browsing the web, and when using a mobile phone.
Non-specialists may have little idea what a file system is, because in everyday use its presence goes largely unnoticed. Even among software developers, many know only a little about file systems.
Although file systems often go unnoticed, they are very important. In Linux, the file system is one of the four subsystems of the kernel. Microsoft DOS (Disk Operating System) is, at its core, a file system for managing disks. Both facts show how important the file system is.
01
Common file systems and classifications
File systems have developed into a rich variety. Besides Ext4, ordinary disk-based local file systems include XFS, ZFS, and Btrfs. Btrfs and ZFS can manage not only a single disk but also multiple disks. Moreover, both provide redundant data management, which avoids data loss caused by a disk failure. Beyond file systems that manage data on disks, there are also network file systems: they appear to be local, but the data actually resides on dedicated remote equipment, and the client accesses it through a network protocol. NFS and GlusterFS are examples. After decades of development there are too many kinds of file systems to introduce one by one, so the main categories are introduced below.
Local file system
A local file system manages disk space and is the most common form of file system. Outwardly, it presents a tree-like directory structure. In essence, it manages disk space and converts between the disk's linear address space and the directory hierarchy, as shown in the figure below.
From an ordinary user's perspective, a local file system makes disk space easier to use and improves how efficiently it is used. Common local file systems include Ext4, Btrfs, XFS, and ZFS.
Pseudo file system
The pseudo file system is a Linux concept and an extension of the traditional file system. A pseudo file system does not persist data; it exists only in memory. It exposes kernel data to users through a file system interface. Common pseudo file systems include proc, sysfs, and configfs.
In Linux, pseudo file systems mainly implement the interaction between the kernel and user space. For example, the commonly used iostat tool essentially obtains its information by reading the /proc/diskstats file, as shown in the figure below. This file belongs to a pseudo file system, and its content is the kernel's disk access statistics, that is, a view of data structures inside the kernel.
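To make the /proc/diskstats example concrete, here is a minimal Python sketch that parses one line of that file into a few counters. The class and function names are made up for illustration, and the sample line contains invented numbers so that the example also runs on systems without /proc; only the column positions follow the kernel's documented layout.

```python
from typing import List, NamedTuple


class DiskStats(NamedTuple):
    device: str
    reads_completed: int
    sectors_read: int
    writes_completed: int
    sectors_written: int


def parse_diskstats_line(line: str) -> DiskStats:
    """Parse one line of /proc/diskstats into a few commonly used counters."""
    fields = line.split()
    # Fields 0 and 1 are the major/minor device numbers, field 2 is the name.
    # The remaining fields are cumulative counters maintained by the kernel.
    return DiskStats(
        device=fields[2],
        reads_completed=int(fields[3]),
        sectors_read=int(fields[5]),
        writes_completed=int(fields[7]),
        sectors_written=int(fields[9]),
    )


def read_diskstats(path: str = "/proc/diskstats") -> List[DiskStats]:
    """Read every device entry from the pseudo file (Linux only)."""
    with open(path) as f:
        return [parse_diskstats_line(line) for line in f if line.strip()]


# Invented sample line, not real measurements.
SAMPLE = "   8       0 sda 1200 30 96000 450 800 20 64000 900 0 700 1350"
stats = parse_diskstats_line(SAMPLE)
print(stats.device, stats.sectors_read, stats.sectors_written)
```

On a Linux machine, read_diskstats() returns the same structure for every block device. No disk is touched to produce the file content: the kernel generates the text when the file is read.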
Network file system
A network file system is a file system based on the TCP/IP protocol suite (the protocol as a whole may span several layers). It lets one computer access the file system of another computer just as it would a local one. A network file system is usually divided into a client and a server: the client resembles a local file system, and the server is the system that manages the data. Using a network file system is no different from using a local one; it only needs to be mounted with the mount command. There are many kinds of network file systems, such as NFS and SMB.
At the user level, a mounted network file system looks the same as a local file system; it is transparent to users. A network file system effectively maps a remote file system into the local one. In the figure below, the client is on the left and the file system server is on the right.
When the file system exported by the server is mounted on the client, the server's directory tree becomes a subtree of the client's directory tree. This subtree is transparent to ordinary users, who do not notice that it is a remote directory, but read/write requests are actually forwarded over the network to the server for processing.
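The idea that a remote tree becomes a subtree of the client's directory tree can be sketched as a lookup in a mount table. The table contents, server name, and function below are hypothetical and only illustrate the routing decision; a real kernel resolves paths component by component rather than by string prefix.

```python
from typing import Dict, Tuple

# Hypothetical mount table: mount point -> backing file system.
MOUNTS: Dict[str, str] = {
    "/": "local:ext4",
    "/mnt/share": "nfs:server1:/export/data",
}


def resolve(path: str) -> Tuple[str, str]:
    """Return (backing file system, path relative to its mount point)."""
    best = "/"
    for mount_point in MOUNTS:
        prefix = mount_point.rstrip("/") + "/"
        matches = path == mount_point or path.startswith(prefix)
        # The longest matching mount point wins.
        if matches and len(mount_point) > len(best):
            best = mount_point
    relative = path[len(best):].lstrip("/")
    return MOUNTS[best], "/" + relative


print(resolve("/home/alice/notes.txt"))
print(resolve("/mnt/share/docs/report.txt"))
```

The second call is routed to the remote server with the path /docs/report.txt, while the application only ever sees /mnt/share/docs/report.txt.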
Cluster file system
A cluster file system is essentially a local file system, except that it is usually built on network-based SAN devices and the SAN disks are shared among multiple nodes. Its most distinctive feature is that the client nodes access the disk media jointly and see a consistent view, as shown in the figure below. View consistency means that a file created on node 0 is immediately visible on node 1 and node 2. This resembles a network file system, where one client can also see changes made by other clients. The two differ, however: a cluster file system is essentially built on the client side, whereas a network file system is built on the server.
In addition, multiple nodes of a cluster file system can serve the application layer at the same time. This suits active-active deployments in particular: the cluster file system provides a high-availability clustering mechanism that keeps the service running when a node goes down.
Distributed file system
A distributed file system is essentially also a kind of network file system. 《Computer science and technology》 defines it as "a file system in which the managed data resources are stored on distributed network nodes and which provides unified access to files". The difference between a distributed file system and a network file system is therefore that the server consists of multiple nodes, that is, the server side can scale horizontally. In use, a distributed file system differs little from a network file system: it is also mounted with the mount command, and the client's data is sent over the network to the server for processing.
The biggest disadvantage of a conventional network file system is that its server cannot scale horizontally, which is almost intolerable for large-scale Internet applications. This article therefore introduces the distributed file systems widely used in the Internet field. Their defining feature is that the server side is implemented as a computer cluster and can scale horizontally, so that storage capacity and performance grow approximately linearly as nodes are added.
02
What is a distributed file system
A distributed file system (DFS) is an extension of the network file system. The key point is that the storage side can scale horizontally and flexibly, that is, the capacity and performance of the storage system can be expanded by adding more devices (mainly servers). At the same time, a distributed file system presents a unified view to the client. Although the service consists of multiple nodes, the client is not aware of this: from its point of view, a single node appears to provide one unified file system.
Among distributed file systems, the most famous is Google's GFS. There are also many open source distributed file systems; well-known and widely used ones include HDFS, GlusterFS, CephFS, MooseFS, and FastDFS.
Distributed file systems can be implemented in many ways. Different file systems usually target different problems, and their architectures differ accordingly. Despite these differences, they share many technical points.
In some cases, network file systems such as NFS are also called distributed file systems. In this article, however, a distributed file system means one whose server side can scale horizontally; that is, its defining feature is that capacity and performance can be increased by adding nodes.
Of course, distributed file systems have much in common with network file systems. For example, a distributed file system is likewise divided into a client file system and a server-side service program. And because client and server are separate, a distributed file system must also implement a protocol similar to the RPC protocols used in network file systems.
In addition, because its data is stored on multiple nodes, a distributed file system has further characteristics, including but not limited to the following.
It places data on multiple nodes according to an established policy.
It ensures that data remains accessible when hardware fails.
It ensures that no data is lost when hardware fails.
It ensures that data is synchronized once failed hardware recovers.
It ensures that data accessed through multiple nodes is consistent.
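The characteristics listed above can be illustrated with a toy in-memory cluster. This is a sketch under simplifying assumptions, not how any particular distributed file system works: the placement policy (hash the key, then take neighboring nodes), the version counter, and all class and method names are invented for the example.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class Node:
    """A storage node holding key -> (version, data) in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.store: Dict[str, Tuple[int, bytes]] = {}


class TinyCluster:
    def __init__(self, names: List[str], replicas: int = 2) -> None:
        self.nodes = [Node(n) for n in names]
        self.replicas = replicas

    def placement(self, key: str) -> List[Node]:
        """Policy: hash the key to a start node, then take the next nodes."""
        digest = hashlib.sha256(key.encode()).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.nodes)
        count = len(self.nodes)
        return [self.nodes[(start + i) % count] for i in range(self.replicas)]

    def _latest(self, key: str) -> Tuple[int, Optional[bytes]]:
        """Newest (version, data) among the live replicas of this key."""
        best: Tuple[int, Optional[bytes]] = (0, None)
        for node in self.placement(key):
            if node.alive and key in node.store and node.store[key][0] > best[0]:
                best = node.store[key]
        return best

    def write(self, key: str, data: bytes) -> None:
        version = self._latest(key)[0] + 1
        for node in self.placement(key):
            if node.alive:  # failed replicas catch up later in resync()
                node.store[key] = (version, data)

    def read(self, key: str) -> Optional[bytes]:
        # Returning the newest live copy keeps reads consistent.
        return self._latest(key)[1]

    def resync(self, key: str) -> None:
        """After recovery, copy the newest version to every live replica."""
        version, data = self._latest(key)
        if data is not None:
            for node in self.placement(key):
                if node.alive:
                    node.store[key] = (version, data)


cluster = TinyCluster(["n0", "n1", "n2"], replicas=2)
cluster.write("/a.txt", b"v1")
first, second = cluster.placement("/a.txt")
first.alive = False                    # simulate a hardware failure
during_failure = cluster.read("/a.txt")
cluster.write("/a.txt", b"v2")         # only the surviving replica is updated
first.alive = True                     # the failed node recovers
cluster.resync("/a.txt")
print(during_failure, first.store["/a.txt"])
```

The data stays readable and is not lost while one replica is down, and the recovered node receives the newer version during resynchronization. A real system also has to handle cases this sketch ignores, such as concurrent writers and the failure of every replica that holds the newest version.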
Because the client of a distributed file system must interact with multiple servers and the system must tolerate server failures, distributed file systems generally implement private protocols rather than using general-purpose protocols such as NFS.
03
Common distributed file systems
Distributed file systems have been implemented in many ways. In fact, some distributed file systems, such as Lustre, existed long before the Internet flourished. Early distributed file systems were used mostly in supercomputing.
With the development of Internet technology, and especially after the publication of Google's GFS paper, distributed file systems developed further. Many distributed file systems today are implemented with reference to that paper, for example HDFS in the big data field and open source distributed file systems such as FastDFS and CephFS.
Among open source distributed file systems, the better-known projects are HDFS in the big data field and the general-purpose CephFS and GlusterFS. These projects are widely used in production. The common distributed file systems are briefly introduced below.
GFS
GFS is Google's distributed file system, which became widely known through the paper The Google File System. GFS does not implement the standard file interface; that is, its interface is not POSIX compatible. It does, however, provide basic operations such as create, delete, open, close, read, and write.
A GFS cluster has two basic node roles. One is the master, which is responsible for file-system-level metadata management. The other is the chunkserver; there are usually many nodes in this role, and they store the actual data. File management in GFS is done on the master, while actual data reads and writes go directly to the chunkservers, which keeps the master from becoming a performance bottleneck.
GFS makes many assumptions in its implementation, for example that the hardware consists of ordinary commodity servers, that files are hundreds of megabytes or larger, and that the workload is dominated by large sequential reads. The assumption about file size is particularly important. Based on it, GFS by default cuts a file into logical blocks (chunks) of 64 MB, and each chunk is assigned a 64-bit handle that is managed by the master.
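With a fixed chunk size, translating a byte offset within a file into a chunk index is simple integer arithmetic. The following sketch shows that calculation for the 64 MB default mentioned above; the function names are illustrative and not part of any GFS interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size cited above


def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the byte at `offset`."""
    return offset // CHUNK_SIZE


def chunk_count(file_size: int) -> int:
    """Number of chunks needed for a file of `file_size` bytes."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Chunk indexes touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return range(0)
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)


MB = 1024 * 1024
print(chunk_count(200 * MB))                       # 4
print(list(chunks_for_range(60 * MB, 10 * MB)))    # [0, 1]
```

A 200 MB file therefore needs four chunks, and a 10 MB read starting at the 60 MB mark spans the first two of them.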
It should be emphasized that the life cycle and location of every chunk are managed by the master, whereas the chunk data itself is stored on the chunkservers. Thanks to this architecture, once the client has obtained a chunk's location and access permission, it can interact with the chunkserver directly, without involving the master, which keeps the master from becoming a bottleneck.
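The separation between the metadata path and the data path can be sketched as follows. All classes, the client-side cache, and the server name "cs1" are invented for illustration; access permissions, replication, and reads that cross a chunk boundary are left out.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024
Location = Tuple[int, List[str]]  # (chunk handle, names of chunkservers)


class Master:
    """Holds metadata only: (file, chunk index) -> chunk location."""

    def __init__(self) -> None:
        self.chunks: Dict[Tuple[str, int], Location] = {}
        self.lookups = 0

    def locate(self, path: str, index: int) -> Location:
        self.lookups += 1
        return self.chunks[(path, index)]


class ChunkServer:
    """Holds chunk data keyed by chunk handle."""

    def __init__(self) -> None:
        self.data: Dict[int, bytes] = {}

    def read(self, handle: int, start: int, length: int) -> bytes:
        return self.data[handle][start:start + length]


class Client:
    def __init__(self, master: Master, servers: Dict[str, ChunkServer]) -> None:
        self.master = master
        self.servers = servers
        self.cache: Dict[Tuple[str, int], Location] = {}

    def read(self, path: str, offset: int, length: int) -> bytes:
        key = (path, offset // CHUNK_SIZE)
        if key not in self.cache:
            # Metadata path: ask the master where the chunk lives.
            self.cache[key] = self.master.locate(*key)
        handle, locations = self.cache[key]
        # Data path: talk to a chunkserver directly, bypassing the master.
        server = self.servers[locations[0]]
        return server.read(handle, offset % CHUNK_SIZE, length)


master = Master()
cs1 = ChunkServer()
cs1.data[0xABC] = b"hello world"
master.chunks[("/logs/a", 0)] = (0xABC, ["cs1"])
client = Client(master, {"cs1": cs1})
print(client.read("/logs/a", 0, 5), client.read("/logs/a", 6, 5), master.lookups)
```

Both reads are served by the chunkserver, yet the master is contacted only once, which is the property that keeps it off the hot path.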
The figure below shows the GFS architecture.
Besides GFS, many distributed file systems have a similar architecture. One example is HDFS in the big data field, a distributed file system dedicated to Hadoop big data storage. Its architecture resembles that of GFS: it consists of one node that manages metadata and multiple nodes that store data, called the namenode and the datanodes respectively.
HDFS is mainly used for processing large files. It cuts a file into pieces of a fixed size and stores them on data nodes. To ensure data reliability, each piece is placed on several different data nodes. Both the size of the pieces and the number of data nodes each piece is placed on (the number of replicas) are configurable.
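The effect of the two configurable values can be worked through with a short calculation. The block size of 128 MB and the replica count of 3 used here are example values chosen for illustration, and the function names are hypothetical.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Sizes of the pieces a file of `file_size` bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    # The last piece holds only the remaining bytes.
    return [block_size] * full + ([rest] if rest else [])


def raw_storage(file_size: int, replication: int) -> int:
    """Bytes consumed across the cluster when every piece is replicated."""
    return file_size * replication


blocks = split_into_blocks(300 * MB, 128 * MB)
print([b // MB for b in blocks])            # [128, 128, 44]
print(raw_storage(300 * MB, 3) // MB)       # 900
```

A 300 MB file is cut into three pieces, and with three replicas it occupies 900 MB of raw capacity across the data nodes.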
Although HDFS is designed for large files, it can also handle small files; files smaller than the cutting unit are simply not cut. HDFS also includes some optimizations for small files, such as HAR and SequenceFile. Still, HDFS is not designed specifically for small files, so its performance with them has some shortcomings.
In addition, there are many open source distributed file systems modeled on GFS, such as FastDFS, MooseFS, and BFS. Most of these projects, however, implement only the most basic file system semantics. Strictly speaking, they are closer to object storage than to distributed file systems.
CephFS
CephFS deserves an introduction because it not only implements the full semantics of a file system but also scales its metadata service horizontally in a multi-active fashion.
The CephFS architecture differs little from that of GFS. Its outstanding feature is that it extends the single active master node of GFS to multiple active nodes. This makes metadata management more flexible, and the load can be balanced dynamically according to the load on the metadata nodes. CephFS can thus scale metadata horizontally by adding nodes and can also adjust node load to make the most of each node's CPU resources.
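The idea of balancing metadata load dynamically can be illustrated with a toy rebalancing step that moves one directory subtree from the busiest metadata node to the least busy one. This is not Ceph's actual balancing algorithm; the load metric, the selection rule, and all names are assumptions made for the example.

```python
from typing import Dict


def rebalance_once(
    assignment: Dict[str, int], load: Dict[str, float], node_count: int
) -> Dict[str, int]:
    """Move one subtree from the busiest metadata node to the idlest one.

    assignment: subtree path -> index of the metadata node serving it
    load:       subtree path -> recent request rate (hypothetical metric)
    """
    totals = [0.0] * node_count
    for subtree, node in assignment.items():
        totals[node] += load[subtree]
    busiest = max(range(node_count), key=lambda n: totals[n])
    idlest = min(range(node_count), key=lambda n: totals[n])
    gap = totals[busiest] - totals[idlest]
    # Moving a subtree heavier than the gap would only shift the overload.
    candidates = [
        s for s, n in assignment.items() if n == busiest and load[s] < gap
    ]
    if not candidates:
        return dict(assignment)
    # Prefer the subtree whose load is closest to half of the gap.
    chosen = min(candidates, key=lambda s: abs(load[s] - gap / 2))
    result = dict(assignment)
    result[chosen] = idlest
    return result


assignment = {"/home": 0, "/var": 0, "/data": 0, "/tmp": 1}
load = {"/home": 50.0, "/var": 30.0, "/data": 20.0, "/tmp": 10.0}
print(rebalance_once(assignment, load, node_count=2))
```

Before the step the two nodes carry loads of 100 and 10; after /home moves to the second node they carry 50 and 60.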
CephFS is also compatible with POSIX semantics and provides two client-side file system implementations, one in kernel space and one in user space. Once CephFS is mounted, it can be used as conveniently as a local file system.
GlusterFS
GlusterFS is a distributed file system with a long history. Its most distinctive feature is that it has no central node; that is, GlusterFS has no dedicated metadata node that manages the metadata of the entire file system.
GlusterFS introduces the concept of a volume. Note that a volume here is not the same concept as a volume in Linux LVM. A GlusterFS volume is an abstraction of a file system and represents one file system instance. Creating a volume on the cluster side actually creates a file system instance.
GlusterFS has several types of volumes, such as replicated volumes, striped volumes, and distributed volumes. By combining the characteristics of these volume types, GlusterFS achieves both data reliability and horizontal scalability.
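How distribution and replication combine without a central metadata node can be sketched as follows: storage units are grouped into replica sets, and every client computes the target set from the file name alone. This is a deliberate simplification (a plain hash modulo the number of sets) rather than the algorithm GlusterFS actually uses, and the storage unit names are invented.

```python
import hashlib
from typing import List


def replica_sets(bricks: List[str], replica: int) -> List[List[str]]:
    """Group consecutive storage units into replica sets of size `replica`."""
    if len(bricks) % replica != 0:
        raise ValueError("unit count must be a multiple of the replica count")
    return [bricks[i:i + replica] for i in range(0, len(bricks), replica)]


def locate(filename: str, sets: List[List[str]]) -> List[str]:
    """Every client derives the same location from the name; no lookup service."""
    digest = hashlib.sha256(filename.encode()).digest()
    return sets[int.from_bytes(digest[:4], "big") % len(sets)]


bricks = ["server1:/b1", "server2:/b1", "server3:/b1", "server4:/b1"]
sets = replica_sets(bricks, replica=2)
print(sets)
print(locate("report.txt", sets))
```

Each file lands on exactly one replica set, so a copy survives the loss of one server in that set, while adding more replica sets spreads files over more servers.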
This article is excerpted from the book 《File system technology insider: The way of massive data storage in the era of big data》, newly published in 2022. Readers who want to learn more about file systems are welcome to read it.
▊《File system technology insider: The way of massive data storage in the era of big data》
By Zhangshuning
A file system classic worth studying and keeping for practitioners in related frontier fields, professional programmers, and architects.
To give readers a deeper understanding of how file systems work, the book introduces not only the principles and key technologies of file systems but also their implementation details, using open source projects as examples. Finally, it introduces object storage, which is widely used in the Internet field: how it handles massive numbers of access requests and the architecture that lets it store massive amounts of data. The hope is that readers will gain a comprehensive and deep understanding of file systems. The book can serve as a guide for developers of file systems and other storage systems, or as a reference for software architects, programmers, and Linux operations staff.