
Flexible scale out: from file system to distributed file system

2022-06-25 20:33:00 Blog viewpoint

We use file systems all the time: when developing software, when browsing the web, and even when playing on a mobile phone.

Non-professionals may have no idea what a file system is, because we generally do not notice the file system while using it. Even among program developers, many know only a little about file systems.

Although file systems often go unnoticed, they are very important. In Linux, the file system is one of the four major subsystems of the kernel. Microsoft DOS (Disk Operating System) is, at its core, a file system that manages disks. Both examples show how important the file system is.

01

Common file systems and classifications

File systems have developed to the point where they come in many varieties. For example, besides Ext4, ordinary disk-based local file systems include XFS, ZFS, and Btrfs. Btrfs and ZFS can manage not only a single disk but also multiple disks. Moreover, these two file systems implement redundant data management, which can avoid data loss caused by disk failure. Besides file systems that manage data on disks, there are also network file systems. These file systems appear to be local, but the data actually resides on dedicated remote equipment, and the client accesses it through a network protocol; NFS and GlusterFS are examples. After decades of development there are too many kinds of file systems to introduce one by one, so the main ones are introduced below.

Local file system

A local file system is a file system that manages disk space, and it is the most common form of file system. To the user, a local file system appears as a tree-like directory structure. In essence, it manages disk space and converts between the linear space of the disk and the directory hierarchy, as shown in the figure below.
[Figure: conversion between the linear disk space and the directory hierarchy]

From an ordinary user's perspective, the local file system mainly makes disk space easier to use and improves how efficiently it is used. Common local file systems include Ext4, Btrfs, XFS, and ZFS.
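The conversion between a linear disk space and a directory hierarchy can be illustrated with a toy model. The sketch below is not how Ext4 or any real file system lays out data; the `disk`, `inodes`, `lookup`, and `read_file` names are hypothetical and exist only to show the idea that directories map names to inodes and inodes map to block numbers.

```python
# Toy model: the "disk" is a linear list of fixed-size blocks, and a small
# inode table (a hypothetical structure) maps paths onto block numbers.
BLOCK_SIZE = 4  # bytes per block, tiny on purpose

disk = [b"hell", b"o wo", b"rld!", b"\x00\x00\x00\x00"]  # linear block space

# inode number -> metadata; a directory stores name -> inode number
inodes = {
    1: {"type": "dir", "entries": {"home": 2}},
    2: {"type": "dir", "entries": {"note.txt": 3}},
    3: {"type": "file", "size": 12, "blocks": [0, 1, 2]},
}
ROOT_INODE = 1


def lookup(path):
    """Walk the directory tree from the root and return the inode number."""
    ino = ROOT_INODE
    for name in [part for part in path.split("/") if part]:
        node = inodes[ino]
        if node["type"] != "dir":
            raise NotADirectoryError(path)
        ino = node["entries"][name]
    return ino


def read_file(path):
    """Concatenate the file's blocks and trim to the recorded file size."""
    node = inodes[lookup(path)]
    data = b"".join(disk[block] for block in node["blocks"])
    return data[: node["size"]]
```

Calling `read_file("/home/note.txt")` walks two directories and then reads three blocks from the linear block list.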

Pseudo file system

The pseudo file system is a Linux concept and an extension of the traditional file system. A pseudo file system does not persist data; it lives in memory and exposes kernel data to users through the file system interface. Common pseudo file systems include proc, sysfs, and configfs.

In Linux, pseudo file systems mainly implement the interaction between the kernel and user space. For example, the widely used iostat tool essentially obtains its information by reading the /proc/diskstats file, as shown in the figure below. This file belongs to a pseudo file system, but its content is actually the kernel's disk access statistics, an instance of certain data structures in the kernel.
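To make this concrete, here is a minimal Python sketch that reads the same file from user space. It assumes the field order described in the Linux kernel's I/O statistics documentation (major, minor, device name, then the counters), and it extracts only four counters. The function names are hypothetical, and iostat itself does considerably more, such as computing rates over an interval.

```python
def parse_diskstats(text):
    """Parse the text of /proc/diskstats into {device: counters}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpected lines
        device = fields[2]
        stats[device] = {
            "reads_completed": int(fields[3]),
            "sectors_read": int(fields[5]),
            "writes_completed": int(fields[7]),
            "sectors_written": int(fields[9]),
        }
    return stats


def read_diskstats(path="/proc/diskstats"):
    """Read the pseudo file like any ordinary file (Linux only)."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Nothing here touches a disk directly: the kernel generates the file's content on each read, which is why an ordinary `open` is enough.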

Network file system

A network file system is a file system based on the TCP/IP protocol suite (the protocol as a whole may span several layers) that allows one computer to access the file system of another computer just as it accesses a local file system. A network file system is usually divided into a client and a server: the client resembles a local file system, and the server is the system that manages the data. Using a network file system is no different from using a local one; you simply mount it with the mount command. There are many kinds of network file systems, such as NFS and SMB.

At the user level, a mounted network file system looks just like a local file system; the difference is transparent to users. A network file system effectively maps a remote file system into the local one. In the figure below, the client is on the left and the file system server is on the right.
[Figure: network file system, with the client on the left and the file system server on the right]

When a file system exported by the server is mounted on the client, the server's directory tree becomes a subtree of the client's directory tree. This is transparent to ordinary users, who do not notice that the subdirectory is remote, but read/write requests are actually forwarded over the network to the server for processing.
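How a mounted remote tree becomes a subtree of the client's tree can be sketched as a longest-prefix match over a mount table. This is only an illustration of the idea, not the kernel's actual lookup code; the `MOUNTS` table, the `server.example:/export` source, and the `resolve` function are all hypothetical.

```python
# Hypothetical mount table: mount point -> (file system type, source)
MOUNTS = {
    "/": ("ext4", "/dev/sda1"),
    "/mnt/nfs": ("nfs", "server.example:/export"),
}


def resolve(path):
    """Return (fs_type, source, path_inside_fs) for the deepest matching mount."""
    best = "/"
    for mount_point in MOUNTS:
        prefix = mount_point.rstrip("/") + "/"
        matches = path == mount_point or path.startswith(prefix)
        if matches and len(mount_point) > len(best):
            best = mount_point
    fs_type, source = MOUNTS[best]
    inner = path[len(best.rstrip("/")):] or "/"
    return fs_type, source, inner
```

With this table, `resolve("/mnt/nfs/data/a.txt")` selects the NFS mount and yields `/data/a.txt` as the path to request from the server, while `resolve("/home/u")` stays on the local file system.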

Cluster file system

A cluster file system is essentially a local file system, except that it is usually built on network-based SAN devices, and the SAN disks are shared among multiple nodes. Its most notable feature is that client nodes can access the disk media jointly and see a consistent view, as shown in the figure below. View consistency means that if a file is created on node 0, it is immediately visible on node 1 and node 2. This resembles a network file system, where one client can also see changes made by other clients. The two differ, however: a cluster file system is essentially built on the client side, whereas a network file system is built on the server.
[Figure: cluster file system, with multiple nodes sharing access to the same SAN disks]

In addition, multiple nodes of a cluster file system can provide file system services to the application layer at the same time, which makes it especially suitable for active-active service deployments. The cluster file system provides a high-availability clustering mechanism that avoids service failure caused by downtime.

Distributed file system

In essence, a distributed file system is also a kind of network file system. The definition given in 《Computer science and technology》 is "a file system whose managed data resources are stored on distributed network nodes and which provides unified access to files". The difference between a distributed file system and a network file system is therefore that the server side contains multiple nodes, that is, the server side can scale horizontally. In use, a distributed file system is much like a network file system: it is also mounted with the mount command, and client data is transmitted over the network to the server side for processing.

The biggest disadvantage of a conventional network file system is that the server cannot scale horizontally, which is almost intolerable for large-scale Internet applications. This article introduces distributed file systems, which are widely used in the Internet field. Their most important feature is that the server side is implemented as a computer cluster and can scale horizontally, so the capacity and performance of the storage side grow approximately linearly as nodes are added.

02

What is a distributed file system

A distributed file system (DFS) is an extension of the network file system whose key point is that the storage side can flexibly scale horizontally; that is, the capacity and performance of the storage system are expanded by adding equipment (mainly servers). At the same time, a distributed file system presents a unified view to the client. Although the service consists of multiple nodes, the client is not aware of this. From the client's point of view, there appears to be a single node providing a single unified file system.

Among distributed file systems, the most famous is Google's GFS. There are also many open source distributed file systems; well-known and widely used ones include HDFS, GlusterFS, CephFS, MooseFS, and FastDFS.

There are many ways to implement a distributed file system. Different file systems are often designed to solve different problems, and their architectures differ as well. Despite these differences, distributed file systems share many technical points.

In some contexts, NFS and other network file systems are also called distributed file systems. In this article, however, a distributed file system means a file system whose server side can scale horizontally. In other words, its defining feature is that capacity and performance can be increased by adding nodes.

Of course, distributed file systems have much in common with network file systems. For example, a distributed file system is also divided into a client-side file system and a server-side service program. And because the client and server are separate, a distributed file system must also implement a protocol similar to the RPC protocol used in network file systems.

In addition, because its data is stored on multiple nodes, a distributed file system has other characteristics, including but not limited to the following.

Data can be placed on multiple nodes according to established policies.

Data remains accessible when hardware fails.

No data is lost when hardware fails.

Data is resynchronized when failed hardware recovers.

Data accessed through multiple nodes remains consistent.
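The first three properties in this list can be illustrated with a small placement sketch. It assumes one possible policy, a stable hash that picks a starting node followed by its successors, which is not taken from any particular file system; the `place` and `readable_from` functions and the node names are hypothetical.

```python
import hashlib


def place(key, nodes, replicas=3):
    """Pick `replicas` distinct nodes for `key` using a stable hash."""
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    ordered = sorted(nodes)  # make the result independent of input order
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(ordered)
    return [ordered[(start + i) % len(ordered)] for i in range(replicas)]


def readable_from(key, nodes, failed, replicas=3):
    """Return the replicas of `key` that are still reachable after failures."""
    return [node for node in place(key, nodes, replicas) if node not in failed]
```

With three replicas, the data stays readable as long as at least one of the three chosen nodes survives. Resynchronization after recovery and cross-node consistency need additional machinery that this sketch does not attempt.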

Because a distributed file system requires the client to interact with multiple servers and must tolerate server failures, distributed file systems generally implement private protocols rather than general-purpose protocols such as NFS.

03

Common distributed file systems

Distributed file systems are implemented in many ways. Some, such as Lustre, existed long before the Internet flourished. Early distributed file systems were used mostly in supercomputing.

With the development of Internet technology, and especially after Google published its GFS paper, distributed file systems developed further. Today many distributed file systems are implemented with reference to Google's GFS paper. Examples include HDFS in the big data field and open source distributed file systems such as FastDFS and CephFS.

Among open source distributed file systems, the better-known projects are HDFS in the big data field and the general-purpose CephFS and GlusterFS. These projects are widely used in production. The common distributed file systems are briefly introduced below.

GFS

GFS is Google's distributed file system, which became widely known through the paper The Google File System. GFS does not provide a standard file interface; that is, its interface is not POSIX compatible. It does, however, include basic operations such as create, delete, open, close, read, and write.

GFS cluster nodes have two basic roles. One is the master, which is responsible for file-system-level metadata management. The other is the chunkserver, a role usually filled by many nodes, which stores the actual data. File management in GFS is done on the master, while actual data reads and writes interact directly with the chunkservers, which prevents the master from becoming a performance bottleneck.

The GFS implementation makes many assumptions, for example that the hardware consists of ordinary commodity servers, that files are hundreds of megabytes or larger, and that the workload is dominated by large sequential reads. The assumption about file size is particularly important. Based on it, GFS by default cuts files into logical blocks (chunks) of 64 MB. Each chunk is assigned a 64-bit handle, which is managed by the master.
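The arithmetic implied by a fixed 64 MB chunk size can be sketched in a few lines of Python. This only illustrates how a byte offset maps to a chunk index; the function names are hypothetical and are not taken from GFS itself.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size stated above


def chunk_index(offset):
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE


def chunk_count(file_size):
    """Number of chunks needed for a file (the last one may be partial)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def split_range(offset, length):
    """Split a byte range into (chunk_index, offset_in_chunk, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index, inner = divmod(offset, CHUNK_SIZE)
        n = min(CHUNK_SIZE - inner, end - offset)
        pieces.append((index, inner, n))
        offset += n
    return pieces
```

A read that crosses a chunk boundary is split into one piece per chunk, and each piece can then be served by whichever chunkserver holds that chunk.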

It should be emphasized that the life cycle and location of every chunk are managed by the master, but the chunk data is stored on the chunkservers. Because of this architecture, once the client has obtained a chunk's location and access permission, it can interact with the chunkserver directly without involving the master, which prevents the master from becoming a bottleneck.
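The separation between metadata and data paths can be sketched with three toy classes. This is an in-memory illustration under simplifying assumptions (a tiny chunk size, no replication, no permissions, no networking); the `Master`, `ChunkServer`, and `Client` classes and the `lookups` counter are hypothetical and only show that the master is consulted for locations while data is read from chunkservers.

```python
CHUNK_SIZE = 8  # tiny chunk size so the example is easy to follow


class ChunkServer:
    """Stores chunk data, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


class Master:
    """Holds metadata only: file name -> list of (handle, chunkserver name)."""

    def __init__(self):
        self.files = {}
        self.lookups = 0  # counts metadata requests, for illustration

    def locate(self, name, index):
        self.lookups += 1
        return self.files[name][index]


class Client:
    def __init__(self, master, servers):
        self.master = master
        self.servers = servers
        self.cache = {}  # (file name, chunk index) -> (handle, server name)

    def read(self, name, offset, length):
        out = b""
        end = offset + length
        while offset < end:
            index, inner = divmod(offset, CHUNK_SIZE)
            key = (name, index)
            if key not in self.cache:  # ask the master only on a cache miss
                self.cache[key] = self.master.locate(name, index)
            handle, server = self.cache[key]
            n = min(CHUNK_SIZE - inner, end - offset)
            out += self.servers[server].read(handle, inner, n)  # data path
            offset += n
        return out


servers = {"cs1": ChunkServer(), "cs2": ChunkServer()}
servers["cs1"].chunks[0x01] = b"abcdefgh"
servers["cs2"].chunks[0x02] = b"ijkl"
master = Master()
master.files["/demo"] = [(0x01, "cs1"), (0x02, "cs2")]
client = Client(master, servers)
```

Reading the same range twice through `client.read` leaves `master.lookups` unchanged on the second call, because the cached locations let the client go straight to the chunkservers.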

The GFS architecture is shown schematically below.
[Figure: schematic of the GFS architecture]

Besides GFS, many distributed file systems have a similar architecture. One example is HDFS, a distributed file system dedicated to big data storage for Hadoop. Its architecture is similar to that of GFS: it consists of one node that manages metadata and multiple nodes that store data, called the namenode and the datanodes respectively.

HDFS is mainly used to handle large files. It cuts a file into fixed-size pieces and stores them on data nodes. To ensure data reliability, each piece is placed on several different data nodes. Both the size of the pieces and the number of data nodes each piece is placed on (the replication factor) are configurable.
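The effect of these two settings on storage can be estimated with a short calculation. The sketch assumes a 128 MB block size and a replication factor of 3, which are the defaults for `dfs.blocksize` and `dfs.replication` in recent Hadoop releases; a given cluster may be configured differently, and the function names are hypothetical.

```python
MB = 1024 * 1024


def hdfs_blocks(file_size, block_size=128 * MB):
    """Sizes of the blocks a file of `file_size` bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def raw_bytes(file_size, replication=3, block_size=128 * MB):
    """Total bytes stored across data nodes, counting every replica."""
    return sum(hdfs_blocks(file_size, block_size)) * replication
```

Under these assumptions a 300 MB file becomes two full blocks plus one 44 MB block and occupies 900 MB of raw capacity, while a 1 MB file stays a single 1 MB block.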

Although HDFS is designed for large files, it can also handle small files; files smaller than the cutting unit are simply not cut. HDFS also provides some optimizations for small files, such as HAR and SequenceFile, but since it was not designed for small files, its performance with them still falls short.

In addition, there are many open source distributed file systems modeled on GFS, such as FastDFS, MooseFS, and BFS. Most of these projects implement only the most basic file system semantics, however, so strictly speaking they are closer to object storage than to distributed file systems.

CephFS

CephFS deserves an introduction because it not only implements the full semantics of a file system but also achieves active-active horizontal scaling of the metadata service.

The CephFS architecture does not differ much from the GFS architecture. Its outstanding feature is that it extends the single active master node of GFS to multiple active nodes. This makes metadata management more scalable, and the load can be balanced dynamically according to the load on the metadata nodes. CephFS can therefore scale metadata horizontally by adding nodes and can also adjust node load to make the most of each node's CPU resources.
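The idea of balancing metadata load across several active nodes can be sketched as follows. This is a toy illustration, not the CephFS balancer: it assumes that directories are the unit of migration, that each directory has a known request rate, and that one directory moves per step. The `rebalance` function and its parameters are hypothetical.

```python
def rebalance(assignment, load, servers):
    """Move the hottest directory from the busiest metadata server to the idlest.

    assignment: directory -> metadata server name
    load:       directory -> request rate (a hypothetical metric)
    servers:    all metadata server names, including idle ones
    """
    per_server = {server: 0 for server in servers}
    for directory, server in assignment.items():
        per_server[server] += load[directory]
    busiest = max(per_server, key=per_server.get)
    idlest = min(per_server, key=per_server.get)
    if busiest == idlest:
        return assignment  # already balanced (or only one server)
    candidates = [d for d, s in assignment.items() if s == busiest]
    hottest = max(candidates, key=lambda d: load[d])
    gap = per_server[busiest] - per_server[idlest]
    if load[hottest] >= gap:
        return assignment  # moving it would not reduce the imbalance
    updated = dict(assignment)
    updated[hottest] = idlest
    return updated
```

Adding a new, empty metadata server to `servers` makes it the idlest node, so the next step shifts work onto it, which is the horizontal scaling effect described above.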

CephFS is also compatible with POSIX semantics and provides two client implementations, one in kernel space and one in user space. Once CephFS is mounted, it can be used as conveniently as a local file system.

GlusterFS

GlusterFS is a distributed file system with a long history. Its most notable feature is that it has no central node; that is, GlusterFS has no dedicated metadata node that manages the metadata of the entire file system.

GlusterFS introduces the abstraction of a volume. Note that a volume here is not the same concept as a volume in Linux LVM. Here a volume is an abstraction of the file system and represents a file system instance: creating a volume on the cluster side actually creates a file system instance.

GlusterFS has many types of volumes, such as replicated volumes, striped volumes, and distributed volumes. By combining the characteristics of these volume types, GlusterFS achieves data reliability and horizontal scaling.
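How a file can be located without a metadata node, and how distribution and replication combine, can be sketched with a hash over the file name. This is a simplification: GlusterFS uses its own hashing scheme with hash ranges assigned per directory, whereas the sketch substitutes `zlib.crc32` and a modulo. The function names and brick names are hypothetical.

```python
import zlib


def pick_replica_set(filename, replica_sets):
    """Choose a replica set from the file name alone, with no metadata server."""
    return replica_sets[zlib.crc32(filename.encode()) % len(replica_sets)]


def write_targets(filename, bricks, replica=2):
    """Group bricks into replica sets, then return the bricks holding `filename`."""
    if len(bricks) % replica:
        raise ValueError("brick count must be a multiple of the replica count")
    sets = [bricks[i:i + replica] for i in range(0, len(bricks), replica)]
    return pick_replica_set(filename, sets)
```

Any client that knows the brick list computes the same answer on its own: distribution across replica sets provides horizontal scaling, and replication within a set provides reliability.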

This article is excerpted from the book 《File system technology insider: The way of massive data storage in the era of big data》, newly published in 2022. Readers are welcome to consult the book to learn more about file systems.

▊《File system technology insider: The way of massive data storage in the era of big data》

Written by Zhangshuning

A file system classic worth studying and keeping for practitioners in related frontier fields, professional programmers, and architects

To give readers a deeper understanding of how file systems work, this book introduces not only the principles and key technologies of file systems but also their implementation details, drawing on open source projects. Finally, the book covers object storage, which is widely used in the Internet field, including how it carries massive numbers of access requests and the architecture that lets it store massive amounts of data. The book aims to give readers a comprehensive and deep understanding of file systems. It can serve as a guide for developers of file systems and other storage systems, or as a reference for software architects, programmers, and Linux operations staff.

Copyright notice
This article was written by [Blog viewpoint]. Please include a link to the original when reposting.
https://yzsam.com/2022/02/202202181347416341.html