
Yunzhisheng Atlas Supercomputing Platform: Compute Acceleration Practice Based on Fluid + Alluxio (Part I)

2022-06-25 20:31:00 Alluxio

Yunzhisheng is an artificial intelligence services company focused on the Internet of Things. Its AI technology stack covers perception and expression capabilities for signals, speech, images, and text, as well as cognitive technologies such as knowledge, understanding, analysis, and decision-making, and is evolving toward multimodal AI systems. The Yunzhisheng Atlas supercomputing platform is the underlying infrastructure that supports the company's AI model training and inference services across all of these fields. Yunzhisheng began early to build an industry-leading heterogeneous GPU/CPU Atlas computing platform and a distributed file storage system; the computing cluster provides AI workloads with high-performance compute and with storage and access for massive amounts of data. On top of an open-source Kubernetes architecture, the Yunzhisheng team developed the corresponding core functionality and successfully built an AI supercomputing service platform with more than 10 PFLOPS (10^16 floating-point operations per second) of processing power. The platform supports the mainstream machine learning frameworks, allowing developers to work efficiently on core technologies such as speech, language, big data, and multimodality. The platform also opens up its compute and storage capacity, providing customized computing services for small, medium-sized, and micro enterprises and institutions.
The Atlas computing platform adopts a compute-storage separation architecture. Currently, the underlying network between storage servers, between compute servers, and between compute and storage servers is connected with 100 Gbps InfiniBand.
The platform's model training data is stored on multiple PB-scale, high-performance Lustre distributed file systems. Lustre is POSIX-compatible, so the various deep learning frameworks can read data from it directly. The compute-storage separation architecture lets compute and storage scale independently, keeping the overall architecture flexible. However, the platform also ran into problems such as low data-access efficiency and bandwidth bottlenecks in the underlying storage:
Storage bandwidth bottleneck

With storage resources relatively fixed, bandwidth, metadata load, and server load all grew sharply as the number of platform users increased. Multiple single-node training tasks running on the same GPU node competed for IO resources, and this IO contention lengthened the overall training cycle, greatly reducing R&D efficiency.
Massive numbers of small files

The second problem stems from the characteristics of the training datasets themselves. In noise-reduction scenarios, a single user task can involve TB-scale collections of small files, which puts great pressure on the metadata service of the underlying distributed file system. Large numbers of small files also make data reading inefficient in the program itself: slow reads leave the GPUs waiting for data most of the time, so overall GPU utilization is low and model training cycles are prolonged.
Diverse data types

Because the platform supports a wide range of business types, users' data types vary widely and file sizes and formats differ, so no single set of storage tuning parameters can fit every workload. Analyzing users' business types, we found that platform data is mostly used for model training, which accounts for a large share; the rest consists mainly of model inference and CPU-intensive data-generation tasks.
Data redundancy

Datasets on the platform overlap: the same dataset may be used within one group or across different groups yet be stored in multiple copies, wasting storage space.
To deal with the storage bandwidth bottleneck and relieve metadata server pressure with minimal budget and architectural change, the Yunzhisheng Atlas team carried out a series of exploration and development efforts.
Bandwidth limitation

Since massive concurrent reads could push storage bandwidth to its limit and stall or even paralyze the storage system, the platform limited the client bandwidth of each compute node and set per-UID/GID bandwidth limits. However, this approach is inflexible and cannot make full use of the GPUs' computing power: when two IO-heavy training tasks are scheduled onto the same node, the node's bandwidth cap puts a ceiling on both tasks' IO, and the throttled data reading prevents the GPUs from exploiting parallel reads. GPU utilization for such tasks hovers around 40%, a serious waste of hardware resources.
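The actual throttling was enforced at the storage-client level; purely to illustrate the token-bucket idea behind such fixed bandwidth caps, here is a minimal Python sketch that caps the read bandwidth of a file object (the rate and chunk size are arbitrary, not the platform's settings):

```python
import time

class ThrottledReader:
    """Wrap a file object and cap its read bandwidth with a token bucket."""

    def __init__(self, f, rate_bytes_per_s: int, chunk: int = 1 << 20):
        self.f, self.rate, self.chunk = f, rate_bytes_per_s, chunk
        self.tokens, self.last = 0.0, time.monotonic()

    def read(self, n: int = -1) -> bytes:
        n = self.chunk if n < 0 else min(n, self.chunk)
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at one chunk.
        self.tokens = min(self.chunk, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < n:                 # not enough budget: sleep it off
            time.sleep((n - self.tokens) / self.rate)
            self.tokens = 0.0
        else:
            self.tokens -= n
        return self.f.read(n)
```

Because the real cap is node-wide rather than per task, two IO-heavy tasks on one node end up splitting a fixed budget between them, which is exactly why their GPUs sit idle waiting for data.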
Aggregating small files into large files

Given the excessive number of small files on the platform and the pressure they put on metadata, we took two measures. First, we monitor each user's inode count and total storage usage to determine their small-file count, and enforce limits on the number of small files per user. Second, we built a set of data-aggregation tools that let users pack small files into large-file formats such as LMDB and TFRecord.
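As an illustration of the aggregation approach, the following is a minimal sketch (not the platform's actual tool) that packs a directory tree of small files into a single LMDB file with the `lmdb` Python package; the key scheme and sizing are assumptions:

```python
import os
import lmdb

def pack_dir_to_lmdb(src_dir: str, db_path: str, map_size: int = 1 << 40):
    """Pack every file under src_dir into one LMDB file, keyed by relative path."""
    env = lmdb.open(db_path, map_size=map_size)  # map_size: max DB size (1 TB here)
    with env.begin(write=True) as txn:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                key = os.path.relpath(path, src_dir).encode("utf-8")
                with open(path, "rb") as f:
                    txn.put(key, f.read())  # one key-value pair per small file
    env.close()
```

Training code then opens the single LMDB file read-only and fetches samples by key, turning millions of per-file metadata operations into one open.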
Task scheduler refactoring

To keep tasks from clustering on the same node, we customized a task scheduler plug-in with an added scheduling policy: it checks each node's compute resource usage and preferentially schedules tasks to idle nodes, avoiding IO contention between multiple tasks on one node. This scheme, however, cannot help once the platform's compute resources are fully occupied.
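The plug-in itself hooks into the scheduler, but its scoring policy can be sketched language-neutrally; the field names and scoring formula below are ours, for illustration only:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    running_tasks: int   # tasks already running on the node
    io_load: float       # normalized IO utilization, 0.0 (idle) .. 1.0 (saturated)

def score(node: Node) -> float:
    """Higher is better: prefer nodes with few tasks and low IO load."""
    return (1.0 - node.io_load) / (1 + node.running_tasks)

def pick_node(nodes: list[Node]) -> Node:
    # Spread policy: place the task on the idlest node, so IO-heavy tasks
    # avoid sharing a node until the cluster fills up.
    return max(nodes, key=score)

nodes = [Node("gpu-01", 2, 0.8), Node("gpu-02", 0, 0.1), Node("gpu-03", 1, 0.3)]
print(pick_node(nodes).name)  # -> gpu-02
```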
Multi-level cache

To make full use of idle hardware and reduce pressure on the underlying storage system, we developed a first-version caching scheme on the early platform as a transition. It relieved storage pressure to some extent, but its data management was not automated enough, so it could serve only as a temporary solution while we moved to the new architecture.

Yunzhisheng began investigating Alluxio in 2020 and ran a series of tests, covering both functional fit and performance. We found that Alluxio meets our current needs and can resolve several of our pain points quickly and at low cost:

Alluxio FUSE provides a POSIX file system interface, so users can use the distributed cache seamlessly without changing their programs (see the sketch after this list);

Alluxio supports multiple underlying file systems, including distributed file systems and object storage. When we introduce new storage platforms later, Alluxio can still cache them well, keeping our overall cache architecture stable;

Alluxio provides solid cache management. Its tiered storage mechanism can make full use of memory, SSDs, and disks, and its elastic scalability reduces the cost of data-driven applications;

It supports deployment on Kubernetes or as containers, consistent with our existing technology stack;

Alluxio provides HA support, ensuring high availability of the distributed cache system.
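As a concrete illustration of the first point, once Alluxio FUSE is mounted on a compute node, existing training code keeps reading data through ordinary file APIs; the mount path below is an assumption, not our actual layout:

```python
import os

# Assumed Alluxio FUSE mount point: it looks like a local POSIX directory,
# so frameworks read through it with no Alluxio client API at all.
DATA_ROOT = "/mnt/alluxio-fuse/datasets/denoise"

def iter_samples(root: str = DATA_ROOT):
    for name in sorted(os.listdir(root)):
        with open(os.path.join(root, name), "rb") as f:  # plain open()
            yield name, f.read()
```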

Compared with the earlier compute-storage separation architecture, Alluxio introduces a cache layer between compute and storage, shifting the pressure on the underlying storage to the memory or local disks of each compute node. User tasks enjoy the speed advantages of local storage, and the whole platform remains compatible with both the distributed file system and local disks.

When integrating Alluxio on the business side, we ran into problems with permission control and data mounting. Fluid provides a more cloud-native way to use Alluxio: it introduces a new model of dataset management in which cached datasets are Kubernetes-native resources that Kubernetes can allocate and schedule accordingly, effectively solving the earlier problem of the cache being operated independently of Kubernetes.

Ultimately, our architecture uses Alluxio as Fluid's cache acceleration engine: Alluxio handles moving data from the underlying distributed file system to the local cache media of the compute nodes and manages the cache, providing data acceleration for the platform's applications. Fluid handles cache and application scheduling; with Fluid, the platform becomes cache-aware, and many caching operations that previously required manual work are handled intelligently at the platform layer.
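Under Fluid, a cached dataset is declared as a pair of custom resources, a Dataset plus an AlluxioRuntime, which Kubernetes then schedules like any other resource. Below is a minimal sketch using the official kubernetes Python client; the namespace, the local:// mount point, and the tiered-store sizing are illustrative assumptions, not our actual configuration:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION, NS = "data.fluid.io", "v1alpha1", "default"

# Dataset: what to cache (here, a directory on the underlying storage mount).
dataset = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "Dataset",
    "metadata": {"name": "denoise-data", "namespace": NS},
    "spec": {"mounts": [
        {"name": "denoise", "mountPoint": "local:///mnt/lustre/denoise"}
    ]},
}

# AlluxioRuntime: how to cache it (cache replicas and the tiered store).
runtime = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "AlluxioRuntime",
    "metadata": {"name": "denoise-data", "namespace": NS},  # must match the Dataset
    "spec": {
        "replicas": 2,
        "tieredstore": {"levels": [
            {"mediumtype": "MEM", "path": "/dev/shm",
             "quota": "2Gi", "high": "0.95", "low": "0.7"}
        ]},
    },
}

api.create_namespaced_custom_object(GROUP, VERSION, NS, "datasets", dataset)
api.create_namespaced_custom_object(GROUP, VERSION, NS, "alluxioruntimes", runtime)
```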

After introducing the new architecture, we integrated Fluid's functionality into atlasctl, our model-training task submission tool, hiding complex concepts from users as much as possible. A user creates a cached dataset with atlasctl cache create, specifying parameters such as the cache size and cache medium. The tool shields users from the details of the caching mechanism so they can focus on their data and their business.
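atlasctl itself is internal; the following is a hypothetical reconstruction showing only the shape of the flag-to-Fluid mapping, with all flags and defaults invented for illustration:

```python
import argparse

def main() -> None:
    # Hypothetical sketch of `atlasctl cache create`: parse user-facing flags,
    # then build and submit the Dataset/AlluxioRuntime bodies exactly as in
    # the previous sketch, substituting the parsed values.
    p = argparse.ArgumentParser(prog="atlasctl cache create")
    p.add_argument("name", help="name of the cached dataset")
    p.add_argument("--size", default="2Gi", help="cache quota, e.g. 2Gi")
    p.add_argument("--medium", default="MEM", choices=["MEM", "SSD", "HDD"],
                   help="cache medium for the tiered store")
    p.add_argument("--path", required=True, help="underlying data path to cache")
    args = p.parse_args()
    print(f"creating cache {args.name}: {args.size} {args.medium} for {args.path}")

if __name__ == "__main__":
    main()
```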

To be continued.


Copyright notice: this article was created by [Alluxio]; please include a link to the original when reposting: https://yzsam.com/2022/02/202202181403125741.html