
Cloud-Native Practice of Meituan's Cluster Scheduling System

2022-06-23 21:37:00 Meituan technical team

This article describes how Meituan manages clusters at large scale and its practice in designing a sound cluster scheduling system. It explains the concerns, challenges, and corresponding strategies involved in Meituan's adoption of cloud-native technologies, represented by Kubernetes, and introduces some special support built for Meituan's business scenarios. We hope it helps or inspires readers interested in the cloud-native field.

Introduction

The cluster scheduling system plays an important role in the enterprise data center. As the number of clusters and applications keeps growing, the complexity developers face in handling business problems has increased significantly. How do we manage clusters at scale, design a sound cluster scheduling system, ensure stability, reduce cost, and improve efficiency? This article answers these questions one by one.

| Note: This article was first published in New Programmer 003, in the developer column of the cloud-native era.

An introduction to cluster scheduling systems

A cluster scheduling system, also known as a data center resource scheduling system, is widely used to solve resource management and task scheduling problems in the data center. Its goal is to use data center resources effectively, improve resource utilization, and provide business teams with automated operations capabilities, reducing the operations and management cost of services. Well-known cluster scheduling systems in the industry include the open-source OpenStack, YARN, Mesos, and Kubernetes, as well as in-house systems at major Internet companies such as Google's Borg, Microsoft's Apollo, Baidu's Matrix, and Alibaba's Fuxi and ASI.

The cluster scheduling system is the core IaaS infrastructure of an Internet company and has gone through several architectural evolutions over the past decade. As business architectures evolved from monoliths to SOA (service-oriented architecture) and then microservices, the underlying IaaS facilities also gradually moved from the era of bare-metal physical machines to the era of containers. Although the core problems have not changed throughout this evolution, the rapid growth of cluster scale and application count has made their complexity grow exponentially. This article explains the challenges of large-scale cluster management and the design considerations of a cluster scheduling system. Taking the implementation of Meituan's cluster scheduling system as an example, it discusses a series of cloud-native practices: building a unified multi-cluster scheduling service, continuously improving resource utilization, providing Kubernetes engine services to empower PaaS components, and delivering a better computing experience for the business.

Problems of large-scale cluster management

As we all know, rapid business growth brings a sharp increase in server scale and the number of data centers. For developers, in the business scenario of a large-scale cluster scheduling system, two problems must be solved:

  1. How to manage data center deployments at scale, especially across data centers: achieving elastic resource scheduling, improving resource utilization as much as possible while guaranteeing application service quality, and reducing data center costs.
  2. How to transform the underlying infrastructure and build a cloud-native operating system for business teams: improving the computing service experience, automating disaster recovery, deployment, and upgrades for applications, and reducing the mental burden of managing underlying resources so that business teams can focus on the business itself.

Challenges of operating large-scale clusters

To solve these two problems in a real production environment, they can be broken down into the following four challenges of operating and managing large-scale clusters:

  1. How to meet diverse user needs and respond quickly. Business scheduling requirements and scenarios are rich and dynamic. As a platform service, the cluster scheduling system must, on one hand, deliver features quickly and meet business needs in a timely manner; on the other hand, it must stay general enough, abstracting individualized business requirements into common capabilities that can be implemented on the platform and iterated over the long term. This is a real test of the platform team's technical roadmap: a careless team falls into endless business feature development, meeting business needs but causing low-value, repetitive work.
  2. How to improve the resource utilization of online application data centers while guaranteeing application service quality. Resource scheduling has long been a recognized hard problem in the industry. With the rapid growth of the cloud computing market, cloud vendors keep increasing their investment in data centers, and the low resource utilization of those data centers makes the problem worse. A Gartner survey found that the CPU utilization of data center servers worldwide is only 6%~12%; even Amazon's Elastic Compute Cloud (EC2) achieves only 7%~17% resource utilization, which shows how serious the waste is. The root cause is that online applications are very sensitive to resource utilization, so the industry has to reserve extra resources to guarantee the quality of service (QoS) of important applications. The scheduler must eliminate interference between co-scheduled applications and achieve resource isolation between them.
  3. How to provide automated handling of instance failures for applications, especially stateful ones, shielding differences between data centers and reducing users' exposure to the underlying layers. As application scale keeps growing and the cloud computing market matures, distributed applications are often deployed in data centers in different regions, or even across different cloud environments, in multi-cloud or hybrid-cloud deployments. The cluster scheduling system needs to provide a unified infrastructure for the business, implement a hybrid multi-cloud architecture, and shield the underlying heterogeneous environments, while reducing the complexity of application operations, improving application automation, and providing a better operations experience.
  4. How to handle the performance and stability risks of cluster management when a single cluster grows too large or the number of clusters grows too fast. The complexity of cluster lifecycle management increases with cluster size and count. Take Meituan as an example: the multi-region, multi-data-center, multi-cluster scheme we adopted avoids, to some extent, the risks of an oversized cluster and solves problems such as business isolation and regional latency. But with the development of edge cluster scenarios and the cloud-hosting needs of database-like PaaS components, the number of small clusters will foreseeably rise, bringing significant increases in management complexity, monitoring configuration cost, and operations cost. The cluster scheduling system then needs to provide more effective operating rules and guarantee operational safety, alert self-healing, and change efficiency.

Trade-offs in designing a cluster scheduling system

To address these challenges, a good cluster scheduler plays a key role. But in reality there is no perfect system, so when designing a cluster scheduling system, we need to make trade-offs among several tensions according to the actual scenario:

  1. System throughput versus scheduling quality. System throughput is an important metric for evaluating a system, but in a cluster scheduling system serving online services, scheduling quality matters more. The impact of each scheduling decision is long-lived (days, weeks, or even months) and is not adjusted except in abnormal situations, so a wrong scheduling decision directly increases service latency. The higher the scheduling quality, the more constraints must be considered, the worse the scheduling performance, and the lower the system throughput.
  2. Architectural complexity versus scalability. The more functions and configurations the system opens to the upper PaaS layer and its users, and the more features it supports to improve user experience (for example, preemptive reclamation of application resources and self-healing of application instances), the higher the system's complexity and the more likely its subsystems are to conflict with each other.
  3. Reliability versus single-cluster size. The larger a single cluster, the larger the schedulable scope, but also the greater the challenge to cluster reliability, because the blast radius grows and failures have a bigger impact. With smaller clusters, scheduling concurrency improves, but the schedulable scope shrinks, scheduling failures become more likely, and cluster management complexity grows.

Currently, cluster scheduling systems in the industry can be divided by architecture into five types: monolithic schedulers, two-level schedulers, shared-state schedulers, distributed schedulers, and hybrid schedulers (see Figure 1 below). Each makes different choices according to its own scenario needs, and none is absolutely better than the others.

Figure 1: Classification of cluster scheduling system architectures (from Malte Schwarzkopf, "The evolution of cluster scheduler architectures")

  • A monolithic scheduler uses complex scheduling algorithms combined with the cluster's global state to compute high-quality placements, but at high latency. Examples are Google's Borg and the open-source Kubernetes.
  • A two-level scheduler addresses the limitations of the monolithic scheduler by separating resource scheduling from job scheduling. It allows different job scheduling logic per application while keeping the ability to share cluster resources between jobs, but it cannot implement preemption by high-priority applications. Representative systems are Apache Mesos and Hadoop YARN.
  • A shared-state scheduler addresses the limitations of the two-level scheduler in a semi-distributed way: each scheduler holds a copy of the cluster state and updates it independently. Once a local state copy changes, the state of the whole cluster is updated, but continuous resource contention degrades scheduler performance. Representative systems are Google's Omega and Microsoft's Apollo.
  • A distributed scheduler uses relatively simple scheduling algorithms to achieve large-scale, high-throughput, low-latency parallel task placement, but the simple algorithms lack a global view of resource usage and struggle to achieve high-quality placements. A representative system is UC Berkeley's Sparrow.
  • A hybrid scheduler spreads the workload across centralized and distributed components, using complex algorithms for long-running tasks and relying on distributed placement for short-running tasks. Microsoft's Mercury takes this approach.

Therefore, how good a scheduling system is depends mainly on the actual scheduling scenario. Take YARN and Kubernetes, the most widely used systems in the industry: although both are general resource schedulers, YARN focuses on short-running offline batch tasks, while Kubernetes focuses on long-running online services. Beyond the differences in architecture and features (Kubernetes is a monolithic scheduler, YARN a two-level one), their design philosophies and perspectives also differ. YARN is task-oriented: it focuses on resource reuse and avoiding multiple remote copies of data, aiming to execute tasks at lower cost and higher speed. Kubernetes is service-state-oriented: it focuses on peak staggering, service profiling, and resource isolation, aiming to guarantee quality of service.

The evolution of Meituan's cluster scheduling system

During its containerization journey, Meituan switched the core engine of its cluster scheduling system from OpenStack to Kubernetes based on business scenario requirements, and by the end of 2019 reached its stated goal of containerizing online services with over 98% coverage. But it still faced low resource utilization and high operations cost:

  • Overall cluster resource utilization was not high. Average CPU utilization, for example, was only at the industry average, with a clear gap behind other top-tier Internet companies.
  • The containerization rate of stateful services was insufficient. In particular, products such as MySQL and Elasticsearch did not use containers, leaving large room for optimization in operations cost and resource cost.
  • From the perspective of business needs, VM products will exist for a long time, and VM scheduling and container scheduling were two separate environments, making the team's operations cost for virtualization products high.

Therefore, we decided to start the cloud-native transformation of the cluster scheduling system: building a large-scale, highly available scheduling system with multi-cluster management and automated operations, support for scheduling policy recommendation and self-service configuration, and cloud-native low-level extension capabilities, improving resource utilization while guaranteeing application service quality. The core work was organized around three directions: maintaining stability, reducing cost, and improving efficiency.

  • Maintaining stability: improve the robustness and observability of the scheduling system; reduce coupling between modules and lower complexity; improve the automated operations capability of the multi-cluster management platform; optimize the performance of the system's core components; and guarantee the availability of large-scale clusters.
  • Reducing cost: deeply optimize the scheduling model and connect cluster scheduling with single-machine scheduling. Move from static resource scheduling to dynamic resource scheduling and introduce colocated offline business containers, combining free competition with strong control, improving resource utilization and reducing IT cost while guaranteeing the service quality of high-priority business applications.
  • Improving efficiency: support users in adjusting scheduling policies themselves to meet individualized business needs; actively embrace the cloud-native field and provide PaaS components with core capabilities including orchestration, scheduling, cross-cluster management, and high availability, improving operations efficiency.

Figure 2: Architecture of Meituan's cluster scheduling system

In the end, the architecture of Meituan's cluster scheduling system was divided by domain into three layers (see Figure 2 above): the scheduling platform layer, the scheduling policy layer, and the scheduling engine layer:

  • The platform layer is responsible for business access, opening up infrastructure capabilities, encapsulating native interfaces and logic, and providing container management interfaces (scale-out, update, restart, scale-in) and the like.
  • The policy layer provides unified multi-cluster scheduling, continuously optimizing scheduling algorithms and policies. Combined with information such as the business's service tier and sensitive resources, it raises CPU utilization and allocation rates through service classification.
  • The engine layer provides the Kubernetes service, guaranteeing the cloud-native cluster stability of multiple PaaS components, and sinks common capabilities into the orchestration engine to reduce the cost of the business's cloud-native adoption.

Through fine-grained operation and product polishing, we have on one hand unified the management of nearly one million Meituan container and virtual machine instances, and on the other hand raised resource utilization from the industry average to a first-tier level, while also supporting the containerization and cloud-native adoption of PaaS components.

Unified multi-cluster scheduling: improving data center resource utilization

Resource utilization is one of the most important metrics for evaluating a cluster scheduling system. So although we completed containerization in 2019, containerization is not the goal but only a means. Our goal is to bring more benefits to users by switching from the VM technology stack to the container stack, for example by comprehensively reducing users' computing costs.

Improvements to resource utilization were constrained by individual hotspot hosts in the cluster. Once capacity is expanded, a business container may land on a hotspot host, and business performance metrics such as TP95 latency fluctuate, leaving us no choice but to guarantee service quality by adding resource redundancy, as other companies in the industry do. The root cause is that the Kubernetes scheduling engine's allocation method only considers the Request/Limit quota (Kubernetes lets a container set a request value, Request, and a constraint value, Limit, which serve as the user's resource quota for the container), which is static resource allocation. As a result, even though different hosts have the same amount of resources allocated, their real utilization differs greatly because the services they host differ.
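To make the static-allocation problem concrete, here is a small, hypothetical sketch (not Meituan's code; the hosts and numbers are invented): two hosts look identical to a Request-only scheduler, yet their real utilization diverges because the workloads behave differently, producing a hotspot.

```python
# Hypothetical illustration of static (Request-based) allocation.
# Both hosts have the same allocation ratio, but their real CPU
# usage differs widely, so one becomes a hotspot.

def allocation_ratio(host):
    """Fraction of cores reserved via container Requests."""
    return sum(c["request"] for c in host["containers"]) / host["cores"]

def real_utilization(host):
    """Fraction of cores actually in use."""
    return sum(c["used"] for c in host["containers"]) / host["cores"]

host_a = {"cores": 32, "containers": [{"request": 8, "used": 7.5},
                                      {"request": 8, "used": 7.0}]}
host_b = {"cores": 32, "containers": [{"request": 8, "used": 0.5},
                                      {"request": 8, "used": 0.8}]}

# Same allocation ratio (0.5 for both) ...
print(allocation_ratio(host_a), allocation_ratio(host_b))
# ... but host A runs hot while host B sits mostly idle.
print(real_utilization(host_a), real_utilization(host_b))
```

A scheduler that sees only Requests cannot tell these two hosts apart, which is why the load-aware scheduling described below is needed.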

In academia and industry, there are two common approaches to the tension between resource efficiency and application service quality. The first solves the problem from a global perspective through an efficient task scheduler; the second strengthens resource isolation between applications through single-machine resource management. Either way, we must fully grasp the cluster state, so we did three things:

  • Systematically established associations between cluster state, host state, and service state, and, combined with a scheduling simulation platform, implemented predictive scheduling based on hosts' historical load and businesses' real-time load, taking both peak utilization and average utilization into account.
  • Through a self-developed dynamic load balancing system and a cross-cluster rescheduling system, linked cluster scheduling with single-machine scheduling and implemented service-quality guarantee policies for different resource pools according to business classification.
  • After three iterations, implemented our own cluster federation service, which solves the problems of resource contention and state data synchronization, improves scheduling concurrency between clusters, and separates scheduling from computation, implementing cluster mapping, load balancing, and cross-cluster orchestration control (see Figure 3 below).
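The first point, scheduling on predicted load rather than static quotas, can be sketched as a host-scoring function. This is a hypothetical illustration, not Meituan's algorithm: it ranks candidate hosts by a weighted mix of recent average and peak CPU load, preferring cooler, less spiky machines; the weights and sample data are invented.

```python
# Hypothetical load-aware host scoring: combine recent average and
# peak utilization so that both steady load and spikes count.

def score(host, avg_weight=0.6, peak_weight=0.4):
    samples = host["cpu_samples"]  # recent utilization samples, 0..1
    avg = sum(samples) / len(samples)
    peak = max(samples)
    # Lower combined load -> higher score.
    return 1.0 - (avg_weight * avg + peak_weight * peak)

hosts = [
    {"name": "cool",  "cpu_samples": [0.2, 0.25, 0.3]},
    {"name": "spiky", "cpu_samples": [0.2, 0.9, 0.2]},
    {"name": "hot",   "cpu_samples": [0.8, 0.85, 0.9]},
]

best = max(hosts, key=score)
print(best["name"])  # prints "cool"
```

A Request-only scheduler would treat all three hosts alike; the peak term additionally penalizes the spiky host even though its average load is moderate.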

Figure 3: Architecture of cluster federation v3

Version 3 of the cluster federation service (Figure 3) is split into a Proxy layer and a Worker layer, deployed independently:

  • The Proxy layer selects the appropriate cluster for scheduling by combining factors and weights derived from cluster state, and then chooses the appropriate Worker to forward the request to. The Proxy module uses etcd for service registration, leader election, and discovery; the leader node is responsible for pre-claiming tasks during scheduling, while all nodes can serve query tasks.
  • Each Worker handles the query requests of a subset of clusters. When a cluster's tasks are blocked, a corresponding Worker instance can be quickly scaled out to alleviate the problem. When a single cluster is large, it maps to multiple Worker instances, and the Proxy distributes scheduling requests across them, improving scheduling concurrency and reducing each Worker's load.
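The Proxy layer's two decisions described above can be sketched as follows. This is a hypothetical illustration under invented names, weights, and factors (the article does not disclose the real scoring formula): pick a target cluster by a weighted state score, then spread requests across that cluster's Worker instances with a stable hash.

```python
# Hypothetical sketch of the Proxy layer: weighted cluster selection,
# then hash-based sharding of requests across Worker instances.
import hashlib

def pick_cluster(clusters, weights={"free_cores": 0.7, "health": 0.3}):
    """Choose the cluster with the best weighted state score."""
    def cluster_score(c):
        return sum(weights[k] * c[k] for k in weights)
    return max(clusters, key=cluster_score)

def pick_worker(workers, request_id):
    """Deterministic spread of requests across a cluster's Workers."""
    digest = hashlib.md5(request_id.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

clusters = [
    {"name": "bj-1", "free_cores": 0.2, "health": 1.0,
     "workers": ["bj-1-w0", "bj-1-w1"]},
    {"name": "sh-1", "free_cores": 0.8, "health": 1.0,
     "workers": ["sh-1-w0", "sh-1-w1", "sh-1-w2"]},
]

target = pick_cluster(clusters)            # cluster with more headroom
worker = pick_worker(target["workers"], "req-42")
print(target["name"], worker)
```

Hashing on the request ID keeps the distribution deterministic, so retries of the same request land on the same Worker while the overall load stays spread out.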

Finally, through multi-cluster scheduling we moved from a static resource scheduling model to a dynamic one, reducing the proportion of hotspot hosts and of resource fragments while guaranteeing the service quality of high-priority business applications. Average server CPU utilization in the online business clusters rose by 10 percentage points. Average cluster resource utilization is calculated as Sum(nodeA's cores currently in use + nodeB's cores currently in use + ...) / Sum(nodeA's total cores + nodeB's total cores + ...), sampled once a minute and averaged over all of a day's values.
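The utilization formula above can be sketched directly in Python: compute the cluster-wide ratio of in-use cores to total cores for each per-minute sample, then average the day's samples. The node names and numbers here are made up for illustration.

```python
# The article's utilization metric: per-sample cluster ratio
# Sum(used cores) / Sum(total cores), averaged over all samples.

def cluster_utilization(sample):
    """One minute sample: {node: (used_cores, total_cores)}."""
    used = sum(u for u, _ in sample.values())
    total = sum(t for _, t in sample.values())
    return used / total

def daily_average(samples):
    """Average of per-minute cluster utilization over a day."""
    return sum(cluster_utilization(s) for s in samples) / len(samples)

# Two toy minute samples for a two-node cluster.
samples = [
    {"nodeA": (16, 64), "nodeB": (32, 64)},   # 48/128 = 0.375
    {"nodeA": (24, 64), "nodeB": (40, 64)},   # 64/128 = 0.5
]
print(daily_average(samples))  # prints 0.4375
```

Note that the ratio is taken per sample before averaging, so large and small nodes are weighted by their core counts rather than averaged per node.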

Scheduling engine service: empowering cloud-native adoption of PaaS services

The cluster scheduling system not only solves resource scheduling, but also solves how services use computing resources. As the book Software Engineering at Google notes, the cluster scheduling system, as a key component of Compute as a Service, must solve not only resource scheduling (breaking physical machines down into resource dimensions such as CPU and memory) and resource contention (the "noisy neighbors" problem), but also application management (automated instance deployment, environment monitoring, exception handling, guaranteeing the number of service instances, determining the resources a business needs, handling different service types, and so on). In some sense application management matters even more than resource scheduling: it directly affects the efficiency of business development and operations and the effectiveness of disaster recovery, and after all, in the Internet industry labor cost is higher than machine cost.

Containerizing complex stateful applications has always been a hard problem in the industry, because the distributed systems in these scenarios usually maintain their own state machines. When an application scales out or upgrades, keeping existing instances available, and keeping the connectivity between them, is far more complex and difficult than for stateless applications. Although we had containerized stateless services, we had not yet realized the full value of a good cluster scheduling system. To manage computing resources, we must manage the state of services, decouple resources from services, and improve service resilience, and this is exactly what the Kubernetes engine is good at.

Based on Meituan's optimized and customized version of Kubernetes, we built the Meituan Kubernetes Engine service, MKE:

  • Strengthened cluster operations capabilities and improved automated cluster operations, including cluster self-healing, an alerting system, and Event/Log analysis, continuously improving cluster observability.
  • Established key business benchmarks, cooperating deeply with several important PaaS components and rapidly optimizing user pain points such as Sidecar upgrade management, Operator grayscale iteration, and alert separation, meeting users' demands.
  • Continuously improved the product experience. Beyond ongoing optimization of the Kubernetes engine, in addition to supporting user-defined Operators, we also provide a general scheduling and orchestration framework (see Figure 4) that helps users access MKE at lower cost and reap the technology dividend.

Figure 4: Scheduling and orchestration framework of the Meituan Kubernetes Engine service

In the process of promoting cloud-native adoption, a widely asked question was: what is the difference between managing stateful applications the cloud-native way, based on Kubernetes, and building your own management platform as before?

To answer this question, we have to start from the root of the problem: operations.

  • Being based on Kubernetes means the system forms a closed loop; there is no need to worry about data inconsistency between two systems.
  • Abnormal situations can be responded to within milliseconds, reducing the system's RTO (Recovery Time Objective: the maximum tolerable time for which a business can stop serving, i.e., the shortest period from the occurrence of a disaster to the restoration of the business system's service functions).
  • The complexity of the operations system is also reduced, and services achieve automated disaster recovery. Beyond the service itself, the configuration and state data it depends on can be recovered along with it.
  • Compared with the previous "chimney-style" management platforms built per PaaS component, common capabilities can sink into the engine service, reducing development and maintenance costs. And by relying on the engine service, the underlying heterogeneous environments can be shielded, enabling service management across data centers and multi-cloud environments.

Future outlook: building a cloud-native operating system

We believe that cluster management in the cloud-native era should fully transform from managing hardware and resources into an application-centric cloud-native operating system. With that as the goal, Meituan's cluster scheduling system still needs to work in the following directions:

  • Application-link delivery management. As service scale and link complexity grow, the operational complexity of the PaaS components and underlying infrastructure that a business depends on has long exceeded common understanding, and it is even harder for newcomers who have just taken over a project. We therefore need to let businesses deliver services through declarative configuration and achieve self-service operations, providing a better operations experience, improving application availability and observability, and reducing the business's burden of managing underlying resources.
  • Edge computing solutions. As Meituan's business scenarios keep getting richer, business demand for edge computing nodes is growing much faster than expected. We will draw on industry best practices to form an edge solution suitable for Meituan, provide edge-node management capabilities to the services that need them as soon as possible, and achieve cloud-edge collaboration.
  • Online-offline colocation capability building. There is an upper limit to how far the resource utilization of online business clusters can be raised. According to the 2019 data center cluster data disclosed in Google's paper "Borg: the Next Generation", with offline tasks removed, the resource utilization of online tasks is only around 30%, which shows that pushing further carries greater risk for a low return on investment. Going forward, Meituan's cluster scheduling system will keep exploring online-offline colocation. However, because Meituan's offline data centers are relatively independent, our implementation path will differ from the industry's common schemes: we will start by colocating online services with near-real-time tasks, complete the underlying capability building, and then explore colocating online and offline tasks.

Summary

In designing Meituan's cluster scheduling system, we followed the principle of fitness for purpose overall: while meeting the basic needs of the business, we gradually improve the architecture, performance, and features after ensuring system stability. Accordingly, we chose:

  • Between system throughput and scheduling quality, to prioritize meeting the business's throughput requirements rather than pursuing the quality of any single scheduling decision, correcting instead through rescheduling.
  • Between architectural complexity and scalability, to reduce coupling between the system's modules and lower system complexity; extended features must be degradable.
  • Between reliability and single-cluster size, to control single-cluster size through unified multi-cluster scheduling, ensuring system reliability and reducing the blast radius.

In the future, we will continue to iterate on Meituan's cluster scheduling system following the same logic, fully transforming it into an application-centric cloud-native operating system.

About the author

Tan Lin, from Meituan's Basic R&D Platform / Basic Technology Department.


| This article was produced by the Meituan technical team and its copyright belongs to Meituan. You are welcome to reprint or use it for non-commercial purposes such as sharing and communication, indicating "Content reproduced from the Meituan technical team". It may not be commercially reprinted or used without permission; for any commercial use, please email [email protected] to apply for authorization.
