Meitu's Stability and Operations Assurance Scheme

2022-06-22 19:59:00 Huawei Cloud

Author: Wang Guansheng

      1. Tell us about your work experience at Meitu
      I am currently a senior technical director at Meitu, mainly responsible for smart office, security, IT, and operations. Operations is further divided into database operations, SRE, DevOps, and other areas. Before joining Meitu, I worked at Sina Weibo, where I witnessed Weibo's server fleet grow into the tens of thousands.
      I have been in the operations field for more than ten years and have watched the technology iterate and evolve continuously. We have also invested a lot of effort to keep individuals and the team aligned with the industry's technology at all times.
      My time at Meitu can be roughly divided into three stages:
      Stage one: 2016 to 2018. After Meitu went public in 2016, the company's infrastructure was still in the IDC era and the level of automation was still relatively low. The main work was filling in the monitoring system and building operations tooling to ensure efficient delivery and operation of the business.
      Stage two: 2018 to 2019. In addition to improving the operations system, the main work was driving the construction of the container platform, completing the containerization of 95% of online business.
      Stage three: 2019 to present. Starting in 2019, public cloud had become the general trend. We followed it, began a full migration to the cloud, and gave up the 7 IDCs we had been running ourselves. By mid-2020, the migration of all business had been completed.

      2. Meitu now has more than 200 million monthly active users. As the person responsible for platform stability, what kind of pressure do you feel?
      Meitu's user base is indeed not small, and there are many product lines, divided into business lines, commercialization and advertising lines, labs, and the middle platform. For example, the Meitu photo app and BeautyCam that people often see all belong to our ToC business, and there is also a lot of ToB business.
      As for the pressure of stability challenges, at present it is actually manageable. After the construction work of recent years, the business failure rate has dropped significantly and stability has improved greatly. The current pressure comes mainly from uncertain black swan events, and from the on-call and rapid-handling mechanisms at the emergency response level.
      Back to stability, there are two aspects, system construction and infrastructure changes:
      System construction: On the R&D side, practices around code, architecture, chaos engineering, and full-link load testing are well implemented. On the operations side, capacity assessment, elastic scaling, the full-link monitoring system, on-call and emergency response, and the fault management system are also relatively mature.
      Infrastructure changes: Before about 2019, in the self-built IDC stage, server selection and hosting, network design and construction, development of infrastructure-related tooling, and middleware development all required a large investment of manpower. There were often many tool systems, but they were not easy to use. The user experience in particular was poor, and it was hard to establish good API standards between systems on which higher-level automation could be built. When the cloud trend arrived, we decided firmly to embrace the cloud. After starting full cloud deployment, the IDC infrastructure, part of the middleware, and some of the data components were basically handed over to the public cloud vendor. We now focus more on system construction related to the upper-layer business, follow industry directions such as cloud native, DevOps, SRE, and AIOps, carry out internal staff transformation, and push the refactoring of unreasonable legacy systems. Automation and stability have improved greatly.
      3. What does Meitu's operations assurance system look like? How do you provide users with stable and reliable service?
      Operations covers a lot of ground. For the operations assurance system, everything starts from faults and problems, from which a reasonable construction plan is derived. Meitu has a complete system for quantitatively evaluating every business problem and failure, and the results feed into the performance evaluation of both R&D and operations. The figure below gives a rough picture.

44.PNG

      Back to the overall assurance system: we work through the stages of fault prevention, detection, localization, recovery, and improvement, and derive what should be built at each stage. The whole picture is shown in the figure:

45.PNG
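
      The detection, localization, and recovery stages lend themselves to simple timing measurements. As a minimal sketch (not Meitu's actual tooling; the `Incident` record and its field names are hypothetical), one could split each incident into the time spent per stage and see which stage dominates:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the fault began affecting users
    detected: datetime   # when monitoring or on-call noticed it
    located: datetime    # when the cause was identified
    recovered: datetime  # when service was restored


def stage_durations(incident: Incident) -> dict[str, timedelta]:
    """Split one incident into the time spent in each response stage."""
    return {
        "detect": incident.detected - incident.started,
        "locate": incident.located - incident.detected,
        "recover": incident.recovered - incident.located,
    }


def slowest_stage(incidents: list[Incident]) -> str:
    """Return the stage that consumed the most total time across incidents."""
    totals = {"detect": timedelta(), "locate": timedelta(), "recover": timedelta()}
    for incident in incidents:
        for stage, spent in stage_durations(incident).items():
            totals[stage] += spent
    return max(totals, key=totals.get)
```

      If `slowest_stage` returned "locate", for instance, that would suggest investing in localization tooling before speeding up recovery.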

      As mentioned above, the main pressure on stability today comes from uncertain black swan events. Our SRE team has also built a mature emergency response platform that brings full-link monitoring dashboards, disaster recovery, contingency plan orchestration, and other operations interventions together on one platform for one-click operation, and we run drills from time to time. For major failures and event support, a war room can be spun up quickly for timely response and intervention. Here is a look at our platform.

46.PNG

      4. What do you see as the biggest characteristics and strengths of Meitu's operations team?
      1) Keep the team small and highly skilled. For example, our DBAs maintain more than 10 kinds of data resources and middleware, including RDS, Redis, Memcached, Kafka, and MongoDB. After moving to the cloud, the team has only 3 people (including the lead), and it also has to develop a unified DBaaS tool based on cloud resources.
      2) The team stays alert to benchmarking against the industry's advanced technology and reflects on what was built at each stage. And whatever the role, everyone must be able to devote at least 30% of their working hours to coding.

      5. Stable operations quality is an important pillar of Meitu's development. What expectations and vision do you have for it? What plans does Meitu have for its use of cloud products in the future?
      After the business moved to the cloud, overall stability and operations quality depend on two things: the technical and engineering accumulation of the public cloud vendor, and that of our own team. Neither can be missing.
      For cloud vendors, our consistent thinking is to grow together with the vendor while keeping the service under control, so that it gradually evolves into a high-quality cloud.
      For ourselves, what we need to do next is go further with cloud-native adoption, and we will use mature PaaS, SaaS, and even Serverless products. As everyone knows, for infrastructure, the maturity of cloud products can to some extent replace certain traditional in-house capabilities. That is a good thing. Although it will force the operations model to change, we are not anxious about it. Our thinking is open, and we remain responsible for the final delivery and stability of the business.
      I have made a table of our use of cloud products and the degree to which they replace existing capabilities, which I can share here:

47.PNG

      Beyond these, we will focus on cloud native 2.0, IaC, GitOps, and so on, to further improve business stability and delivery efficiency.

      6. Huawei Cloud SRE has proposed a “deterministic operations” approach, which includes product availability improvement, dynamic risk control, AIOps tooling, and other systems. How do you understand it?
      I think the “deterministic operations” approach is a new concept that Huawei Cloud SRE has distilled from its own experience. It reflects Huawei Cloud's values well: everything is customer-centric, helping customers better handle major events and emergencies, such as support for key events like the Spring Festival and Double 11.
      Through risk identification and assessment, capacity assurance, real-time monitoring, and other measures, Huawei Cloud builds competitiveness with sustained resilience, ensures its own stable operation, and gives users a better service experience. Every year, support for major events has achieved zero faults.
      On the other hand, I personally think Huawei Cloud's “deterministic operations” approach actually coincides with our own overall stability and operations assurance scheme. The goal of both is to be responsible for overall business stability; only the respective interfaces differ.

      7. What innovations do you think Meitu has made in intelligent operations? What can you share with the industry?
      After moving to the cloud, I think the goals of operations have become clearer, along three dimensions: faults and stability, efficiency, and cost. Based on public cloud API capabilities, we can build a higher-level tooling system and focus on the business layer. We should dare to abandon the traditional operations system and be willing to negate ourselves in order to have a better future.
      On this point, our guidance was timely. During the cloud migration, colleagues in different positions were guided accordingly: those who needed to change roles did so, and those whose coding skills fell short were brought up to speed promptly. We then used limited manpower to build more complete platforms and make the whole delivery chain self-service.
      Here is how we did it:
      Organizational structure: retain the three roles of DBA, SRE, and DevOps, with a shift in mindset, to better serve the business.
      Tool construction: we no longer need to develop all kinds of siloed systems. Starting from an all-in-one or domain-partitioned approach, we develop more general, easier-to-use systems and improve the user experience. After all, if you want self-service, the tools must first be highly deliverable and easy to use.
      To this end, with the CMDB at the core, we developed a simplified CMP system. With FinOps at the core, we developed the cost decision system MTCC, which systematizes all technology-related costs, internal and external, and turns them into metrics. These costs are also allocated in detail to the profit centers of each product line, forming a definite ROI that makes it easier to optimize cost operations and make better decisions. In addition, the DBAs focus on building DBaaS and SRE focuses on building the emergency response platform. On monitoring, we focus on big data capabilities to create a more unified observability platform and, based on AIOps practice, do prediction and root cause analysis for some scenarios.
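
      The interview does not describe how MTCC allocates costs internally. As a rough sketch of the general idea only, assuming shared costs are split across product lines in proportion to some usage measure (the function names, product line names, and figures below are hypothetical), allocation and ROI could look like this:

```python
def allocate_shared_cost(shared_cost: float, usage: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across product lines in proportion to their usage."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {line: shared_cost * used / total for line, used in usage.items()}


def roi(revenue: float, cost: float) -> float:
    """Return on investment, defined here as (revenue - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) / cost


if __name__ == "__main__":
    # Made-up numbers: a shared bill of 1000 split by a 30/70 usage ratio.
    shares = allocate_shared_cost(1000.0, {"line_a": 30.0, "line_b": 70.0})
    print(shares)
    print(roi(revenue=1500.0, cost=shares["line_b"]))
```

      Proportional allocation is only one possible policy; a real system might instead tag resources directly to a profit center and reserve proportional splitting for untagged shared costs.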
      What we discussed today is fairly high-level; each topic could be explored in detail when there is a chance.

Copyright notice
This article was created by [Huawei Cloud]. Please include the original link when reprinting. Thank you.
https://yzsam.com/2022/173/202206221832421980.html