Meitu's stability and O&M assurance approach
2022-06-22 19:59:00 [Huawei Cloud]
Author: Wang Guansheng

1. Tell us about your work experience at Meitu
I am now a senior technical director at Meitu, mainly responsible for smart office, security, IT, and O&M. O&M is further divided into database operations, SRE, DevOps, and other areas. Before joining Meitu I worked at Sina Weibo, where I watched the server fleet grow into the tens of thousands of machines.
I have been in the O&M field for more than ten years and have seen the technology iterate continuously. We have also invested a lot of energy in keeping both individuals and the team aligned with the industry's technology.
My time at Meitu can be roughly divided into three phases:
Phase one, 2016 to 2018: After Meitu went public in 2016, the company's infrastructure was still in the IDC era and the level of automation was relatively low. The main work was filling in the monitoring system and building O&M tooling to ensure efficient delivery and operation of the business.
Phase two, 2018 to 2019: Besides continuing to improve the O&M system, the main effort was building the container platform. 95% of online business was containerized.
Phase three, 2019 to present: Starting in 2019, with public cloud as the general trend, we followed it and began a full migration to the cloud, giving up our 7 self-operated IDCs. The migration of all business was completed in mid-2020.
2. Meitu now has more than 200 million monthly users. As the person responsible for platform stability, what kind of pressure do you feel?
Meitu's user base is indeed not small, and there are many product lines, grouped into business lines, commercialization and advertising lines, labs, and the middle platform. For example, the apps people often see, such as Meitu and BeautyCam, belong to our ToC business, and there is also a good deal of ToB business.
As for the pressure from stability challenges, it is actually manageable at present. After the work of recent years, the business failure rate has dropped significantly and stability has improved greatly. The current pressure comes mainly from uncertain black swan events, and from the on-call and rapid handling mechanisms at the emergency response level.
Coming back to stability, there are two aspects, system construction and infrastructure change:
System construction: On the R&D side, code, architecture, chaos engineering, and full-link stress testing are all well implemented. On the O&M side, capacity assessment, elastic scaling, the full-link monitoring system, on-call and emergency response, and the fault management system are also fairly mature.
Infrastructure change: Before roughly 2019, in the self-built IDC stage, server selection and hosting, network design and construction, development of infrastructure tooling, and development of middleware all required a large investment of people. There were often many tool systems, and they were not easy to use. The user experience in particular was poor, and it was hard to form good API standards between systems to support automation at a higher level. When the cloud trend arrived, we decided firmly to embrace it. After starting full cloud deployment, IDC infrastructure, part of the middleware, and some data components were essentially handed over to the public cloud vendor. We now pay more attention to system construction tied to the upper-layer business, benchmark against industry practice in cloud native, DevOps, SRE, and AIOps, carry out internal staff transformation, and push the refactoring of unreasonable legacy systems. Automation and stability have improved greatly as a result.
3. What is Meitu's O&M assurance system? How do you provide users with stable and reliable service?
The O&M field covers a lot. For the O&M assurance system, everything starts from faults and problems, from which a reasonable construction plan is derived. Meitu has a complete system for quantitatively evaluating every business problem and fault, and it feeds into the performance evaluation of both R&D and O&M. You can get a rough idea from the figure below.

Coming back to the overall assurance system: we work through the stages of a fault (prevention, detection, localization, recovery, and improvement) and derive what should be built at each stage. The figure gives the overall picture.
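As a rough illustration of how these stages can be measured, the sketch below records one timestamp per stage for a single incident and reports which stage took longest. It is not Meitu's actual tooling: the `Incident` class, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # All field names are hypothetical; they mirror the stages named in the text.
    started: datetime    # the fault begins
    detected: datetime   # monitoring or a user report notices it
    located: datetime    # the faulty component or root cause is identified
    recovered: datetime  # service is restored

    def stage_durations(self) -> dict[str, timedelta]:
        """Time spent in each stage between fault start and recovery."""
        return {
            "detect": self.detected - self.started,
            "locate": self.located - self.detected,
            "recover": self.recovered - self.located,
        }

    def slowest_stage(self) -> str:
        """Stage that took longest, a first candidate for the improvement stage."""
        durations = self.stage_durations()
        return max(durations, key=durations.get)


# Made-up timestamps for a single incident.
incident = Incident(
    started=datetime(2022, 6, 1, 10, 0),
    detected=datetime(2022, 6, 1, 10, 4),
    located=datetime(2022, 6, 1, 10, 19),
    recovered=datetime(2022, 6, 1, 10, 25),
)
print(incident.slowest_stage())
```

Splitting the total outage time by stage is one way to decide where construction effort pays off: a long "detect" interval points at monitoring, a long "locate" interval at tracing and observability, and a long "recover" interval at contingency plans.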

As mentioned above, the main pressure on stability now comes from uncertain black swan events. Our SRE team has built a fairly complete emergency response platform that brings the full-link monitoring dashboard, disaster recovery, contingency plans and orchestration, and other O&M interventions together on one platform for one-click operation, and we run drills from time to time. For major faults and event support, a war room can be spun up quickly for timely response and intervention. You can take a look at our platform.

4. What do you see as the biggest characteristics and strengths of Meitu's O&M team?
1) We keep the team small and lean. For example, our DBAs maintain more than 10 kinds of data resources and middleware, including RDS, Redis, Memcached, Kafka, and MongoDB. After the move to the cloud, only 3 people (including the lead) are assigned to this, and they also have to develop a unified DBaaS tool based on cloud resources.
2) The team always stays alert to, and benchmarks against, the industry's advanced technology, and reflects on what was built at each stage. Whatever the role, everyone must be able to spend at least 30% of their working hours on coding.
5. Stable O&M quality is an important support for Meitu's development. What expectations and vision do you have for it? What are Meitu's plans for using cloud products in the future?
After the business moved to the cloud, overall stability and O&M quality depend on two things: the technology and accumulated experience of the public cloud vendor, and those of our own team. Neither can be missing.
For the cloud vendor, our consistent thinking is to make progress together with the vendor, with the service kept under control, so that it gradually evolves into a high-quality cloud.
For ourselves, what we need to do next is push further toward cloud native and use mature PaaS, SaaS, and even Serverless products. As everyone knows, on the infrastructure side the maturity of cloud products can replace some traditional in-house capabilities to a certain extent, which is a good thing. This will force the O&M model to change, but we are not anxious about it. Our thinking is open: what matters is being responsible for the delivery and stability of the final business.
I have put together a table on the extent of our cloud product usage and the degree of substitution, which I can share here:

Beyond these, we will focus on cloud native 2.0, IaC, GitOps, and so on to further improve business stability and delivery efficiency.
6. Huawei Cloud SRE has proposed a "deterministic O&M" approach, covering product availability improvement, dynamic risk control, AIOps tooling, and other systems. How do you understand it?
I see the "deterministic O&M" approach as a new concept that Huawei Cloud SRE has distilled from its own experience. It reflects Huawei Cloud's values well: everything is customer-centric, helping customers better handle major events and emergencies, such as support for key events like the Spring Festival and Double 11.
Through measures such as risk discovery and assessment, capacity assurance, and real-time monitoring, Huawei Cloud builds competitiveness on sustained resilience, ensures its own stable operation, and gives users a better service experience. Every year, when we run assurance for major events, we achieve zero faults.
On the other hand, I personally think Huawei Cloud's "deterministic O&M" approach actually coincides with our own overall stability and O&M assurance approach. The goal of both is to be responsible for overall business stability; only the respective interfaces differ.
7. What innovations do you think Meitu has made in intelligent O&M? What can you share with the industry?
After moving to the cloud, I think the goals of O&M are clearer. Across the three dimensions of faults and stability, efficiency, and cost, we can build on the public cloud's API capabilities to create tooling at a higher level that focuses on the business. We dare to abandon the traditional O&M system and are able to negate ourselves in order to have a better future.
On this point, our guidance was timely. During the cloud migration, colleagues in different roles were guided accordingly: those who needed to transition did so, and where coding ability could not be raised in time, we used limited manpower to build a more complete platform so that the whole delivery chain became self-service.
Here is how:
Organizational structure: We retained three roles, DBA, SRE, and DevOps, and changed their mindset so that they serve the business better.
Tool construction: We no longer need to develop all kinds of siloed ("chimney") systems. Following an all-in-one or domain-segmented approach, we develop more general and easier-to-use systems and improve the user experience. After all, self-service requires, first of all, that the tools deliver well and are easy to use.
To this end, with the CMDB at the core, we developed a simplified CMP system. With FinOps at the core, we developed the cost decision system MTCC, which systematizes all internal and external technology-related costs and turns them into metrics. These costs are also allocated at a fine granularity to the profit center of each product line, yielding a clear ROI that makes it easier to optimize cost operations and make better decisions. Beyond that, the DBAs focus on building DBaaS and SRE focuses on building the emergency response platform. On monitoring, we focus on big data capabilities to create a more unified observability platform and, based on AIOps practice, do prediction and root cause analysis for some scenarios.
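To make the cost allocation idea concrete, here is a minimal sketch of splitting a shared technology cost across product lines in proportion to usage and then computing a per-line ROI. It does not describe how MTCC actually works: the proportional-by-usage rule, the function names, the product line names, and every number are assumptions chosen for illustration.

```python
def allocate_shared_cost(shared_cost: float, usage: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across product lines in proportion to their usage."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {line: shared_cost * amount / total for line, amount in usage.items()}


def roi(revenue: float, cost: float) -> float:
    """Return on investment, defined here as (revenue - cost) / cost."""
    return (revenue - cost) / cost


# Hypothetical inputs: usage units, directly attributable cost, and revenue per line.
usage = {"line_a": 600.0, "line_b": 300.0, "line_c": 100.0}
direct_cost = {"line_a": 50.0, "line_b": 20.0, "line_c": 10.0}
revenue = {"line_a": 1300.0, "line_b": 480.0, "line_c": 99.0}

shared = allocate_shared_cost(1000.0, usage)
total_cost = {line: direct_cost[line] + shared[line] for line in usage}
roi_by_line = {line: roi(revenue[line], total_cost[line]) for line in usage}

for line in usage:
    print(line, total_cost[line], round(roi_by_line[line], 2))
```

Proportional allocation is only one possible rule. A real system might instead allocate by tagged resources, by request volume, or by negotiated fixed shares, and the choice of rule changes each line's ROI.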
What we have discussed today is fairly high-level. Each topic could be explored in detail at another opportunity.