Alibaba stability fault emergency handling process
2022-06-25 12:56:00 【Wenxiaowu】
One summary
Although we can build a stability system to prevent production failures, it is impossible to eliminate every risk. When a stability risk does materialize, a well-defined process is essential for coordinating people quickly and shortening the fault duration.
Fortunately, if we start thinking about this now, there is still enough time to design each step and have team members rehearse it thoroughly, so that a well-drilled team buys valuable time for fault recovery.
Two Structured problem solving
There are many structured approaches to problem solving, especially those used by professional consulting firms, and they are worth learning from. Applied to production faults in a software system, a typical structured problem-solving process has the following steps:
Problem definition: Clearly describe the symptoms and the impact, quantifying the impact as far as possible. For example: starting at xx:xx, service xx became abnormal and its success rate dropped from 99% to 90%.
Temporary fix: Apply a temporary solution based on a prepared plan and record the result, for example executing a plan whose conditions are met, or rolling back immediately when an anomaly appears during a release.
Root cause analysis: Using the known facts, find the root cause of the problem.
Develop a solution.
Implement the solution.
Standardize the solution: Standardize the solution and take it online, so that similar problems are avoided.
In a production environment, when a sudden anomaly occurs, the first priority is to restore service quickly, so this article focuses on the first two steps of the process above.
In addition, communication runs through the entire problem-solving process; every step requires thorough communication.
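The problem definition step can be made concrete with a small record that forces the impact to be quantified. This is only a minimal sketch: the `ProblemStatement` class, its field names, and the sample service name `order-service` are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class ProblemStatement:
    """Hypothetical record for the problem definition step."""
    started_at: str          # when the fault started, e.g. "14:05"
    service: str             # name of the affected service
    baseline_success: float  # success rate before the fault, in 0..1
    current_success: float   # success rate during the fault, in 0..1

    def drop_points(self) -> float:
        # Absolute drop in percentage points, rounded to hide float noise.
        return round((self.baseline_success - self.current_success) * 100, 2)

    def summary(self) -> str:
        # One-line, quantified description suitable for a briefing.
        return (
            f"Since {self.started_at}, {self.service} is abnormal: "
            f"success rate fell from {self.baseline_success:.0%} "
            f"to {self.current_success:.0%} "
            f"({self.drop_points():g} percentage points)."
        )


# Sample values are made up for illustration.
statement = ProblemStatement("14:05", "order-service", 0.99, 0.90)
print(statement.summary())
```

Keeping the baseline and the current value side by side makes it harder to report a vague "service is slow" style description.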
Three Key roles
Sudden anomalies differ from one another, so it is hard to define a fully unified, fine-grained standard process. However, a few key roles can be agreed on in advance, with each role's responsibilities and key actions defined, to make collaboration more efficient.
The main roles are:
Commander: Organizes and coordinates rapid fault recovery, and reports progress in the fault group (the chat group set up for the incident).
Correspondent: Collects and records key information, and communicates with related teams through the fault group and other channels.
Quick recovery leader: Decides on and executes plans based on the fault symptoms and the monitoring dashboards.
Problem diagnosis leader: Locates the root cause of the failure. This role is crucial when quick recovery does not work.
Each role is described in detail below.
1 Commander
Commander selection
First responder: By default, the technician who first receives the alarm, complaint, or feedback acts as commander. The first responder judges whether they can take command, or whether there is a plan they know well and have fully rehearsed. If so, they restore service immediately; otherwise they contact the full-time commander to take over. Until the full-time commander takes over, the first responder remains the default commander.
Full-time commander: The team leader (TL) or the team's stability owner is the best commander for most risks. Once the emergency team has established contact, command can be handed over to the TL or the stability owner within the team.
Higher-level TLs: When the fault duration and severity keep rising, command is escalated as the situation requires, and a higher-level TL takes over as commander in order to bring in more resources.
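The selection rules above can be read as a small decision function. The sketch below is one possible reading, not the article's tooling: the function name `pick_commander` and all of its parameters are hypothetical, and the article gives no numeric threshold for escalation, so escalation is passed in as a plain flag.

```python
from typing import Optional


def pick_commander(
    first_responder: str,
    can_command: bool,
    full_time_commander: Optional[str] = None,
    higher_level_tl: Optional[str] = None,
    escalated: bool = False,
) -> str:
    """Return who should command right now (illustrative only)."""
    # Escalation: a higher-level TL takes over once the fault keeps growing.
    if escalated and higher_level_tl is not None:
        return higher_level_tl
    # The first responder keeps command if they can handle it themselves.
    if can_command:
        return first_responder
    # Otherwise hand over once the full-time commander has been reached.
    if full_time_commander is not None:
        return full_time_commander
    # Until someone takes over, the first responder is the default commander.
    return first_responder


# Names are placeholders.
print(pick_commander("alice", can_command=False, full_time_commander="bob"))
```

The last branch encodes the rule that command never sits empty: someone is always the commander, even while the handover is pending.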
The commander's key actions
Identify the problem: Determine the symptoms and impact of the incident.
Identify roles: Determine the key roles for this incident, including the correspondent, the quick recovery leader, and the problem diagnosis leader.
Communicate upward: Make key people in the organization aware of the problem, so that more people and resources can be mobilized quickly when needed.
Coordinate: Help the quick recovery leader and the problem diagnosis leader solve the problem by providing support such as information and domain experts.
Requirements for commanders
Start: Identify the people involved and set up the emergency team through a video conference, a fault group, or similar means.
Early stage: Keep a close eye on the quick recovery leader's progress and prioritize landing a quick recovery over analyzing the root cause. When quick recovery does not take effect, keep exploring other possible means of rapid recovery, such as rolling back recent changes. In past cases where the fault duration missed the 1-5-10 target, the commander was in most cases busy analyzing the root cause and missed the best window for quick recovery.
Middle stage: If many attempts have failed to restore service, the focus gradually shifts to the problem diagnosis leader and to finding the root cause. A fault that has not recovered by this stage is usually a major one, and the 1-5-10 target is essentially out of reach.
Late stage: Have the team keep observing to confirm that the problem does not recur, and organize the follow-up and remediation work.
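The article refers to the 1-5-10 target without defining it. The sketch below assumes the commonly cited reading (detect within 1 minute, locate within 5 minutes, recover within 10 minutes) and checks an incident timeline against it. The function name `check_1_5_10`, the `TARGETS` table, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumed targets in minutes; the article itself does not spell them out.
TARGETS = {"detected": 1, "located": 5, "recovered": 10}


def check_1_5_10(started: datetime, events: dict) -> dict:
    """Map each milestone to True if it met its target, else False.

    A milestone missing from `events` counts as not met.
    """
    result = {}
    for name, limit in TARGETS.items():
        when = events.get(name)
        result[name] = (
            when is not None and when - started <= timedelta(minutes=limit)
        )
    return result


# Made-up timeline: detection and location are on time, recovery is late.
start = datetime(2022, 6, 25, 12, 0, 0)
timeline = {
    "detected": datetime(2022, 6, 25, 12, 0, 40),
    "located": datetime(2022, 6, 25, 12, 4, 0),
    "recovered": datetime(2022, 6, 25, 12, 13, 0),
}
print(check_1_5_10(start, timeline))
```

A check like this is mainly useful in the post-incident review, to see which milestone slipped first.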
2 Correspondent
If the fault cannot be recovered through a plan right away, the correspondent becomes second in importance only to the commander. Collecting and organizing information efficiently helps the whole emergency team find a solution faster.
Correspondent selection
Full-time correspondent: Someone in the team with a reasonable grounding in stability work who is usually not the first candidate for quick recovery leader or problem diagnosis leader.
Other team members who are not involved in problem diagnosis or quick recovery.
Correspondent key actions
Continuous problem identification and notification: The symptoms and the scope of impact change over time, so regular briefings are needed (through the fault group, conference calls, and other channels). Early on, send a briefing every 5 minutes; later the interval can be stretched to 15 or 30 minutes.
Information gathering: Create a single document for the problem from a standard template, put the link in the group announcement and the fault group, and keep updating it with the key information collected, so that people who join the emergency team later can quickly pick up the context.
Collect public opinion: This overlaps with information gathering but is called out separately because it is easily overlooked. Engineers tend to get absorbed in technical metrics and pay too little attention to public sentiment.
External communication: Contact the head of customer service and work with the customer service team to reassure customers.
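The briefing cadence described above (every 5 minutes at first, then 15, then 30) can be sketched as a simple schedule. The article does not say when to switch intervals, so the 30-minute and 120-minute cut-offs below are purely hypothetical, as are the function names.

```python
def briefing_interval(minutes_elapsed: int) -> int:
    """Minutes to wait before the next briefing.

    The 30- and 120-minute cut-offs are assumptions, not from the article.
    """
    if minutes_elapsed < 30:
        return 5
    if minutes_elapsed < 120:
        return 15
    return 30


def briefing_times(total_minutes: int) -> list:
    """Offsets (in minutes from fault start) at which to send a briefing."""
    times = []
    t = 0
    while t <= total_minutes:
        times.append(t)
        t += briefing_interval(t)
    return times


print(briefing_times(60))  # [0, 5, 10, 15, 20, 25, 30, 45, 60]
```

Computing the schedule up front also gives the correspondent the "next briefing time" to announce in each message.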
Key requirements for correspondents
Be quick in the early stage: Collect key information quickly. During the golden 10 minutes, update the information every minute and keep everyone informed.
Timely notification: A good briefing announces when the next one will come. For example: problem xx is being handled by yy, the current status is zzz, and the next briefing will be in xx minutes. With reliable and timely briefings, people who care about the issue only need to watch the bulletins, which avoids uninformed interference that slows the emergency team's response.
Contact external support: When external dependencies such as OSS or MySQL are involved, find the external contacts through the commander, the application owner, or other channels, bring them into the emergency team promptly, and give them the problem context.
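The briefing format described under "Timely notification" (problem, handler, current status, time of the next briefing) can be captured in a small template. This is a minimal sketch; the function name `format_briefing`, its parameters, and the sample values are hypothetical.

```python
def format_briefing(problem: str, handler: str, status: str,
                    next_in_minutes: int) -> str:
    """Build a briefing that always announces when the next one will come."""
    if next_in_minutes <= 0:
        # A briefing without a future follow-up time defeats its purpose.
        raise ValueError("next_in_minutes must be positive")
    return (
        f"[Briefing] Problem: {problem}. Handled by: {handler}. "
        f"Current status: {status}. "
        f"Next briefing in {next_in_minutes} minutes."
    )


# Sample values are made up for illustration.
message = format_briefing(
    problem="order-service success rate dropped to 90%",
    handler="quick recovery leader",
    status="rolling back the latest release",
    next_in_minutes=5,
)
print(message)
```

Making the next briefing time a required argument means a briefing cannot be sent without it.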
3 Quick recovery leader
Our expectation is that every risk can be resolved through quick recovery. If not, the first move is still to explore other feasible quick recovery options (such as rollback).
Quick recovery leader selection
Application owner or core team member.
Team members who have executed the application's plans: We encourage teams to cross-execute plans, so that when an application owner cannot be reached, other members can still help recover through the plan.
Key actions of the quick recovery leader
Execute the quick recovery plan: Based on the symptoms, find the plan's dashboard and execute the corresponding plan according to the monitoring metrics on it.
Develop other candidate recovery plans: When the known quick recovery plan does not take effect, analyze recent changes and other possible factors, and try to recover by rollback or similar means. When necessary, ask the commander to coordinate more people for support.
Key requirements of the quick recovery leader
Treat restoring service as the first priority, and hand root cause analysis to the problem diagnosis leader.
If the established plan does not recover the service quickly, keep exploring other possible means of recovery.
4 Problem diagnosis leader
Ideally this person is not needed during the 1-5-10 recovery phase. But when quick recovery fails and no effective means of restoring service is available in the short term, the root cause can only be found, and a solution developed, by the problem diagnosis leader.
Problem diagnosis leader selection
Application owner or core team member: People who know the relevant code are best suited to problem diagnosis.
Domain expert: For network problems, for example, experts in that field can be found within the group to assist.
Key requirements of the problem diagnosis leader
Based on the information collected, find the root cause of the problem.
Ask the commander and the correspondent to invite external support into the emergency team.
Four Closing
Fault emergency response is the last chance to keep the system highly available. An unprofessional performance at this stage means that the last line of defense for stability has failed completely. So, just like plan drills, emergency response needs deliberate training. Opportunities to practice include:
Real fault scenarios.
Red-blue confrontation drills: In coordination with SRE, simulate a fault as a surprise attack.
General alarm escalation: The TL or the stability owner randomly picks an SMS alarm, artificially escalates it to a fault, and enters the fault emergency response process.