Alibaba's emergency handling process for stability faults

2022-06-25 12:56:00 Wenxiaowu

One   Overview

A well-built stability system helps prevent production failures, but no system can eliminate every risk. When a stability risk does materialize, a well-defined process is essential for coordinating people quickly and shortening the duration of the fault.

Fortunately, if we start thinking about this now, there is still enough time to design each step and to let team members rehearse it thoroughly, so that a well-drilled team can win valuable time for fault recovery.

Two   Structured problem solving

Many structured approaches to problem solving exist, especially among professional consulting firms, and these processes are worth learning from. Applied to production faults in software systems, a typical structured problem-solving process has the following steps:

  • Problem definition: Clearly describe the symptoms and the impact, quantifying the impact as far as possible. For example: starting at xx:xx, service xx became abnormal and its success rate dropped from 99% to 90%.

  • Temporary fix: Apply a temporary solution based on the contingency plans and record the result. This includes executing a contingency plan whose conditions are met, or rolling back immediately when an anomaly appears during a release.

  • Root cause analysis: Using the known factors, find the root cause of the problem.

  • Develop a solution.

  • Implement the solution.

  • Standardize the solution: Standardize the solution and roll it out so that similar problems do not recur.
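A quantified problem definition is easier to produce under pressure when its fields are fixed in advance. The following is a minimal Python sketch of such a record, not part of the original process; the class name `ProblemStatement`, its fields, and the sample service name are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ProblemStatement:
    """A quantified problem definition (all field names are hypothetical)."""

    started_at: str   # when the anomaly began, e.g. "14:05"
    service: str      # affected service
    metric: str       # metric used to quantify the impact
    baseline: float   # normal value of the metric, in percent
    current: float    # value observed during the fault, in percent

    def summary(self) -> str:
        """Render the one-sentence description used in the first notification."""
        drop = self.baseline - self.current
        return (
            f"Since {self.started_at}, {self.service} is abnormal: "
            f"{self.metric} dropped from {self.baseline:g}% to {self.current:g}% "
            f"({drop:g} percentage points)."
        )


if __name__ == "__main__":
    # "order-service" is a made-up name used only for illustration.
    print(ProblemStatement("14:05", "order-service", "success rate", 99, 90).summary())
```

Keeping the baseline and the current value as separate fields forces whoever fills in the record to quantify the impact instead of writing "the service is slow".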

In a production environment, the first priority when a sudden anomaly occurs is to restore service quickly, so this article focuses on the first 2 steps of the process above.

In addition, communication runs through the entire problem-solving process, and every step requires thorough communication.

Three   Key roles

Every sudden anomaly is different, so a fully unified, fine-grained standard process is hard to define. Several key roles can nevertheless be agreed on in advance, with each role's responsibilities and key actions defined, to improve collaboration efficiency.

The main roles are:

  • Commander: Organizes and coordinates rapid fault recovery and reports progress in the fault group.

  • Correspondent: Collects and records key information, and communicates with the relevant teams through the fault group and other channels.

  • Quick recovery lead: Decides on and executes contingency plans based on the fault symptoms and the monitoring dashboard.

  • Problem diagnosis lead: Locates the root cause of the fault. This role is crucial when quick recovery does not work.

Each role is described in detail below.

1  Commander

Commander selection

  • First responder: By default, the first technician to receive the alarm, complaint, or feedback acts as commander. The first responder judges whether they can take command, or whether there is a contingency plan they know well and have fully rehearsed. If so, they restore service immediately; otherwise they contact the full-time commander to take over. Until the full-time commander takes over, the first responder remains the default commander.

  • Full-time commander: The team leader (TL) and the team's stability owner are the best commanders for most risks. Once the emergency team has established contact, command can be handed over to the TL or to the stability owner within the team.

  • Higher-level TLs: When the duration and severity of the fault keep rising, escalate as the situation requires and have a higher-level TL take over as commander in order to bring in more resources.
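The selection rules above can be sketched as a small decision function. This is an illustration only: the names `Candidates` and `pick_commander` are hypothetical, and the 30-minute escalation threshold is an assumption, since the article only says to escalate "according to the actual situation".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidates:
    """People who could act as commander (hypothetical structure)."""

    first_responder: str
    full_time_commander: Optional[str] = None  # TL or stability owner, once reachable
    higher_level_tl: Optional[str] = None      # only known after escalation


def pick_commander(
    c: Candidates,
    first_responder_can_command: bool,
    elapsed_minutes: float,
    escalate_after_minutes: float = 30.0,  # assumed threshold, not from the article
) -> str:
    """Return who should currently hold the commander role."""
    # A long-running fault goes to a higher-level TL when one is available.
    if elapsed_minutes >= escalate_after_minutes and c.higher_level_tl:
        return c.higher_level_tl
    # A first responder who cannot command hands over once contact is made.
    if not first_responder_can_command and c.full_time_commander:
        return c.full_time_commander
    # Otherwise the first responder remains the default commander.
    return c.first_responder
```

For example, `pick_commander(Candidates("alice"), False, 2)` still returns `"alice"`: until a full-time commander is reachable, the first responder keeps the role.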

The commander's key actions

  • Identify the problem: Establish the symptoms and the impact of the incident.

  • Identify roles: Assign the key roles for this incident, namely the correspondent, the quick recovery lead, and the problem diagnosis lead.

  • Communicate upward: Make key people in the organization aware of the issue so that more people and resources can be mobilized quickly when needed.

  • Coordinate: Help the quick recovery lead and the problem diagnosis lead solve the problem by providing support such as information and domain experts.

Requirements for commanders

  • Start-up: Identify the people involved and set up the emergency team through a video conference, a fault group, or similar means.

  • Early stage: Follow the quick recovery lead's progress closely and give priority to getting quick recovery done rather than analyzing the root cause. When quick recovery does not take effect, keep exploring other possible means of rapid recovery, such as rolling back recent changes. In past faults whose duration missed the 1-5-10 target, the commander was in most cases busy analyzing the root cause and missed the best window for quick recovery.

  • Middle stage: If many attempts have failed to restore service, the focus gradually shifts to the problem diagnosis lead and to finding the root cause. A fault that has not recovered by this stage is usually a major fault, and the 1-5-10 target is essentially out of reach.

  • Late stage: Organize the team to keep observing and confirm that the problem does not recur. Organize the follow-up and remediation work.

2  Correspondent

If the fault cannot be recovered through a contingency plan right away, the correspondent becomes the most important role after the commander. Collecting and organizing information efficiently helps the whole emergency team find a solution faster.

Correspondent selection

  • Full-time correspondent: A team member who has a reasonable grounding in stability work and who is usually not the first choice for quick recovery lead or problem diagnosis lead.

  • Other team members who are not involved in problem diagnosis or quick recovery.

Correspondent key actions

  • Continuous problem identification and notification: The symptoms and the scope of impact change over time, so regular notifications are required through the fault group, conference calls, and other channels. In the early stage, issue a briefing every 5 minutes; as time goes on, the interval can be extended to 15 or 30 minutes.

  • Information gathering: Create a single document for the problem from the standard template, and post the link in the group announcement and in the fault group. Keep the document updated with the key information collected, so that people who join the emergency team later can quickly understand the context.

  • Collect public sentiment: This overlaps with information gathering but is singled out because it is easily overlooked. Engineers tend to get caught up in technical metrics and pay too little attention to public sentiment.

  • External communication: Contact the head of customer service and work with the customer service team to reassure customers.
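The briefing cadence described in the first action (every 5 minutes early on, then 15 or 30 minutes) can be written down as a simple lookup. The sketch below is hypothetical: the article does not say when to switch intervals, so the 30-minute and 120-minute boundaries are assumptions chosen only for illustration.

```python
def briefing_interval_minutes(
    elapsed_minutes: float,
    early_until: float = 30.0,   # assumed end of the early stage
    mid_until: float = 120.0,    # assumed end of the middle stage
) -> int:
    """Return how many minutes to wait before the next briefing."""
    if elapsed_minutes < early_until:
        return 5    # early stage: brief every 5 minutes
    if elapsed_minutes < mid_until:
        return 15   # later: every 15 minutes
    return 30       # long-running fault: every 30 minutes


def briefing_times(total_minutes: float) -> list:
    """List the minute offsets at which briefings are due, starting at 0."""
    times, t = [], 0.0
    while t <= total_minutes:
        times.append(t)
        t += briefing_interval_minutes(t)
    return times
```

With these assumed boundaries, `briefing_times(60)` yields offsets every 5 minutes up to minute 30, followed by minutes 45 and 60.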

Key requirements for correspondents

  • Be quick in the early stage: Collect key information quickly. Within the golden first 10 minutes, update the information every minute and keep everyone informed.

  • Timely notification: A good notification announces when the next one will come, for example: problem xx is being handled by yy, the current status is zzz, and the next briefing will follow in xx minutes. With reliable and timely notifications, people who care about the issue only need to watch the bulletins, which avoids unprofessional interference that slows down the emergency team's response.

  • Contact external support: When external dependencies such as OSS or MySQL are involved, find the external contacts through the commander, the application Owner, or other channels, bring them into the emergency team promptly, and brief them on the problem context.
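The notification format in the second requirement (what the problem is, who is handling it, the current status, and when the next briefing comes) lends itself to a fixed template. The following Python sketch is one possible rendering; the function name `format_bulletin`, the message wording, and the sample values are assumptions and not a template from the article.

```python
from datetime import datetime, timedelta


def format_bulletin(
    problem: str,
    handler: str,
    status: str,
    now: datetime,
    next_in_minutes: int,
) -> str:
    """Build a bulletin that always states when the next one will arrive."""
    next_time = now + timedelta(minutes=next_in_minutes)
    return (
        f"[{now:%H:%M}] {problem} is being handled by {handler}. "
        f"Current status: {status}. "
        f"Next briefing in {next_in_minutes} minutes, at {next_time:%H:%M}."
    )


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(
        format_bulletin(
            "Order service success rate drop",
            "the payments team",
            "rollback in progress",
            datetime(2022, 6, 25, 14, 5),
            5,
        )
    )
```

Making the next briefing time a required argument means a bulletin cannot be sent without it, which is the property the requirement emphasizes.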

3  Quick recovery lead

The expectation is that every risk can be resolved through quick recovery. If a plan does not work, the first step is still to explore other feasible quick recovery options, such as a rollback.

Quick recovery lead selection

  • Application Owner or core team member.

  • Team members who have executed the application's contingency plans: Teams are encouraged to cross-execute each other's plans, so that when an application Owner cannot be reached, other team members can still help recover the service through the plans.

Key actions of the quick recovery lead

  • Execute the quick recovery plan: Based on the symptoms, open the dashboard associated with the contingency plans and execute the plan that matches the monitoring metrics shown on it.

  • Develop other candidate recovery options: When the known quick recovery plans do not take effect, analyze recent changes and other possible factors, and try to recover by means such as rollback. When necessary, ask the commander to coordinate more people for support.

Key requirements for the quick recovery lead

  • Treat service restoration as the first priority and leave root cause analysis to the problem diagnosis lead.

  • When the established plans do not bring a quick recovery, keep exploring other possible means of recovery.
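When the established plans fail and rollback becomes the next option, a first step is to list which recent changes could be rolled back. The sketch below shows one way to do that, assuming change records are available as name and deployment time; the `Change` type, the `rollback_candidates` function, and the 24-hour lookback window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    """A deployed change (hypothetical record format)."""

    name: str
    deployed_at: datetime


def rollback_candidates(
    changes: Iterable[Change],
    fault_started_at: datetime,
    lookback: timedelta = timedelta(hours=24),  # assumed window
) -> List[Change]:
    """Return changes deployed shortly before the fault, most recent first."""
    window_start = fault_started_at - lookback
    recent = [
        c for c in changes
        if window_start <= c.deployed_at <= fault_started_at
    ]
    # The latest change before the fault is usually examined first.
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)
```

Changes deployed after the fault began are excluded on purpose, since they cannot have caused it; whether to roll back any given candidate remains a judgment call for the quick recovery lead.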

4  Problem diagnosis lead

Ideally this role is not needed during the 1-5-10 recovery phase. But when quick recovery fails and there is no effective way to restore service in a short time, the root cause ultimately has to be found by the problem diagnosis lead, who then develops a solution.

Problem diagnosis lead selection

  • Application Owner or core team member: People who know the relevant code are best suited to problem diagnosis.

  • Domain expert: For network problems, for example, experts in that field can be brought in from across the group to assist.

Key requirements for the problem diagnosis lead

  • Find the root cause of the problem based on the information collected.

  • Ask the commander and the correspondent to invite external support into the emergency team.

Four   Closing thoughts

Fault emergency response is the last chance to preserve the high availability of the system, and an unprofessional performance here is the final, complete failure of stability work. Therefore, just like contingency plan drills, fault emergency response needs dedicated training. Opportunities to practice include:

  • Real fault scenarios.

  • Red-blue confrontation drills: Working together with SRE, simulate a fault without advance notice.

  • Escalating an ordinary alarm: The TL or the stability owner randomly selects an SMS alarm, artificially escalates it to a fault, and starts the fault emergency response process.
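The random selection in the last drill type can be made reproducible, so that the choice can be audited afterwards. This is a minimal sketch using Python's standard `random` module; the function name `pick_drill_alarm` and the sample alarm names are made up for illustration.

```python
import random
from typing import Optional, Sequence


def pick_drill_alarm(alarms: Sequence[str], seed: Optional[int] = None) -> str:
    """Pick one alarm at random to escalate into a drill fault.

    Passing a seed makes the choice repeatable; leaving it as None
    gives a different pick on each call.
    """
    if not alarms:
        raise ValueError("no alarms to choose from")
    return random.Random(seed).choice(alarms)


if __name__ == "__main__":
    # Hypothetical alarm names.
    recent_alarms = ["disk-usage-high", "rpc-timeout-spike", "queue-backlog"]
    print(pick_drill_alarm(recent_alarms))
```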

Copyright notice
This article was written by [Wenxiaowu]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/176/202206251209320530.html