Operation and maintenance specification: process template for online fault handling
2022-06-21 23:15:00 【Brother Xing plays with the clouds】
How to handle an online incident, and what to document while it is happening.
Incident handling process
Basic principle: business recovery takes the highest priority in every measure and action taken during troubleshooting.
Process mechanism
- Once a fault is found, the on-call SRE or operations engineer acts as incident commander and has the authority to pull in the relevant business developers or any other necessary resources and quickly assemble an incident team.
- If the problem and the recovery procedure are clear, the SRE or operations engineer remains incident commander with no handover, directs everyone's specific tasks, and puts business recovery first.
- If the problem is hard to diagnose or the impact is large, the SRE can ask a higher-level manager, such as the SRE supervisor or a director, to step in. The general principle is that whoever's business is most affected leads the effort. The SRE then hands the incident commander role over to that manager. If the whole site is affected, the VP of Technology or the CTO can take on the incident commander role if necessary, or authorize a director to do so.
- After the problem is solved, functional verification is required.
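The hand-off rules above can be sketched as a small decision function. This is only an illustration: the names `Scope`, `Incident`, and `pick_commander` and the returned role labels are assumptions made for the example, not part of the original process.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    SINGLE_BUSINESS = 1  # one business line is affected
    WHOLE_SITE = 2       # the entire site is affected


@dataclass
class Incident:
    cause_is_clear: bool  # problem and recovery steps are well understood
    scope: Scope


def pick_commander(incident: Incident) -> str:
    """Return the role that should act as commander for this incident."""
    if incident.cause_is_clear:
        # Clear problem: the on-call engineer keeps command, no handover.
        return "on-call SRE"
    if incident.scope is Scope.WHOLE_SITE:
        # Whole-site impact: command may go up to the VP/CTO level.
        return "VP of Technology or CTO (or a delegated director)"
    # Hard problem with limited scope: the most affected business leads.
    return "supervisor or director of the most affected business"
```

In practice the decision is a judgment call by the people involved; the function only makes the order of the checks explicit.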
Detailed flow chart
```sequence
OnCall SRE->Fault: Find the fault
OnCall SRE->OnCall SRE: Preliminary analysis of the cause
OnCall SRE->Incident team: Gather business developers or other necessary resources
Incident team->Incident team: Status feedback (every 10-15 minutes)
Incident team->Investigation: Investigate the incident
OnCall SRE-->Senior manager: Hard problem or large impact, escalate
Senior manager-->Incident team: Take full control, coordinate the next steps
Investigation->Investigation: Check recent releases
Investigation->Investigation: Check services and infrastructure
Investigation->Investigation: Solve the problem
Investigation->Incident team: Investigation records
Fault->Recovery: Run recovery verification
Recovery->Incident team: Report recovery results
OnCall SRE->Postmortem: Organize the postmortem meeting
Note right of Postmortem: Summarize the causes, resolve the problems
Postmortem->Incident team: Publish meeting minutes and fault report
```
Incident symptoms
Who reported what problem and when. Be as detailed as possible, for example device ID and user ID.
Incident frequency
Intermittent or always reproducible.
Reproduction steps
Written so that everyone can reproduce the problem.
Incident timeline
Record, as a time-ordered list of events, the operations performed before and during the incident.
Note: times can be recorded precisely.
Recorder: (a designated person keeps the record)
| Time | Event | Operator | Remarks |
|---|---|---|---|
| 2021/09/28 12:20:20 | Raise LB bandwidth from 10Mb to 20Mb | | |
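If the recorder keeps the timeline in a script rather than by hand, a minimal sketch might look like the following. `TimelineEntry` and `render_timeline` are hypothetical names chosen for this example; the column layout mirrors the table above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    time: datetime
    event: str
    operator: str = ""
    remarks: str = ""


def render_timeline(entries):
    """Render the entries, sorted by time, as a Markdown table."""
    lines = ["| Time | Event | Operator | Remarks |", "|---|---|---|---|"]
    for e in sorted(entries, key=lambda e: e.time):
        stamp = e.time.strftime("%Y/%m/%d %H:%M:%S")
        lines.append(f"| {stamp} | {e.event} | {e.operator} | {e.remarks} |")
    return "\n".join(lines)
```

Sorting on output means entries can be appended in whatever order they are reported during the incident.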
Incident team
The incident responders form an incident group so that communication is easy.
Set up a dedicated emergency group and add the key roles for the affected products to it. When a fault occurs, report it to the group as soon as possible.
Incident feedback
Feedback is generally required per team, once every 10 to 15 minutes, covering current progress and the next action. If an operation needs to be carried out in the meantime, announce it in advance, and include its impact on the business and the system in the announcement. The incident commander makes the decision before anything is executed, which avoids mistakes made in a rush. No progress is also progress and should be reported just as promptly.
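The 10 to 15 minute feedback rule is easy to lose track of during an incident. As a rough sketch, a commander's helper could flag teams whose last update is too old. The function name `overdue_teams`, the team names used with it, and the choice of 15 minutes as the cutoff are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Upper bound of the 10-15 minute feedback window (assumed cutoff).
FEEDBACK_INTERVAL = timedelta(minutes=15)


def overdue_teams(last_feedback, now, interval=FEEDBACK_INTERVAL):
    """Return the teams whose last status update is older than `interval`.

    `last_feedback` maps a team name to the time of its latest update.
    """
    return sorted(team for team, t in last_feedback.items() if now - t > interval)
```

A team with nothing new to report should still post an update, which resets its timestamp.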
Incident investigation
Recent release information
Can include the commit ID of the most recently released system, the release time, the people involved, and so on.
Test feedback
Feedback from testers during troubleshooting, which helps developers locate the problem.
| Time | Test case | Result | Recorder | Remarks |
|---|---|---|---|---|
| 9/28 11 PM | Log in on the app | Success | Zhang San | |
| 9/28 11 PM | Log in on the device | Failure | Li Si | |
Service status
Each service in the team should have a corresponding owner. When an online fault occurs, every owner checks the services they are responsible for. The inspection must be documented and the evidence kept.
| Time | Service name | Checks and results | Current state | Inspector | Remarks |
|---|---|---|---|---|---|
| 9/28 10:30 | echo | values configuration is correct; application version: v2.1; CPU and memory are normal year-on-year and period-on-period; application configuration is correct; ERROR-level logs rose sharply from 7:00 on 9/28. | | Zhang San | |
Infrastructure
The infrastructure team is responsible for checking the infrastructure.
| Time | Component | Checks | Current state | Inspector | Remarks |
|---|---|---|---|---|---|
| 9/28 10:00 | LB | Bandwidth, packet flow rate | | | |
| 9/28 10:00 | NAT | | | | |
| 9/28 10:00 | Redis | | | | |
| 9/28 10:00 | PostgreSQL | | | | |
| 9/28 11:00 | Domain name resolution | | | | |
Incident investigation records
A "hypothesis" is an assumption made by the troubleshooters about the cause of the fault.
The purpose of this table is to keep different people from checking the same hypothesis twice. It also makes it easy for others to verify the results.
| Time | Hypothesis | Investigation method | Result | Investigator | Remarks |
|---|---|---|---|---|---|
| 9/28 10:00 | There is a business logic error in the login phase | | | | |
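The deduplication goal of the hypothesis table can be sketched as a small registry in which a hypothesis is claimed once and later claims return the existing entry. `Hypothesis`, `InvestigationLog`, and the status strings are hypothetical names for this example, and the normalization only catches differences in case and whitespace.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    investigator: str
    method: str = ""
    result: str = "open"  # e.g. "open", "confirmed", "ruled out"


@dataclass
class InvestigationLog:
    _by_key: dict = field(default_factory=dict)

    @staticmethod
    def _key(statement: str) -> str:
        # Normalize case and whitespace so trivially different wordings collide.
        return " ".join(statement.lower().split())

    def claim(self, statement: str, investigator: str) -> Hypothesis:
        """Register a hypothesis, or return the existing entry if already claimed."""
        key = self._key(statement)
        if key not in self._by_key:
            self._by_key[key] = Hypothesis(statement, investigator)
        return self._by_key[key]
```

A second person claiming the same statement gets back the first investigator's entry and can see who is already on it. Differently worded versions of the same idea still need a human to spot them.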
Incident recovery
The verification process after the incident has been fixed.
Recovery verification
Testers and product owners verify that the business functions work normally.
| Time | Test case | Result | Recorder | Remarks |
|---|---|---|---|---|
| 9/28 11 PM | Log in on the app | Success | Zhang San | |
| 9/28 11 PM | Log in on the device | Success | Li Si | |
Postmortem summary
Postmortem meeting
A meeting must be held to talk the incident through.
The three golden questions:
- First question: What caused the failure?
- Second question: What should we do, and what can be done, to ensure that similar failures do not happen again?
- Third question: What could we have done at the time to restore business sooner?
Prevention plan
Since a meeting is held, it must produce a prevention plan.
Follow-up actions
Follow-up actions can be tracked on a Kanban board. Each action must be executable and specific.
| Action | Executor | Verifier | Planned completion time | Actual completion time |
|---|---|---|---|---|
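When no Kanban tool is available, the action table can be tracked with a few lines of code. The sketch below lists actions that are still open past their planned date. `ActionItem` and `overdue` are names invented for this example; the fields follow the table columns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    executor: str
    verifier: str
    planned: date                     # planned completion date
    completed: Optional[date] = None  # actual completion date, if done


def overdue(items, today):
    """Return actions that are still open past their planned completion date."""
    return [i for i in items if i.completed is None and i.planned < today]
```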