
What is SRE? A detailed explanation of SRE operation and maintenance system

2022-06-24 08:47:00 【JackieZhengChina】

In any enterprise of a certain scale, once the SRE model of operations and maintenance is adopted, building an observability system becomes especially important.

Observability system

Metrics monitoring: monitoring of all kinds of metrics, such as basic resource metrics, service performance metrics, and business call metrics.
Logs: monitoring of the run logs of devices and services.
Call chains (tracing): business-level call chain analysis, typically used in distributed systems to help business operations, development, and operations staff quickly locate the bottleneck in an overall call.

A complete observability system gives you insight into the system, letting you track its health and availability and see what is happening inside it.

Two things deserve attention when building an observability system:

Decide what the quality standards are, and make sure the system keeps approaching those standards or stays within their limits.
Approach the work systematically rather than looking at the system at random.

In my view, an enterprise-level observability system should have at least the following features:

Complete metrics collection: it can collect the monitoring metrics for most of the equipment and technology stacks in the enterprise. It also ships with metric sets for common equipment, so that new devices and metrics can be onboarded quickly instead of building the monitoring for every device from scratch. It supports log data collection as well.
Support for massive numbers of devices: the number and scale of enterprise IT systems keep growing, so the monitoring system must be able to monitor a very large number of devices.
Monitoring data storage and analysis: monitoring data is the foundation of operations analysis, operations automation, and intelligent operations, so storing monitoring data at scale and analyzing it visually are basic capabilities of a monitoring system.
The observability system is the foundation of the whole operations system and has to supply it with data.

An enterprise-level observability system should therefore be built as a platform. On the one hand, more operations metrics can be onboarded through configuration or development. On the other hand, more specialized operations tools can be connected, so that multiple sources of operations data are integrated and linked, and data services are provided for more operations scenarios. Overall, the observability system gives enterprise operations a data foundation, so that incident response and capacity forecasting rely on data rather than on past experience and off-the-cuff decisions.

Fault response

If something goes wrong, how are people notified and how do they respond? Tools help here, because they let you define the rules by which people are alerted.
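As a rough illustration of what such a rule might look like, the sketch below fires only when a metric stays above a threshold for several consecutive samples. The names `AlarmRule` and `should_fire`, and the example threshold, are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical rule: fire when a metric stays above a threshold."""
    name: str
    threshold: float   # fire when the value is strictly above this
    for_samples: int   # number of consecutive most-recent samples required


def should_fire(rule: AlarmRule, samples: Sequence[float]) -> bool:
    """Return True if the last `for_samples` values all exceed the threshold."""
    if rule.for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    if len(samples) < rule.for_samples:
        return False  # not enough data yet to decide
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)


cpu_rule = AlarmRule(name="cpu_high", threshold=90.0, for_samples=3)
print(should_fire(cpu_rule, [50, 95, 96, 97]))
```

Requiring several consecutive samples is one simple way to keep a single spike from paging someone.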

Fault response builds on the data produced by the observability system and, through feedback loops, helps us strengthen our monitoring of services.

Fault response usually includes the following actions :

Attention: whether bottlenecks or anomalies are found proactively or are exposed passively by the observability system, we should take the initiative to pay attention to them.
Communication: inform the relevant parties of the observed risk points promptly, along with the affected scope and the proposed remedial measures.
Recovery: once the parties involved have reached agreement, fix the risk points and anomalies according to the remedial measures.

Note that if the observability system is built well early on, a fault usually begins as a simple alarm message or a trouble call. In practice, however, the observability system is often only good enough for tracing and investigation after the fact and cannot detect problems in time. In that case we need to evaluate alarm conditions against the observed data and notify the relevant people promptly, so that risk points are exposed.

Alarming is only the first link in fault response: it addresses how a fault is discovered. Most fault response work is about defining handling strategies and providing training, so that people know what to do when an alarm arrives. This part is usually distilled from past operational experience, including abstracting that experience and turning it into tools, so that fault response is efficient and general (that is, it does not depend on any individual's experience).

For the system as a whole, what must be ensured is the effectiveness of alarms; otherwise the alarm system is likely to degenerate into a junk data generator. Alarm effectiveness means meeting the following two requirements:

Alarm timeliness: when a problem occurs in the system, the alarm must notify the operations staff in time for them to handle it.
Alarm accuracy: whenever an alarm is raised, there must be a real problem in the system. (Many enterprises have a lot of useless alarms, for example for disk or memory usage; this comes down to automation, the nature of the business, and alarm thresholds.)

In day-to-day operations we often see a great deal of irrelevant alarm information, which makes operations staff lose focus in a sea of alarms. Leaders outside the operations field usually also pay attention to how well alarms are responded to. Suppressing and eliminating invalid alarms, so that operations staff are not engulfed by alarm storms, is therefore a key part of alarm management.

In general, once the observability system is built and the various kinds of monitoring data are integrated into the monitoring platform, techniques such as trend forecasting, short-cycle detection, intermittent recovery, baseline judgment, and repeat compression can be applied to compress and converge alarms and make them more effective.
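Of the techniques listed above, repeat compression is the easiest to sketch. The example below keeps an alarm only if no alarm with the same source and kind was kept within a time window. The `Alarm` fields and the 300-second window are assumptions made for illustration; a real platform would have its own alarm schema.

```python
from typing import Iterable, NamedTuple


class Alarm(NamedTuple):
    timestamp: float  # seconds since some reference point
    source: str       # device or service that raised the alarm
    kind: str         # alarm type, e.g. "disk_full"


def compress_repeats(alarms: Iterable[Alarm], window_seconds: float = 300.0) -> list[Alarm]:
    """Drop alarms that repeat a kept alarm of the same (source, kind) within the window."""
    last_kept: dict[tuple[str, str], float] = {}
    kept: list[Alarm] = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.source, alarm.kind)
        previous = last_kept.get(key)
        if previous is None or alarm.timestamp - previous >= window_seconds:
            kept.append(alarm)
            last_kept[key] = alarm.timestamp
    return kept
```

The window is measured from the last alarm that was kept, so a continuously repeating alarm is still re-sent once per window rather than silenced forever.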

At the same time, for front-line operations staff, we need to model and analyze multiple monitoring metrics of the same system or device together and summarize them into a health score. A tiered evaluation scheme based on this health score gives front-line staff a true and direct picture of the system's running state and lets them narrow down problems quickly.

For example, a weighted calculation over several basic resource metrics can evaluate the overall utilization of a resource; and a score computed from the utilization of all resources associated with an application, together with a model of that application's operations architecture, can evaluate the application's overall health.
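A minimal sketch of such a weighted calculation follows. It assumes, purely for illustration, that each metric is a utilization fraction between 0 and 1 and that higher utilization means lower health; the function name, metric names, and weights are all hypothetical.

```python
def health_score(utilization: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted health score in [0, 100] from utilization fractions in [0, 1].

    Assumption: health is 100 minus the weighted average utilization (in percent).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_utilization = sum(
        # Clamp each metric to [0, 1] so a bad sample cannot push the score out of range.
        weights[name] * min(max(utilization[name], 0.0), 1.0)
        for name in weights
    ) / total_weight
    return round(100.0 * (1.0 - weighted_utilization), 1)


score = health_score(
    {"cpu": 0.5, "mem": 0.5, "disk": 0.5},
    {"cpu": 1, "mem": 1, "disk": 2},
)
print(score)
```

A real scoring model would likely also account for the application's architecture, as the text describes; this sketch covers only the weighted-average step.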

As this process matures, existing solutions can be wired to alarms to form a closed loop. A simple scenario: when a disk is full, the alarm first triggers a standardized disk inspection that deletes data known to be discardable. If that still does not clear the alarm, it is escalated directly to front-line operations staff for manual intervention, and the experience is later distilled into a standard procedure.
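The disk-full scenario above can be sketched as a small closed loop: try the automated cleanup first, and escalate to a person only if the disk is still too full. The function name, the callbacks, and the 90% threshold are hypothetical; the callbacks stand in for whatever the platform actually uses to measure usage, clean up, and page someone.

```python
from typing import Callable


def handle_disk_full(
    disk_usage: Callable[[], float],
    cleanup: Callable[[], None],
    escalate: Callable[[str], None],
    threshold: float = 0.9,
) -> str:
    """Run automated cleanup first; escalate only if usage is still above the threshold."""
    if disk_usage() < threshold:
        return "ok"
    cleanup()  # standardized inspection that deletes discardable data
    if disk_usage() < threshold:
        return "auto_resolved"
    escalate("disk still above threshold after automated cleanup")
    return "escalated"
```

Passing the three actions in as callables keeps the decision logic testable without touching a real disk.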

Postmortem review

A postmortem reviews and summarizes past service anomalies and outages to make sure the same problem does not happen again. To get everyone working together, we want to build a blameless, transparent postmortem culture. Individuals should not be afraid of incidents; they should be confident that if an incident happens, the team will respond and improve the system.

Note: in SRE culture in China, usually only large incidents with a significant business impact are reviewed. If time and energy allow, however, ordinary incidents should also be reviewed on a small scale, since major faults accumulate from small problems. Individual operations engineers should also review small faults promptly, to keep strengthening their own ability to handle and repair faults.

In my view, a key consensus in SRE is to admit that systems are imperfect: pursuing a system that never goes down is unrealistic. Given imperfect systems, we have to face and live through faults and failures.

What matters, then, is not finding the person responsible for a fault, but finding its root cause and working out how to avoid the same fault again. System reliability is the goal of the whole team: recover quickly from failure and learn from it, so that everyone can raise problems without fear and, when an outage does happen, work to improve the system.

Note: in many enterprises, the postmortem process inadvertently turns root-cause analysis into assigning blame, followed by a series of punishments, and uses disciplinary measures to force agreement about why the fault occurred. This is highly undesirable. Nobody wants an incident to happen; incidents arise either from gaps in understanding or from flaws in the rules. Nobody knowingly sets out to cause a fault.

Remember: a fault is something we can learn from, not something to be afraid or ashamed of!

In daily operations, incidents such as faults are actually a good opportunity to learn. Using historical monitoring data, we analyze the root cause of the incident and develop follow-up strategies, then use the operations platform to turn those strategies into standardized, reusable, automated operations scenarios that provide a standard and fast solution the next time the same problem occurs. This is the real value of the postmortem.

Test and release

Testing and release mainly serve a preventive function for overall stability and reliability. Prevention means trying to limit the number of incidents and making sure that the infrastructure and services stay stable when new code is released.

As someone who has worked in operations for a long time, perhaps my biggest fear is a new application release. Apart from hardware and network equipment failures, which are low-probability events on the level of natural disasters, the day after a new release is usually a high-risk period for downtime and incidents. That is why some large products impose a change freeze on the eve of holidays and major events, to avoid business bugs introduced by a new version going live.

Testing is about finding the right balance between cost and risk. If you take on too much risk, you will be worn out dealing with system failures; if you are too conservative, you cannot release new things quickly enough for the business to survive in the market.

When the error budget is ample (that is, faults have caused little downtime over a period), testing resources can be reduced and the testing requirements and go-live conditions relaxed, so that more features ship and the business stays agile. When the error budget is low (that is, faults have caused a lot of downtime over a period), testing resources should be increased and go-live testing tightened, so that potential risks are surfaced more effectively, outages are avoided, and the system stays stable. The balance between agility and stability has to be borne jointly by the operations team and the development team.
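The policy described above is commonly known in SRE as an error budget policy. The sketch below shows one way it could be computed: derive the allowed downtime from an availability target, then relax or tighten release testing depending on how much of that budget remains. The function names and the 50% cut-off are assumptions for illustration, not values from the article.

```python
def error_budget_minutes(slo_target: float, period_minutes: float) -> float:
    """Downtime allowed in the period for an availability target such as 0.999."""
    return (1.0 - slo_target) * period_minutes


def release_policy(
    slo_target: float,
    period_minutes: float,
    downtime_minutes: float,
    tighten_below: float = 0.5,  # hypothetical cut-off on the remaining budget fraction
) -> str:
    """Return "relax" when plenty of budget remains, otherwise "tighten"."""
    budget = error_budget_minutes(slo_target, period_minutes)
    if budget <= 0:
        return "tighten"  # a 100% target leaves no room for risk
    remaining_fraction = (budget - downtime_minutes) / budget
    return "relax" if remaining_fraction >= tighten_below else "tighten"


# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
thirty_days = 30 * 24 * 60
print(release_policy(0.999, thirty_days, downtime_minutes=5))
print(release_policy(0.999, thirty_days, downtime_minutes=40))
```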

Besides testing, application release is also a routine responsibility of the operations team. One SRE principle is to turn all repeatable work into code and tools. In addition, the complexity of releasing an application grows in proportion to the complexity of the system. So in enterprises whose application systems have reached a certain scale, an automated release process built on an automation framework is usually already under way.

With automated release tools, we can build a pipeline that fully automates every operation in the deployment process (such as compiling and packaging, release to the test environment, production preparation, alarm suppression, stopping services, database changes, application deployment, and restarting services).
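At its simplest, such a pipeline is an ordered list of named steps that stops at the first failure. The sketch below is a toy version under that assumption; `run_pipeline` and the step names are hypothetical, and the lambdas stand in for real build and deployment actions.

```python
from typing import Callable, Sequence

Step = tuple[str, Callable[[], None]]


def run_pipeline(steps: Sequence[Step]) -> list[str]:
    """Run steps in order and return their names; stop at the first step that raises."""
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            # Later steps are skipped so a failed build is never deployed.
            raise RuntimeError(f"step {name!r} failed after {completed}") from error
        completed.append(name)
    return completed


release_steps: list[Step] = [
    ("compile_and_package", lambda: None),
    ("suppress_alarms", lambda: None),
    ("stop_service", lambda: None),
    ("deploy_application", lambda: None),
    ("restart_service", lambda: None),
]
print(run_pipeline(release_steps))
```

A production pipeline would also need rollback and a step that lifts the alarm suppression again; those are left out here to keep the sketch short.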

Capacity planning

Capacity planning is about predicting the future and discovering the limits of the system; it also ensures that the system can be improved and strengthened over time.

The main goal of planning is to manage risk and expectations. For capacity planning, this means scaling capacity to match the whole business. The expectation to manage is that people want services to keep responding as the business grows. The risk is spending time and money on additional infrastructure to cope with that growth.

Capacity planning is first of all a predictive analysis and judgment about the future, and its predictions rest on massive amounts of operations data. So in addition to the relevant architecture and planning teams, a comprehensive operations data center is a necessary facility for system capacity planning.

Capacity trend warning and analysis collects, organizes, cleans, and stores in structured form all kinds of operations data from sources such as operations monitoring and process management, integrating the operations data from different tools and building various data themes.

The data in these themes helps operations staff answer questions such as:

What is the current capacity?
When will the capacity limit be reached?
How should capacity be changed?
How should capacity planning be carried out?
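The second question, when the capacity limit will be reached, can be approximated with a simple trend line. The sketch below fits a least-squares line to equally spaced daily usage samples and extrapolates to the capacity limit. It assumes roughly linear growth, which real workloads often violate, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate days from the last sample until usage reaches capacity.

    Returns 0.0 if the last sample is already at or above capacity,
    and None if the fitted trend is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    if daily_usage[-1] >= capacity:
        return 0.0
    # Ordinary least squares with x = 0, 1, ..., n-1 (one sample per day).
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    return max(day_full - (n - 1), 0.0)


# Usage grows by 10 units per day from 100; capacity is 200.
print(days_until_full([100, 110, 120, 130], capacity=200))
```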

The operations platform must provide not only the necessary data but also the necessary data visualization capabilities, so that operations staff can make better use of the operations data when evaluating capacity.

First, the operations platform needs strong data retrieval. It stores massive amounts of operations data, and when operations staff try to set up and verify an exploratory scenario, they often retrieve and query specific data repeatedly. If queries on the platform are slow or can only be made along a few dimensions, setting up the scenario takes a long time or cannot proceed at all. The platform should therefore offer keyword search, statistical functions, single-condition, multi-condition, and fuzzy multi-dimensional search, and return results over massive data within seconds, to help operations staff analyze data more effectively.

Second, the platform needs strong data visualization. As the saying goes, "a picture is worth a thousand words". Operations staff often run statistical analysis on the operations data of each system and generate real-time reports, performing multi-dimensional, multi-angle, in-depth analysis, prediction, and visualization of all kinds of operations data (such as application logs, transaction logs, and system logs), and use these to communicate their predictions and findings to others.

Automation tool development

SRE is not just about operations; it is also about software development. Here this means developing tools and platforms related to operations and the SRE domain. In Google's SRE system, SRE engineers spend about half their time developing new tools and services. Some of these tools automate manual tasks, and the rest fill gaps in and repair other systems within the SRE system.

Free yourself and others from repetitive work by writing code: if a job does not need a human, write code so that humans do not have to be part of it.

SREs despise repetitive work from the bottom of their hearts, and move from the original manual, passive response toward a more efficient, more automated operations system.

Figure: automated operations framework

The advantages and necessity of automated operations tools:

Higher efficiency: work is carried out automatically by programs, which effectively reduces the operations headcount required and frees operations staff to put their energy into more important areas.
Standardized operations: a large number of complicated, error-prone manual operations are brought behind a unified operations entry point and carried out through a graphical interface rather than the command line, which makes operations more manageable. It also reduces manual mistakes caused by the operator's mood and avoids tragedies like "delete the database and run away".
Passing on operations experience: automation tools capture, as code, the experience accumulated by many operations teams, and deliver it as automated operations through a graphical interface. Those who join the team later can inherit, reuse, and optimize these tools. Passing work on as code turns individual ability into team ability and reduces the impact of staff turnover.

An automated operations system must be built around operations scenarios. These scenarios are created and refined through repeated iteration in the enterprise and are its most common operations scenarios.

Common operations scenarios include software installation and deployment, application release and delivery, asset management, automatic alarm handling, fault analysis, resource requests, and automated inspection. The automated operations system should therefore support configuring many different types of automated jobs, so that more scenarios can be implemented through simple script development, scenario configuration, and visually customized workflows.

User experience

At the user experience layer, the point is that for SRE, ensuring business stability and availability from the user's perspective is the ultimate goal. Operations staff in the traditional sense do not pay attention to this, because they usually only consider whether the systems or underlying resources they maintain are stable. But the stability of the business as a whole is what SRE needs to care about, and the stability and availability of a service usually have to be simulated and measured from the user's perspective.

All of the SRE work areas mentioned above, whether monitoring, incident response, postmortems, testing and release, capacity planning, or building automation tools, exist to give the system's users a better business experience. We therefore need to pay attention to the user experience throughout operations work.

In actual operations work, we can often obtain user experience information about a service from application logs, monitoring data, synthetic probing of the service, and similar sources. On the operations data platform, this user experience monitoring data can be correlated and joined to reconstruct the user's end-to-end service call chain and the relationship between each application hop and its performance data. The end result is a data path that starts from business user experience data and progressively connects to system running-state data and device running-state data, so that the operations system is centered on the end-user experience.

This user experience information plays an irreplaceable role in helping the operations team understand customers' overall experience, monitor system availability, and optimize the system in a targeted way.
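Measuring availability from the user's side, for example from periodic probe requests against the service, can be sketched as follows. A probe counts as good only if it succeeded and returned fast enough, since a slow response is a bad experience even when it technically succeeds. The `ProbeResult` fields and the 1000 ms limit are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class ProbeResult(NamedTuple):
    ok: bool            # did the probe request succeed?
    latency_ms: float   # how long the request took


def user_perceived_availability(
    results: Sequence[ProbeResult], max_latency_ms: float = 1000.0
) -> float:
    """Share of probes that succeeded and returned within the latency limit."""
    if not results:
        raise ValueError("no probe results")
    good = sum(1 for r in results if r.ok and r.latency_ms <= max_latency_ms)
    return good / len(results)


probes = [
    ProbeResult(ok=True, latency_ms=120),
    ProbeResult(ok=True, latency_ms=1500),  # succeeded, but too slow for the user
    ProbeResult(ok=False, latency_ms=30),
    ProbeResult(ok=True, latency_ms=300),
]
print(user_perceived_availability(probes))
```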

In fact, the SRE operations system puts the user experience at its core and uses automation and operations data to guarantee the continuity of application services. From this starting point it is quite different from traditional operations: we are no longer just installation and deployment engineers, but need to ensure the stability and reliability of the upper-layer business through a range of technical means.

Original site

Copyright notice
This article was written by [JackieZhengChina]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/175/202206240617075917.html
