Chaos engineering, learn about it
2022-06-23 22:44:00 【Huawei cloud developer Alliance】
Abstract: What is the background of chaos engineering? What technologies does it involve? There is plenty of discussion online, but how do you actually put chaos engineering into practice?
This article is shared from the Huawei Cloud community post "SRE is bound to move toward the era of chaos engineering: Huawei Cloud chaos engineering practice", by Li Jun.
In the digital age, IT systems carry the business, so their reliability directly affects business continuity. Large Internet and cloud computing companies usually set up a dedicated department responsible for reliability, which the industry calls SRE (site reliability engineering).

To improve system reliability, SRE work runs in two directions, active prevention and passive response, with active prevention taking up more than 80% of the time. After years of practice, SRE teams have found that chaos engineering is a very effective means of active prevention, and it has already been rolled out at scale at many Internet and cloud computing companies at the cutting edge of technology. At Huawei Cloud, chaos engineering has become, like development and testing, part of an SRE engineer's daily work. Chaos engineering simulates possible faults in advance and then comprehensively verifies a service's reliability capabilities under different fault scenarios: fault tolerance, monitoring, personnel responsiveness, and recovery. Practicing it continuously drives improvements in product reliability.

Huawei Cloud chaos engineering best practices
Treat chaos engineering as a strategic priority
Providing users with stable and reliable cloud services is one of Huawei Cloud's strategies, and chaos engineering is the core means of reliability testing, so Huawei Cloud plans for it over the long term. When the business was first established, a five-year plan was drawn up for chaos engineering, giving it a clear development blueprint. This is complemented by fine-grained annual plans, so that planning leads development and chaos engineering capability improves continuously.
Back chaos engineering with organization and culture
The second factor in implementing chaos engineering effectively at Huawei Cloud is a clear organizational guarantee. Huawei Cloud has set up a dedicated blue army team to take charge of chaos engineering, so that an independent organization ensures its sustainable development. At the same time, performance reviews and qualification requirements include thresholds related to chaos engineering, which pulls SREs into actively carrying it out. To run chaos engineering, therefore, strategic attention is not enough: organization and performance design must go hand in hand with it to build fertile soil for chaos engineering to grow.

Build a standard drill environment
The third important condition for implementing chaos engineering is a drill environment that is close to production. Chaos engineering is in essence a test of system reliability. Unlike functional tests in the R&D phase, these tests usually need a certain amount of business traffic and an environment built much like the production system in order to expose potential problems effectively. Most companies have a grayscale (canary) environment that every new version passes through first. Chaos tests can run there for a period of time during the rollout, and the version is pushed to the production system only after it passes.

Fault scenario analysis based on FMEA and the five-dimensional method
To test effectively with chaos engineering, Huawei Cloud has settled on FMEA plus five-dimensional fault scenario analysis. FMEA stands for failure mode and effects analysis, a method that effectively guides SREs in analyzing scenarios. On top of FMEA, Huawei Cloud has summarized a five-dimensional fault analysis method: if a business is analyzed from the five perspectives of redundancy, disaster recovery, overload, dependency, and data backup, its fault scenarios can be covered completely. Redundancy covers scenarios in which some nodes in a data center fail. Disaster recovery covers the failure of a whole data center or Region, where the business must recover across data centers or even across geographic locations. Overload covers the business's flow control and degradation capability when load reaches or exceeds a certain multiple of service capacity. Dependency is generally divided into strong and weak dependencies. The general principle is that core businesses must not depend on non-core businesses: when a non-core business fails, the core business should degrade gracefully, and when a core business fails, the impact must be analyzed. FMEA plus five-dimensional fault analysis makes it possible to analyze business risks effectively.
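The five dimensions can be treated as a coverage checklist. The following is a minimal sketch of that idea, not Huawei Cloud's actual tooling: the `Dimension`, `FaultScenario`, and `uncovered_dimensions` names and the sample scenarios (including the 3x figure) are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five analysis dimensions named in the article."""
    REDUNDANCY = "redundancy"
    DISASTER_RECOVERY = "disaster recovery"
    OVERLOAD = "overload"
    DEPENDENCY = "dependency"
    DATA_BACKUP = "data backup"


@dataclass(frozen=True)
class FaultScenario:
    """One row of a (hypothetical) FMEA worksheet."""
    name: str
    dimension: Dimension
    expected_behavior: str


def uncovered_dimensions(scenarios):
    """Return the dimensions that no scenario covers yet, in declaration order."""
    covered = {s.dimension for s in scenarios}
    return [d for d in Dimension if d not in covered]


# Illustrative scenarios only; the values are invented for this example.
scenarios = [
    FaultScenario("one node in the data center fails", Dimension.REDUNDANCY,
                  "traffic shifts to healthy nodes"),
    FaultScenario("whole data center is lost", Dimension.DISASTER_RECOVERY,
                  "service fails over to another site"),
    FaultScenario("traffic reaches 3x capacity", Dimension.OVERLOAD,
                  "flow control and degradation kick in"),
    FaultScenario("non-core dependency is down", Dimension.DEPENDENCY,
                  "core business degrades gracefully"),
]

print([d.value for d in uncovered_dimensions(scenarios)])  # prints ['data backup']
```

A check like this only shows which dimensions have no scenario at all. Whether the scenarios inside a dimension are sufficient still depends on the FMEA analysis itself.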

Red-blue drills and production raids based on fault injection
Beyond routine chaos tests, chaos engineering at Huawei Cloud is also run as regular red-blue attack and defense drills and production raids. A red-blue drill usually takes a complete time window and divides people into two groups. One group is the blue army, which analyzes fault scenarios, designs the fault injection plan, and performs the injection, with the aim of creating a realistic fault. The other group is the red army, which must find and recover from the faults within a specified time. Management can be invited to take part in the red-blue confrontation, which raises the attention that team members and neighboring teams pay to chaos engineering. A production raid applies to fault scenarios whose testing is already very mature: the blue army injects the fault as a surprise attack without any notice. Compared with a red-blue drill, a production raid tests more comprehensively the service's alarm detection capability, personnel responsiveness, business resilience, and incident organization and handling capability.

Reliability maturity evaluation based on red, yellow, and green codes
The maturity of the failure scenarios identified through FMEA can be evaluated with chaos engineering. Huawei Cloud uses red, yellow, and green codes. A scenario that cannot yet be chaos tested is marked red. A scenario that can be chaos tested but whose monitoring or recovery capability is not up to standard is marked yellow. A scenario whose monitoring detection, response speed, and recovery speed all meet the standard is marked green. Assessing scenario maturity helps the business prioritize reliability risks and decide where to improve.
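The color-coding rule can be written as a small grading function. This is a sketch under stated assumptions: the `ScenarioResult` fields and the `grade` and `prioritize` helpers are hypothetical names, and the sketch treats any shortfall in detection, response, or recovery as yellow, which is one reading of the rule described above.

```python
from dataclasses import dataclass
from enum import Enum


class Code(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


@dataclass
class ScenarioResult:
    """Outcome of evaluating one fault scenario (hypothetical structure)."""
    name: str
    can_be_tested: bool         # can a chaos test be run for this scenario at all?
    detection_ok: bool = False  # monitoring found the fault within the target
    response_ok: bool = False   # people responded within the target
    recovery_ok: bool = False   # service recovered within the target


def grade(result):
    """Map one scenario result to a red, yellow, or green code."""
    if not result.can_be_tested:
        return Code.RED
    if result.detection_ok and result.response_ok and result.recovery_ok:
        return Code.GREEN
    return Code.YELLOW


def prioritize(results):
    """Order scenarios so the riskiest (red) come first, then yellow, then green."""
    order = {Code.RED: 0, Code.YELLOW: 1, Code.GREEN: 2}
    return sorted(results, key=lambda r: order[grade(r)])


# Illustrative data only.
results = [
    ScenarioResult("node failure", True, True, True, True),
    ScenarioResult("region failure", False),
    ScenarioResult("overload", True, True, False, True),
]
for r in prioritize(results):
    print(r.name, grade(r).value)
```

Sorting by code gives the business a simple worklist: make red scenarios testable first, then bring yellow scenarios up to standard.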

Build an automated chaos engineering platform
Chaos engineering is a very complex undertaking, and a tool platform that makes its execution efficient is essential. Huawei Cloud has built its own chaos engineering platform internally. The platform consists of four major modules: a fault injection module, a scenario orchestration and automation module, an audit and compliance module, and an operations module. The fault injection module is a client that provides a rich set of fault scenarios and can inject faults at many levels, including network devices, physical machines, virtual machines, containers, operating systems, protocols, middleware, databases, applications, and language runtimes. The scenario orchestration and automation module works with the CMDB so that fault injection can be run automatically through the O&M pipeline. The audit and compliance module integrates with the internal change system and the incident system and produces drill reports designed to comply with ISO 27001, so that the whole chaos testing process is handled online and meets certification and audit requirements. The operations module focuses on running red-blue drill activities and on drill records, and its multidimensional data analysis provides a basis for business decisions.
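One property a fault injection step usually needs is that the fault is always rolled back and that both the injection and the recovery are recorded for audit. The following is a minimal in-memory sketch of that pattern. It does not reflect the internals of Huawei Cloud's platform: `inject_fault`, the dictionary-based target, and the fault names are hypothetical stand-ins for a real injection client.

```python
from contextlib import contextmanager


@contextmanager
def inject_fault(target, fault, audit_log):
    """Apply `fault` to `target` for the duration of the block, then always roll back.

    `target` is a plain dict standing in for a real resource, and `audit_log`
    is a list standing in for a real audit system. Both are assumptions.
    """
    previous = target.get("fault")
    target["fault"] = fault
    audit_log.append(("inject", target["name"], fault))
    try:
        yield target
    finally:
        # Restore the earlier state even if the drill code raised an exception.
        if previous is None:
            target.pop("fault", None)
        else:
            target["fault"] = previous
        audit_log.append(("recover", target["name"], fault))


audit_log = []
node = {"name": "vm-01"}  # made-up resource name

with inject_fault(node, "network-delay", audit_log):
    # While the block runs, the simulated fault is active on the node.
    print("during drill:", node)

print("after drill:", node)
print(audit_log)
```

Putting the rollback in a `finally` clause means the drill cannot leave a fault behind when verification code fails halfway, and the paired log entries give an audit trail for each injection.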

Summary
Huawei Cloud attaches great importance to the role of chaos engineering in reliability. From its founding, its SRE team has prepared the ground for chaos engineering in three areas: strategic planning, organization building, and environment construction. It has distilled FMEA and five-dimensional fault analysis, which help the business clarify what to test. It drives implementation through concrete practices such as red-blue drills, production raids, and red, yellow, and green codes. It has also built a feature-rich platform that gives chaos engineering a convenient tool to work with. Together, these measures have allowed chaos engineering to be implemented successfully and to play an important role in Huawei Cloud's reliability work.