SRE is bound to move into the era of chaos engineering: Huawei Cloud's chaos engineering practice
2022-06-22 19:19:00 [Huawei Cloud]
Author: Li Jun
In the digital age, IT systems carry the business, so their reliability directly affects business continuity. Large Internet and cloud computing companies set up a dedicated department responsible for reliability, known in the industry as SRE (Site Reliability Engineering).

To improve system reliability, SRE works in two directions: proactive prevention and reactive response, with proactive prevention taking up more than 80% of the time. After years of practice, SRE teams have found chaos engineering to be a very effective means of proactive prevention, and it has in fact been rolled out at scale in many Internet and cloud computing companies at the cutting edge of technology. At Huawei Cloud, chaos engineering has become part of an SRE engineer's daily work, just like development and testing. Chaos engineering simulates possible faults in advance and then comprehensively verifies a service's reliability capabilities under different fault scenarios: fault tolerance, monitoring, personnel response, and recovery. Practiced continuously, it drives improvements in product reliability.

Huawei Cloud chaos engineering best practices
Treat chaos engineering as a strategic priority
Providing users with stable and reliable cloud services is one of Huawei Cloud's strategies, and chaos engineering is the core means of reliability testing, so Huawei Cloud takes a long-term view of it. When the business was first established, a five-year plan was drawn up for chaos engineering, giving it a clear development blueprint. This is complemented by fine-grained annual plans, so that planning guides development and chaos engineering capability keeps improving.
Back chaos engineering with organization and culture
The second factor in implementing chaos engineering effectively at Huawei Cloud is clear organizational backing. Huawei Cloud has set up a dedicated blue army team responsible for chaos engineering, and this independent organization keeps the work sustainable. At the same time, performance appraisal and job qualification include requirements related to chaos engineering, which pulls SREs into actively carrying it out. So beyond strategic priority, organizational design and performance design must go hand in hand to create fertile soil for chaos engineering.

Build a standard drill environment
The third important condition for implementing chaos engineering is a drill environment that is close to production. In essence, chaos engineering tests the reliability of the system. Unlike functional tests in the R&D phase, these tests often need a certain amount of business traffic and an environment built like the production system in order to expose potential problems. Most companies have a grayscale environment: a new version goes through the grayscale environment first, chaos tests can be run for a period of time during the rollout, and the version is pushed to production only after it passes.

Fault analysis based on FMEA and the five-dimension method
To test effectively with chaos engineering, Huawei Cloud has settled on FMEA plus five-dimension fault scenario analysis. FMEA stands for Failure Mode and Effects Analysis, and it gives SREs an effective guide for analyzing scenarios. On top of FMEA, Huawei Cloud has summarized a five-dimension fault analysis method: analyzing a service from the five perspectives of redundancy, disaster recovery, overload, dependency, and data backup is intended to cover its fault scenarios completely. Redundancy covers scenarios in which some nodes in a machine room fail. Disaster recovery covers the failure of an entire machine room or Region, where the service should recover across machine rooms or even across geographic locations. Overload covers the service's flow control and degradation capability when load reaches or exceeds a certain multiple of its capacity. Dependencies are generally divided into strong and weak. The general principle is that core services must not depend on non-core services: when a non-core service fails, the core service should degrade gracefully, and when a core service fails, the impact should be analyzed. FMEA combined with five-dimension fault analysis makes it possible to analyze business risks effectively.
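The five-dimension method can be read as a coverage checklist over an FMEA scenario list. The following is a minimal sketch of that idea, not Huawei Cloud's tooling: the `Dimension` enum, the `FaultScenario` fields, and the sample scenarios are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The five analysis dimensions named in the text."""
    REDUNDANCY = "redundancy"
    DISASTER_RECOVERY = "disaster recovery"
    OVERLOAD = "overload"
    DEPENDENCY = "dependency"
    DATA_BACKUP = "data backup"


@dataclass(frozen=True)
class FaultScenario:
    """One FMEA-style entry (field names are assumptions)."""
    name: str
    dimension: Dimension
    failure_mode: str     # what breaks
    expected_effect: str  # how the service is expected to react


def uncovered_dimensions(scenarios):
    """Return dimensions with no scenario yet, in declaration order."""
    covered = {s.dimension for s in scenarios}
    return [d for d in Dimension if d not in covered]


# Hypothetical scenario list for one service.
scenarios = [
    FaultScenario("node-crash", Dimension.REDUNDANCY,
                  "one node in a machine room fails",
                  "traffic shifts to healthy nodes"),
    FaultScenario("room-outage", Dimension.DISASTER_RECOVERY,
                  "an entire machine room fails",
                  "service recovers in another machine room"),
    FaultScenario("traffic-3x", Dimension.OVERLOAD,
                  "load reaches three times capacity",
                  "flow control rejects excess requests"),
]

missing = uncovered_dimensions(scenarios)
print([d.value for d in missing])  # → ['dependency', 'data backup']
```

A check like this only shows that each dimension has at least one scenario. Whether the scenarios within a dimension are sufficient still depends on the FMEA analysis itself.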

Red-blue attack and defense and production raids based on fault injection
In terms of operating mode, chaos engineering at Huawei Cloud goes beyond routine chaos tests to include regular red-blue attack-and-defense exercises and production raids. A red-blue exercise usually takes a complete time window and divides people into two groups. One group is the blue army, which analyzes fault scenarios, designs the fault injection plan, and performs the injection, with the aim of creating a realistic fault. The other group is the red army, which must find and recover from the fault within the specified time. Management can be invited to take part in the red-blue confrontation, which raises the attention that team members and neighboring teams pay to chaos engineering. A production raid applies to fault scenarios that have already been tested to maturity: the blue army injects the fault without any prior notice, as a surprise attack. Compared with a red-blue exercise, a production raid tests more thoroughly a service's alarm detection, personnel response, business recovery, and incident organization and handling capabilities.

Reliability maturity evaluation based on red, yellow, and green codes
The maturity of each fault scenario identified through FMEA can be evaluated with chaos engineering. Huawei Cloud uses red, yellow, and green codes. A scenario that cannot yet be chaos tested is marked red. A scenario that can be chaos tested but whose monitoring and recovery capabilities are not up to standard is marked yellow. A scenario whose monitoring detection, response speed, and recovery speed all meet the standard is marked green. Assessing the maturity of each scenario helps the business prioritize reliability risks and decide what to improve.
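The red, yellow, and green rules above amount to a small decision function. The sketch below is one possible reading of them under stated assumptions: the `ScenarioStatus` fields and the boolean "up to standard" flags are hypothetical, since the article does not say how the standards are measured.

```python
from dataclasses import dataclass
from enum import Enum


class Code(Enum):
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


@dataclass(frozen=True)
class ScenarioStatus:
    """Drill results for one fault scenario (field names are assumptions)."""
    name: str
    can_run_chaos_test: bool
    detection_ok: bool  # monitoring found the fault in time
    response_ok: bool   # people responded in time
    recovery_ok: bool   # service recovered in time


def maturity_code(status):
    """Map a scenario's status to a red, yellow, or green code."""
    if not status.can_run_chaos_test:
        return Code.RED
    if status.detection_ok and status.response_ok and status.recovery_ok:
        return Code.GREEN
    return Code.YELLOW


def prioritize(statuses):
    """Sort scenarios so red comes first, then yellow, then green."""
    order = {Code.RED: 0, Code.YELLOW: 1, Code.GREEN: 2}
    return sorted(statuses, key=lambda s: order[maturity_code(s)])


# Hypothetical drill results.
statuses = [
    ScenarioStatus("node-crash", True, True, True, True),
    ScenarioStatus("room-outage", False, False, False, False),
    ScenarioStatus("traffic-3x", True, True, False, True),
]

for s in prioritize(statuses):
    print(s.name, maturity_code(s).value)
# → room-outage red
# → traffic-3x yellow
# → node-crash green
```

Sorting by code gives the business a simple worklist: red scenarios need the ability to test at all, yellow scenarios need monitoring, response, or recovery work.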

Build an automated chaos engineering platform
Chaos engineering is a very complex undertaking, and a tool platform that makes its execution efficient is essential. Huawei Cloud has built its own chaos engineering platform internally. The platform has four major modules: fault injection, scenario orchestration and automation, audit and compliance, and operations. The fault injection module is a client that provides a rich set of fault scenarios and can inject faults at many levels, including network devices, physical machines, virtual machines, containers, operating systems, protocols, middleware, databases, applications, and language runtimes. The scenario orchestration and automation module works with the CMDB and can run fault injection automatically through the O&M pipeline. The audit and compliance module is integrated with the internal change system and incident system and produces drill reports designed according to ISO 27001, so that the whole chaos testing process is handled online and complies with certification and audit requirements. The operations module mainly covers the running of red-blue attack-and-defense activities and drill records, with multi-dimensional data analysis that gives the business a basis for decisions.
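Two properties described for the platform, fault injection that is always reverted and a record of every drill for audit, can be sketched with a context manager. This is an illustrative toy under stated assumptions, not the platform's API: `injected_fault`, `audit_log`, and the in-memory `state` target are hypothetical stand-ins.

```python
from contextlib import contextmanager
from datetime import datetime, timezone

audit_log = []  # hypothetical stand-in for an audit record store


def record(event, fault_name):
    """Append a timestamped audit entry."""
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "fault": fault_name,
    })


@contextmanager
def injected_fault(name, inject, revert):
    """Inject a fault, then revert it even if the drill body raises."""
    record("inject", name)
    inject()
    try:
        yield
    finally:
        revert()
        record("revert", name)


# Toy target: a flag standing in for a real dependency.
state = {"db_reachable": True}


def cut_db():
    state["db_reachable"] = False


def restore_db():
    state["db_reachable"] = True


with injected_fault("db-unreachable", cut_db, restore_db):
    observed_during_drill = state["db_reachable"]

print(observed_during_drill, state["db_reachable"])  # → False True
print([e["event"] for e in audit_log])  # → ['inject', 'revert']
```

Putting the revert in a `finally` block means an aborted drill still restores the target, which matters most when faults are injected into environments carrying real traffic.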

Summary
Huawei Cloud attaches great importance to the role of chaos engineering in reliability. From its establishment, its SRE team prepared the ground for chaos engineering in three respects: strategic planning, organization building, and environment construction. It distilled FMEA and five-dimension fault analysis, which help the business clarify what to test. It drives implementation through concrete actions such as red-blue confrontation, production raids, and red, yellow, and green codes. And it has built a feature-rich platform that provides convenient tooling for carrying out chaos engineering effectively. Together, these measures have allowed chaos engineering to be implemented successfully and to play an important role in building Huawei Cloud's reliability.