Huawei Cloud "Digital Intelligence" Operations and Maintenance

2022-06-22 17:56:00 [Huawei Cloud]

Author: Wang Feng

To support the rapid growth of its cloud business, Huawei Cloud built its operations and maintenance (O&M) system in three stages. From 2016 to 2017, O&M was tool-based: a small server fleet was maintained with a scattered set of small tools, but as the business grew rapidly, tools alone could no longer keep up. From 2018 to 2019, the team built an O&M automation platform, established a scenario-based automated O&M system, and began putting AIOps capabilities into production. From 2020 to the present, an AI-powered intelligent O&M platform has been applied in multiple high-value scenarios of O&M activities, marking the move into intelligent O&M.

The industry divides intelligent O&M into stages L1 through L5. Taking server scale as the yardstick: with fewer than 10 servers, O&M can rely on simple expert experience, scripts, and manual work. At around 100 servers, multiple independent tools turn most of the work into tooling and process, which basically meets O&M needs. But when the fleet grows to hundreds of thousands or millions of servers and O&M headcount cannot grow as fast as the scale, efficiency, quality, and cost have to be improved through data and intelligent means. In the DevOps stage, the focus is on implementing single-point intelligent capabilities and then chaining several of them together through data association, to achieve a high degree of automation in some scenarios. In the AIOps stage, intelligent means are applied comprehensively to improve quality, efficiency, and cost, for example AI-based analysis and decision-making, unattended changes, and data visualization that assists intelligent decisions.
At each stage, the goal is to increase the number of servers maintained per person. The higher the stage, the more decisions and their execution depend on system automation and intelligence, and the less they depend on people.

Gartner's yearly AI maturity curve shows how AIOps platforms have evolved: from the early innovation stage in 2017 to, in 2021, the trough stage that precedes maturity. Gartner forecasts that they will reach maturity in 2 to 5 years. The yearly reports also show that Gartner's AIOps research in 2021 described more detailed application scenarios than in 2017. According to a Deloitte research report, the top five AIOps scenarios are intelligent alarming, root cause analysis, anomaly detection, capacity optimization, and fault self-healing.

AIOps implementation strategy
Huawei Cloud's AIOps implementation strategy centers on three things: organization, data, and platform.
· Organization: users, the product team, and the technical team jointly form the AIOps implementation project team. The team defines clear project objectives for each value scenario and develops a feasible technical solution, then iterates continuously on production usage and feedback until the intended business value is reached.
· Data: the data quality of a scenario directly affects the final result. Complete data therefore needs to be collected around the scenario; samples are accumulated from business processes and cases to meet the needs of algorithm research; and data governance standardizes how the data is stored and managed.
· AI platform: MLOps capabilities built on the AIOps platform make scenario delivery more efficient. The platform supports the organization in using data to bring AIOps scenarios into production, and in continuously optimizing and iterating based on production feedback and model monitoring.
So which scenarios are suitable for AIOps, and what do they have in common? We have summarized a few points:
· Using data to improve on the accuracy of human judgment;
· Mining hidden relationships in the data based on known events;
· Extrapolating current data from historical data;
· Automatic analysis and decision support based on data;
· Forecasting the future from historical data and experience.
We also divide the application process into three stages. First, SREs raise pain points from business requirements, analyze them quantitatively, turn them into requirements, and identify the corresponding case data. Next, data scientists analyze the data features and develop the algorithm model. Finally, the product team turns the algorithm model into a product.
We plan intelligent O&M as a whole across five parts: value, scenarios, technical solutions, platform algorithms, and data. Most of the important scenarios, such as fault discovery, fault localization, root cause analysis, fault avoidance, intelligent change, intelligent customer service, and intelligent scheduling, have already been put into production. It is worth mentioning that Huawei Cloud built its AIOps platform for intelligent O&M scenarios on top of ModelArts, and uses the platform's capabilities to speed up scenario development and delivery.

AIOps capability building
The following describes the AIOps capabilities that apply across the fault life cycle:
Anomaly detection
Too many alarms and low alarm accuracy have always been the biggest headache for O&M staff. Through anomaly detection we aim for adaptive, maintenance-free alarming, which addresses the pain point that traditional static thresholds cannot alarm accurately.
Adaptive means that detection automatically adapts to the characteristics of different metrics and automatically recognizes periodic metrics, so that alarms are not disturbed by seasonal variation. Maintenance-free means that algorithm engineers do not need to tune and configure parameters by hand: intelligent parameter tuning takes care of model parameters that O&M staff cannot configure themselves. In addition, model compression greatly reduces the resource cost of training the model.
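To make the contrast with a static threshold concrete, here is a minimal sketch of period-aware detection. It is not Huawei Cloud's algorithm: the function name `seasonal_anomalies`, the parameters `period` and `k`, and the sample series are all assumptions made for illustration, and the period is supplied by hand here rather than detected automatically.

```python
from statistics import mean, stdev


def seasonal_anomalies(values, period, k=3.0):
    """Return indices of points that deviate from their seasonal baseline.

    Each point is compared with earlier points at the same phase of the
    cycle (same index modulo `period`), so a regular peak is not reported
    the way a single static threshold would report it.
    """
    anomalies = []
    for i, value in enumerate(values):
        # History at the same phase of the cycle, strictly before point i.
        history = values[i % period:i:period]
        if len(history) < 3:
            continue  # not enough history to estimate a baseline yet
        baseline = mean(history)
        # Floor the spread so a perfectly flat history does not flag tiny noise.
        spread = max(stdev(history), 0.05 * abs(baseline), 1e-9)
        if abs(value - baseline) > k * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical metric with a period of 4 samples and a regular peak of 50.
series = [10, 12, 50, 11] * 5
series[18] = 90  # inject a spike on top of the regular peak
print(seasonal_anomalies(series, period=4))  # prints [18]
```

A static threshold of, say, 40 would fire on every regular peak of 50 in this series, while the per-phase baseline reports only the injected spike.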

Intelligent alarm
How is alarm noise reduction achieved? First, alarms are classified: an algorithm automatically clusters them into continuous alarms, fluctuating alarms, and cause-and-effect alarms, and each class is then matched with a different compression scheme. The commonly used FP-growth algorithm can mine frequent relationships among related alarms; combining pattern mining with sliding-window detection achieves alarm noise reduction. For more accurate alarm compression, topology data is also brought in to further identify root-cause alarms and make fault handling more efficient.
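The sliding-window idea can be sketched as follows. This is a deliberately simplified stand-in, not FP-growth itself: it only counts pairs of alarm types that fire close together in time, whereas FP-growth mines frequent itemsets of any size. The function name, the `(timestamp, alarm_type)` tuple layout, and the alarm names are assumptions for illustration.

```python
from collections import Counter


def frequent_alarm_pairs(alarms, window_seconds, min_support):
    """Count alarm-type pairs that fire within `window_seconds` of each other.

    `alarms` is a list of (timestamp_seconds, alarm_type) tuples. Pairs seen
    at least `min_support` times are returned as candidates for compression
    into a single notification.
    """
    alarms = sorted(alarms)
    counts = Counter()
    start = 0
    for end, (ts, alarm_type) in enumerate(alarms):
        # Slide the left edge so the window covers only the last window_seconds.
        while alarms[start][0] < ts - window_seconds:
            start += 1
        # Pair the newest alarm with each distinct earlier type in the window.
        seen = {a for _, a in alarms[start:end] if a != alarm_type}
        for other in seen:
            counts[tuple(sorted((alarm_type, other)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical alarm stream: disk_full is followed by db_write_fail each time.
stream = [
    (0, "disk_full"), (5, "db_write_fail"),
    (100, "disk_full"), (103, "db_write_fail"),
    (200, "disk_full"), (204, "db_write_fail"),
    (300, "cpu_high"),
]
print(frequent_alarm_pairs(stream, window_seconds=10, min_support=3))
# prints {('db_write_fail', 'disk_full'): 3}
```

Deciding which alarm in a frequent pair is the root cause is a separate step; as the text notes, that is where topology data comes in.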

Intelligent fault location
The multi-metric localization algorithm identifies the correlated metrics that caused a fault; with these metrics, SREs can quickly delimit the fault and recover from it rapidly. Log localization first extracts log templates, then detects abnormal templates to find the log error messages related to the faulty node, which cuts log analysis time. Combining metrics, logs, and call chains enables root cause localization across multiple data sources; this method works from the requester and the business scenario of the call-chain operation.
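The log localization step can be illustrated with a minimal sketch. It assumes a very simple template extractor that masks numbers and hex values with a regular expression; production systems typically use dedicated log parsers, and the function names and log lines below are hypothetical.

```python
import re
from collections import Counter

# Mask tokens that usually vary between occurrences of the same log statement:
# hex values such as 0x1f, and numbers (including dotted ones such as IPs).
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(line):
    """Replace variable tokens with <*> so similar lines share one template."""
    return _VARIABLE.sub("<*>", line)


def rare_templates(baseline_lines, incident_lines, max_baseline_count=0):
    """Return templates seen during the incident that are rare in the baseline."""
    baseline = Counter(to_template(line) for line in baseline_lines)
    incident = Counter(to_template(line) for line in incident_lines)
    return {t: n for t, n in incident.items() if baseline[t] <= max_baseline_count}


# Hypothetical logs: normal traffic as the baseline, then an incident window.
normal = ["request 1 served in 20 ms", "request 2 served in 31 ms"]
incident = [
    "request 3 served in 25 ms",
    "disk 4 read error at 0x1f",
    "disk 7 read error at 0xa0",
]
print(rare_templates(normal, incident))
# prints {'disk <*> read error at <*>': 2}
```

Only the template that never appeared under normal operation is surfaced, which is what shortens the manual log review.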

Intelligent fault self-healing
Fault self-healing means completing fault isolation and recovery automatically, with no manual intervention. This scenario has significant limitations, however. Its core capabilities include the following:
· Automatic drive: the process is driven by faults from multiple sources and adapts to various fault scenarios.
· Intelligent diagnosis: diagnoses the factors that may have induced the fault and determines its root cause.
· Rapid self-healing: based on the diagnosis result, handles the fault automatically within minutes and restores the customer's business.
· Safe and reliable: provides flow control, baseline scenarios, and a grayscale mechanism to prevent avalanches.
Take automatic hardware fault diagnosis and self-healing as an example. When the AIOps system predicts that a memory fault is about to bring a host down, the self-healing platform receives the prediction alarm, starts its diagnosis mechanism, and determines and executes the corresponding self-healing action. When the self-healing process is short, the impact on customers is very small, and customers may not notice anything at all.
With the current hardware self-healing capability, hardware fault diagnosis and automated handling complete at the 5-minute level (only 5 minutes from alarm report to fault recovery), which greatly reduces the impact of faults on customers' business. But self-healing does not always apply: the self-healing process is triggered only when every condition in the logic from fault discovery to handling is satisfied.
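The "trigger only when every condition is satisfied, and never too many at once" behaviour can be sketched as a small gate. This is an illustrative assumption about how such a gate might look, not the self-healing platform's actual design; the class name, the check names, and the limits are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SelfHealingGate:
    """Decide whether an automatic recovery action may run."""

    max_actions_per_window: int
    window_seconds: float
    _recent: list = field(default_factory=list)

    def allow(self, now, checks):
        """`checks` maps a precondition name to True or False."""
        # Forget actions that have left the flow-control window.
        self._recent = [t for t in self._recent if now - t < self.window_seconds]
        if not checks or not all(checks.values()):
            return False  # a precondition failed: leave the fault to a human
        if len(self._recent) >= self.max_actions_per_window:
            return False  # flow control: too many recent actions, avoid an avalanche
        self._recent.append(now)
        return True


# Hypothetical preconditions gathered by the diagnosis step.
ok = {"root_cause_confirmed": True, "spare_capacity_available": True}
gate = SelfHealingGate(max_actions_per_window=2, window_seconds=300)
print(gate.allow(0, ok), gate.allow(10, ok), gate.allow(20, ok), gate.allow(400, ok))
# prints True True False True
```

The third request is rejected because two actions already ran inside the 300-second window; once they age out, healing is allowed again.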

The above is Huawei Cloud's AIOps practice across the fault life cycle. From it we have summarized four main lessons:

· Data first: data quality is a necessary condition for a successful AIOps rollout. Model quality and effectiveness are determined by sample data and production feedback, and only with complete data can effective features be found in the feature engineering stage.
· Engineering is as important as the algorithm: the difficulty and importance of engineering should not be underestimated, and problems the algorithm cannot solve have to be compensated for in the engineering design. For example, some memory fault scenarios cannot be predicted and are covered by engineering means. The running model's performance also needs to be monitored continuously in practice, so that degradation is found in time and optimized.
· Production availability matters more than the algorithm's technical metrics: the overall availability after the algorithm is integrated into the product has to be considered. Production is not a laboratory, and product quality and stability after launch affect how widely AI technology is promoted and adopted.
· The cost of running the algorithm has to be considered: algorithm efficiency and the data scale for inference need to be fully evaluated, because together they determine the resource cost of the application.
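As one way to picture the continuous model monitoring mentioned in the lessons above, the sketch below tracks alarm precision over a rolling window of production feedback. The class name, window size, and threshold are assumptions for illustration only.

```python
from collections import deque


class PrecisionMonitor:
    """Track the share of alarms confirmed as real over a rolling window."""

    def __init__(self, window, min_precision):
        self.feedback = deque(maxlen=window)  # oldest feedback drops out automatically
        self.min_precision = min_precision

    def record(self, was_true_alarm):
        self.feedback.append(bool(was_true_alarm))

    def precision(self):
        if not self.feedback:
            return None  # no feedback yet
        return sum(self.feedback) / len(self.feedback)

    def degraded(self):
        # Judge only once the window is full, to avoid reacting to a few samples.
        full = len(self.feedback) == self.feedback.maxlen
        return full and self.precision() < self.min_precision


# Hypothetical feedback from O&M staff: the last two alarms were false alarms.
monitor = PrecisionMonitor(window=4, min_precision=0.75)
for confirmed in [True, True, True, True, False, False]:
    monitor.record(confirmed)
print(monitor.precision(), monitor.degraded())  # prints 0.5 True
```

A degradation signal like this would prompt retraining or a review of the model, rather than changing anything automatically.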

Finally, we hope our practical experience helps those who are implementing, or are about to implement, AIOps. Huawei will remain committed to bringing the digital world to everyone and building an intelligent world in which everything is connected.

