
Huawei Cloud "digital intelligence" operations and maintenance

2022-06-22 17:56:00 Hua Weiyun

Author: Wang Feng

        To support the rapid growth of its cloud business, Huawei Cloud built its operations and maintenance (O&M) system in three stages. In 2016-2017, O&M was tool-based: a set of small, separately maintained tools was enough for a small server fleet, but as the business grew rapidly, tools alone could no longer meet the need. In 2018-2019, Huawei Cloud built an O&M automation platform and a scenario-based automated O&M system, and began putting AIOps capabilities into production. Since 2020, an AI-powered intelligent O&M platform has been applied to multiple high-value scenarios in O&M activities, marking the move into intelligent O&M.


        The industry divides intelligent O&M into stages L1 to L5, using server scale as the yardstick. With fewer than 10 servers, simple expert experience, scripts, and manual work are enough. At around 100 servers, several independent tools that turn most of the work into tooling and defined processes can basically meet O&M needs. But when the fleet grows to 100,000 or even millions of servers and O&M headcount cannot grow at the same pace, efficiency, quality, and cost have to be improved through data and intelligent techniques. The DevOps stage focuses on putting single-point intelligent capabilities into production, then chaining several of them together through data association to reach a high degree of automation in some scenarios. The AIOps stage applies intelligent techniques across quality, efficiency, and cost, for example AI-based analysis and decision-making, unattended changes, and data visualization that assists decisions.
      At every stage, the goal is to raise the number of servers each O&M engineer can manage. The higher the stage, the more decisions and their execution rely on system automation and intelligence, and the less they depend on people.


      Gartner's annual AI maturity curve (hype cycle) shows how AIOps platforms have developed: from the innovation trigger phase in 2017 to, by 2021, the trough that precedes the maturity phase. Gartner forecasts that AIOps will reach maturity within 2-5 years. Its annual reports also show that the AIOps research directions in 2021 describe more detailed adoption scenarios than those in 2017. According to Deloitte's research report, the top five AIOps scenarios are intelligent alarming, root cause analysis, anomaly detection, capacity optimization, and fault self-healing.


      AIOps adoption strategy
      Huawei Cloud's AIOps adoption strategy rests on three things: organization, data, and platform.
      · Organization: Users, the product team, and the technical team together form the AIOps adoption project team. The team sets clear project goals for each value scenario, develops a feasible technical solution, and iterates continuously on use and feedback from the production environment until the business value is realized.
      · Data: The data quality of an application scenario directly determines the final result, so complete data must be collected around the scenario. Samples are accumulated through business processes and cases to support algorithm research, and data governance standardizes how the data is stored and managed.
      · AI platform: MLOps capabilities are built on the AIOps platform to make scenario adoption more efficient. The platform supports the organization in using data to put AIOps scenarios into production, and in continuously optimizing and iterating on them through production feedback and model monitoring.
      So which scenarios are suited to AIOps, and what do they have in common? We have summarized a few points:
      · Using data to address the limited accuracy of human judgment;
      · Mining hidden relationships in data based on known events;
      · Extrapolating current data from historical data;
      · Analyzing automatically and assisting decisions based on data;
      · Forecasting the future from historical data and experience.
      We also divide the application process into three stages. First, SREs raise business pain points, which are analyzed quantitatively and converted into requirements, and the corresponding case data is identified. Next, data scientists analyze the data features and develop the algorithm model. Finally, the production team turns the algorithm model into a product.
      We plan intelligent O&M as a whole across five parts: value, scenarios, technical solutions, platform algorithms, and data. Most of the important scenarios, such as fault detection, fault location, root cause analysis, fault avoidance, intelligent change, intelligent customer service, and intelligent scheduling, are already in production. Notably, Huawei Cloud has built an AIOps platform for intelligent O&M scenarios on top of ModelArts, and uses the platform's capabilities to speed up scenario development and adoption.


      AIOps capability building
      The following sections describe the AIOps capabilities that apply across the fault life cycle:
      Anomaly detection
      Too many alarms and low alarm accuracy have long been the biggest headache for O&M engineers. With anomaly detection we aim for adaptive, maintenance-free alarming, addressing the pain point that traditional static thresholds cannot alarm accurately.
Adaptive means that detection adjusts automatically to the characteristics of each metric and recognizes periodic metrics, so that alarms are not disturbed by seasonal changes. Maintenance-free means that algorithm engineers do not need to tune or configure parameters by hand; intelligent parameter tuning handles the model parameters that O&M engineers cannot configure themselves. In addition, compressing the algorithm model greatly reduces the resource cost of model training.
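
      As a rough illustration of why a season-aware baseline behaves differently from a static threshold, here is a minimal Python sketch. It is not Huawei Cloud's algorithm: the function name, the parameters, and the choice of a same-phase median with a median-absolute-deviation tolerance are all assumptions made for the example, and the cycle length is passed in by hand rather than detected automatically.

```python
from statistics import median


def seasonal_anomalies(values, period, k=3.0, min_spread=1.0, min_history=3):
    """Return indices whose value deviates from the same-phase history.

    For each point, the baseline is the median of earlier points at the same
    position in the cycle, and the tolerance is k times their median absolute
    deviation (MAD), floored at min_spread. All names are hypothetical.
    """
    anomalies = []
    for i, value in enumerate(values):
        # Earlier observations one, two, ... full periods before index i.
        history = values[i % period:i:period]
        if len(history) < min_history:
            continue  # not enough cycles seen yet to judge this phase
        baseline = median(history)
        mad = median(abs(h - baseline) for h in history)
        if abs(value - baseline) > k * max(mad, min_spread):
            anomalies.append(i)
    return anomalies


# A metric with a 4-sample cycle that peaks at 30 in every cycle.
series = [10, 20, 30, 20] * 5
series[17] = 60  # inject a spike at a point that is normally 20
print(seasonal_anomalies(series, period=4))  # [17]
```

      On this series a static threshold of 25 would fire at every regular peak of 30, while the same-phase baseline flags only the injected spike.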


      Intelligent alarm
      How is alarm noise reduction achieved? Alarms are first classified: algorithms automatically cluster continuous alarms, fluctuating alarms, and cause-and-effect alarms, and each class is then matched with a suitable compression algorithm. The commonly used FP-growth algorithm mines frequent relationships among related alarms; combining this pattern mining with sliding-window detection reduces alarm noise. For more accurate alarm compression, topology data is also brought in to further identify root-cause alarms and make fault handling more efficient.
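
      The idea of mining alarms that frequently occur together can be sketched in a few lines of Python. This is deliberately not FP-growth: it counts only pairs, and it uses fixed (non-overlapping) time windows as a simplification of a sliding window. The function names, the alarm names, and the event format are hypothetical.

```python
from collections import Counter
from itertools import combinations


def alarm_transactions(events, window):
    """Group (timestamp_seconds, alarm_name) events into fixed time windows."""
    buckets = {}
    for ts, name in events:
        buckets.setdefault(ts // window, set()).add(name)
    return [buckets[key] for key in sorted(buckets)]


def frequent_alarm_pairs(events, window, min_support):
    """Count alarm pairs seen in the same window; keep those seen often enough."""
    counts = Counter()
    for names in alarm_transactions(events, window):
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(names), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


events = [
    (0, "disk_io_high"), (5, "db_latency_high"), (9, "api_5xx"),
    (60, "disk_io_high"), (62, "db_latency_high"),
    (120, "disk_io_high"), (125, "db_latency_high"), (130, "cpu_high"),
    (180, "api_5xx"),
]
print(frequent_alarm_pairs(events, window=60, min_support=3))
# {('db_latency_high', 'disk_io_high'): 3}
```

      Pairs that pass the support threshold are candidates for being merged into a single notification; deciding which alarm in a group is the root cause would still need the topology data mentioned above.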

      Intelligent fault location
      Multi-metric location algorithms can accurately identify the correlated metrics that caused a fault; SREs can use these metrics to delimit the fault quickly and recover faster. Log-based location first extracts log templates, then detects abnormal templates to pinpoint the error messages logged by the faulty node, which shortens log analysis. Combining metrics, logs, and call chains enables root cause localization across multiple data sources; this approach applies to business scenarios that operate through requesters and call links.
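
      The log-based step (extract templates, then look for abnormal ones) can be illustrated with a minimal Python sketch. The masking rule used here, replacing any whitespace-separated token that contains a digit, is a crude assumption chosen for brevity; production log parsers are considerably more careful. Function names and log lines are hypothetical.

```python
from collections import Counter


def to_template(line):
    """Mask variable-looking tokens (any token containing a digit) with <*>."""
    return " ".join(
        "<*>" if any(ch.isdigit() for ch in token) else token
        for token in line.split()
    )


def new_templates(baseline_lines, incident_lines):
    """Return (template, count) pairs seen during the incident but not before."""
    known = {to_template(line) for line in baseline_lines}
    counts = Counter(to_template(line) for line in incident_lines)
    return [(tpl, n) for tpl, n in counts.most_common() if tpl not in known]


baseline = [
    "request 101 served in 12 ms",
    "request 102 served in 15 ms",
    "heartbeat ok from node-3",
]
incident = [
    "request 201 served in 14 ms",
    "connection to 10.0.0.5 refused",
    "connection to 10.0.0.7 refused",
    "heartbeat ok from node-4",
]
print(new_templates(baseline, incident))
# [('connection to <*> refused', 2)]
```

      Only templates absent from the baseline are reported, so an engineer reads one line instead of scanning every raw log entry.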



      Intelligent fault self-healing
      Fault self-healing means completing fault isolation and recovery automatically, without manual intervention. The scenario has significant limitations, however. Its core capabilities are:
      · Automatic driving: fault processes are driven from multiple sources, adapting to a variety of fault scenarios.
      · Intelligent diagnosis: the factors that may have induced the fault are diagnosed to determine the root cause.
      · Rapid self-healing: based on the diagnosis, faults are handled automatically within minutes to restore customer services.
      · Safe and reliable: flow control, baseline scenarios, and a grayscale (gradual rollout) mechanism prevent avalanche failures.
      Take automatic hardware fault diagnosis and self-healing as an example. When the AIOps system predicts that a memory fault is about to bring down a host, the self-healing platform receives the prediction alarm, starts its diagnosis mechanism, and determines and executes the appropriate self-healing action. When the self-healing process is short, the impact on customers is very small, and they may not notice it at all.
With the current hardware fault self-healing capability, hardware faults can be diagnosed and handled automatically within about 5 minutes (measured from alarm reporting to fault recovery), greatly reducing the impact of faults on customers' services. Self-healing does not always apply, though: the self-healing process is triggered only when every condition in the logic from fault discovery to handling is satisfied.
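
      The two safeguards described above, triggering only when every condition holds and limiting how many actions run at once, can be sketched as a small gate in Python. This is an illustrative assumption, not the self-healing platform's actual logic; the class name, the check names, and the limits are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SelfHealingGate:
    """Allow an action only if all checks pass and the rate limit has room."""

    max_actions: int      # flow control: actions allowed per time window
    window_seconds: int   # length of the flow-control window
    _recent: list = field(default_factory=list, init=False)

    def allow(self, now, checks):
        # Every precondition must hold; otherwise leave the fault to a human.
        if not all(checks.values()):
            return False
        # Forget actions that have fallen out of the flow-control window.
        self._recent = [t for t in self._recent if now - t < self.window_seconds]
        if len(self._recent) >= self.max_actions:
            return False  # too many recent actions: guard against an avalanche
        self._recent.append(now)
        return True


gate = SelfHealingGate(max_actions=2, window_seconds=300)
passed = {"diagnosis_confirmed": True, "standby_capacity_available": True}
failed = {"diagnosis_confirmed": True, "standby_capacity_available": False}
print(gate.allow(0, passed))   # True
print(gate.allow(10, failed))  # False: one precondition is not met
print(gate.allow(20, passed))  # True
print(gate.allow(30, passed))  # False: rate limit reached
```

      Keeping the gate separate from the recovery action itself means the same limits can be applied to every fault type.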


      That is Huawei Cloud's AIOps practice across the fault life cycle. Along the way we have drawn four main lessons:

      · Data first: Data quality is a necessary condition for AIOps to succeed. Model quality and effectiveness are determined by the sample data and production feedback, and only complete data lets effective features be found during feature engineering.
      · Engineering is as important as algorithms: The difficulty and importance of engineering should not be underestimated. Problems the algorithm cannot solve should be compensated for in the engineering design, for example memory fault scenarios that cannot be predicted. The algorithm model's performance in production must also be monitored continuously, so that degradation is found promptly and optimized.
      · Production availability matters more than the algorithm's technical metrics: Overall availability after the algorithm is integrated into the product has to be considered. Production is not a laboratory; the quality and stability of the product once released affect how widely AI techniques are promoted and applied.
      · The cost of deploying the algorithm should be considered: The algorithm's efficiency and the scale of inference data must be fully evaluated, because together they determine the resource cost of the application.


      Finally, we hope our practical experience helps those who are adopting AIOps or are about to. Huawei will remain committed to bringing the digital world to everyone and building an intelligent world where everything is connected.


Copyright notice
This article was created by [Hua Weiyun]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/173/202206221614013196.html