当前位置:网站首页>Huawei cloud "digital intelligence" operation and maintenance
Huawei cloud "digital intelligence" operation and maintenance
2022-06-22 17:56:00 【Hua Weiyun】
author : Wang Feng
To support the rapid growth of Huawei's cloud business , The construction of Huawei cloud operation and maintenance system can be divided into three stages :2016 year —2017 year , Implement operation and maintenance as a tool , Cope with small server size through decentralized maintenance of various small tools , But with the rapid growth of business scale , Instrumentalized means alone can no longer satisfy .2018 year -2019 year , Build an operation and maintenance automation platform , Build an automated operation and maintenance system based on scenarios , Start landing AIOps Ability .2020 So far this year , adopt AI Intelligent operation and maintenance platform under blessing , It is applied in multiple value scenarios of operation and maintenance activities , Enter the intelligent operation and maintenance .

The industry divides intelligent operation and maintenance into L1 To L5 Several stages , Take the server scale growth as the index ,10 Less than servers , Through simple expert experience 、 Script and manual operation and maintenance .100 The scale of Taiwan , Use multiple independent tools , Make most of the work instrumental 、 The process can basically meet the needs of operation and maintenance . But when the server scale gradually rises to 100000 、 When millions of , When the operation and maintenance manpower cannot grow rapidly with the scale , We must consider improving operation and maintenance efficiency based on data and intelligent means 、 quality 、 cost .DevOps Stage , It mainly focuses on the implementation of single point intelligent capability , Further cascade multiple single point capabilities through data association , Achieve a high degree of automation for some scenarios .AIOps Stage , In improving quality 、 efficiency 、 In terms of cost, we will comprehensively implement intelligent means , Such as through AI Analysis and decision making 、 Unattended changes , And assist intelligent decision-making through data visualization analysis .
Each stage , We hope to increase the number of O & M servers per capita , The higher the stage , The more decision execution depends on the automation of the system 、 Intelligent , Less dependence on people .

If you pay attention to Gartner The annual change of AI maturity curve can be seen AIOps The development and changes of the platform , Has gone from 2017 The budding period of innovation in , Develop to 2021 In, it entered the bottom period before the maturity period .Gartner forecast 2-5 The annual meeting will enter a mature period . At the same time, we can see from the annual report G AIOPS Research direction of 2021 Compared to 2017 A more detailed landing scenario was given in , It can be seen from Deloitte's research report that AIOps Of Top5 The scene is mainly : Intelligent alarm 、 Root cause analysis 、 Anomaly detection 、 Capacity optimization and fault self-healing .

AIOps Landing strategy
about AIOps Landing strategy , Hua Weiyun mainly organizes 、 Data and platform .
· organization : By the user 、 The product team and technical team are composed of three parties AIOps Landing project team . Define clear project objectives for value scenarios , Develop feasible technical solutions ; Through the application and effect feedback of the current network , Continuous optimization iterations to achieve the ultimate business value .
· data : The data quality of application scenarios directly affects the final landing effect , So you need to surround the scene , Collect complete data ; Through business processes and cases, we can accumulate samples to meet the needs of algorithm research ; Through data governance , Standardize storage management data .
·AI platform : adopt AIOps Platform building MLOPS Ability , promote AIOps Efficiency of scene landing ; The supporting organization uses data to achieve AIOps The scene landed on the site , And continuously optimize, iterate and improve through the feedback of existing network service effect and model monitoring .
that , What scene is suitable for landing AIOps? What are the characteristics of these scenarios ? We summed up a few points :
· Solve the problem of human judgment accuracy based on data ;
· Mining hidden relationships between data based on known events ;
· Extrapolate current data based on historical data ;
· Automatic analysis and assistant decision-making based on data ;
· Forecast the future based on historical data and experience .
meanwhile , We divide the application process into three stages : First of all SRE Raise the pain points of business requirements , Quantitative analysis 、 Demand transformation , Determine the corresponding case data ; Then data scientists do data feature analysis , Develop algorithm model ; Finally, the production team will implement the algorithm model as a product .
We start from the value 、 scene 、 Technical solution 、 The five parts of platform algorithm and data make overall planning for intelligent operation and maintenance . Like fault finding 、 Fault location 、 Root cause analysis 、 Fault avoidance 、 Smart change 、 Intelligent customer service 、 Intelligent scheduling and other important scenarios , Most of the products have been put into production . It is worth mentioning that , Huawei cloud is based on ModelArts Build a platform for intelligent operation and maintenance scenarios on the upper layer AIOps platform , Accelerate scene development and landing speed through platform capabilities .

AIOps Capacity building
The following is an expansion to specifically describe the related problems in the fault life cycle AIOps Ability :
Anomaly detection
There are a lot of alarms 、 Low alarm accuracy , It has always been the biggest headache for the operation and maintenance personnel . We hope to achieve self adaptation through anomaly detection 、 Maintenance free , To solve the pain point that the traditional static threshold cannot accurately alarm .
Self adaptation refers to the automatic adaptation of different index features to the needs of detection , Automatic perception of cyclical indicators , So that the alarm is not disturbed by seasonal changes . Maintenance free means that there is no need for algorithmic personnel to manually adjust parameters and configuration parameters , Intelligent parameter adjustment solves model parameters that cannot be configured by operation and maintenance personnel . Besides , Algorithm model compression , It greatly reduces the resource cost of the model in training .

Intelligent alarm
How to realize alarm noise reduction ? First, classify the alarms , Use the algorithm for continuous alarm 、 Fluctuation alarm 、 Automatic clustering of cause and effect alarms , Then match different algorithm schemes for compression . frequently-used FP-growth, It can mine the frequent relationship of related alarms , Detect by pattern mining and sliding window , To achieve alarm noise reduction . If you want to achieve more accurate alarm compression , Also combine the topological spatial data , Further identify root cause alarms , Improve the efficiency of fault handling .
Intelligent fault location
The multi index location algorithm can accurately identify the correlation index that causes the fault ,SRE Through this index, the fault can be quickly defined , Achieve rapid fault recovery ; Log location first extracts the log template , Detect the abnormal template to identify the corresponding abnormal fault node related log error information , Reduce log analysis time ; Combined with indicators 、 journal , Call chain can realize root cause localization of multiple data sources , This method is through the requestor 、 The business scenario of link operation .

Intelligent fault self-healing
Fault self-healing means that no manual intervention is required , Automatically complete fault isolation and recovery . But this scenario has great limitations , Core competencies include the following :
· Automatic drive : Multi source fault process driven , Adapt to various fault scenarios .
· Intelligent diagnosis : -- diagnose the factors that may induce the fault , Determine the root cause of the failure .
· Rapid self-healing : According to the diagnosis results , Minute level automatic fault handling , Restore customer business .
· Safe and reliable : Provide flow control + Baseline scenario + The grayscale mechanism , Avalanche prevention .
Automatic hardware fault diagnosis & Take self-healing as an example ,AIOPS The system predicts that memory is about to cause host downtime , The self-healing platform will start the diagnosis mechanism after receiving the corresponding prediction alarm , Determine and execute the corresponding self-healing action . When the self-healing process time is short , The impact on customers is very small , It can even make customers feel nothing .
Self healing capability through current hardware failure , Can be realized 5 Minute level hardware fault diagnosis & Automated processing ( From alarm reporting to fault recovery, you only need 5 minute ), Greatly reduce the impact of failure on the customer's business . But self healing doesn't always work , Under the condition that all the logics from fault discovery to processing are satisfied , Will trigger the self-healing process .

The above is what Huawei cloud has done in the fault life cycle AIOps practice , In this process, we have summarized four main experiences :
·Data First: Data quality is AIOps The necessary conditions for successful landing , The quality and effect of the model are determined by the sample data and the current network feedback , Only complete data can find effective features in the feature engineering stage .
· Engineering is as important as algorithm : The difficulty and importance of the project cannot be underestimated , The problems that cannot be solved by the algorithm shall be compensated from the engineering scheme . For example, some unpredictable scenes in memory , Make up for by engineering means , At the same time, it is necessary to continuously monitor the performance of the operation algorithm model in practice , Find out the deterioration phenomenon in time and implement optimization .
· The availability of the existing network is more important than the technical index of the algorithm : We need to consider the overall availability after the integration of algorithms and products , Because the current network is not a laboratory , The product quality and stability after the product landing will affect AI Promotion and application of technology , Therefore, the availability of the current network is very important .
· The cost of algorithm implementation should be considered : It is necessary to fully evaluate the efficiency of the algorithm and the data scale of reasoning , Data scale and algorithm efficiency determine the cost of application resources .

Last , I hope our practical experience , It can be given to those who are landing or are about to land AIOps My friends help . Huawei will continue to be committed to bringing the digital world to everyone , Building an intelligent world of interconnection of all things .
边栏推荐
- math_角函数&反三角函数
- Tried several report tools, and finally found a report based on Net 6
- clickhouse 21. X cluster four piece one copy deployment
- 轻松上手Fluentd,结合 Rainbond 插件市场,日志收集更快捷
- 【人脸识别】基于GoogleNet深度学习网络的人脸识别matlab仿真
- The principle of locality in big talk
- Quickly master asp Net authentication framework identity - user registration
- Development mode of JSP learning
- Mybaits:接口代理方式实现Dao
- Quartus Prime 18.0软件安装包和安装教程
猜你喜欢

How to do well in R & D efficiency measurement and index selection

Content recommendation process

MTLs guidelines for kubernetes engineers

Team management | how to improve the thinking skills of technical leaders?
![[small program project development -- Jingdong Mall] rotation chart of uni app development](/img/9b/503919129a2c544dd6143c6b47a2c4.png)
[small program project development -- Jingdong Mall] rotation chart of uni app development

Traitement des valeurs manquantes

测试组的任务职责和测试的基本概念

MySQL master-slave connection prompt of docker: communications link failure

Fluentd is easy to get started. Combined with the rainbow plug-in market, log collection is faster

以小见大:一个领域建模的简单示例,理解“领域驱动”。
随机推荐
关于#数据库#的问题,如何解决?
WPF 实现星空效果
Which platform is safer to buy stocks on?
Docker之MySQL主从连接提示:Communications link failure
Come to Xiamen! Online communication quota free registration
[mysql] data synchronization prompt: specified key was too long; max key length is 767 bytes
A classmate asked what framework PHP should learn?
High voltage direct current (HVDC) model based on converter (MMC) technology and voltage source converter (VSC) (implemented by MATLAB & Simulink)
It may be the most comprehensive Matplotlib visualization tutorial in the whole network
170million passwords of Netcom learning link have been leaked! What are the remedies?
STM32 series (HAL Library) - f103c8t6 hardware SPI illuminates OLED screen with word library
DAP事实表加工汇总功能应用说明
How to do well in R & D efficiency measurement and index selection
Apache ShardingSphere 一文读懂
Mybaits: interface proxy implementation Dao
You call this crap high availability?
What is a flush? Is online account opening safe?
写给 Kubernetes 工程师的 mTLS 指南
Parallel通过XCM与Moonbeam集成,将PARA以及DeFi用例带入Moonbeam生态
Quickly master asp Net authentication framework identity - login and logout