当前位置:网站首页>Huawei cloud "digital intelligence" operation and maintenance
Huawei cloud "digital intelligence" operation and maintenance
2022-06-22 17:56:00 【Hua Weiyun】
author : Wang Feng
To support the rapid growth of Huawei's cloud business , The construction of Huawei cloud operation and maintenance system can be divided into three stages :2016 year —2017 year , Implement operation and maintenance as a tool , Cope with small server size through decentralized maintenance of various small tools , But with the rapid growth of business scale , Instrumentalized means alone can no longer satisfy .2018 year -2019 year , Build an operation and maintenance automation platform , Build an automated operation and maintenance system based on scenarios , Start landing AIOps Ability .2020 So far this year , adopt AI Intelligent operation and maintenance platform under blessing , It is applied in multiple value scenarios of operation and maintenance activities , Enter the intelligent operation and maintenance .

The industry divides intelligent operation and maintenance into L1 To L5 Several stages , Take the server scale growth as the index ,10 Less than servers , Through simple expert experience 、 Script and manual operation and maintenance .100 The scale of Taiwan , Use multiple independent tools , Make most of the work instrumental 、 The process can basically meet the needs of operation and maintenance . But when the server scale gradually rises to 100000 、 When millions of , When the operation and maintenance manpower cannot grow rapidly with the scale , We must consider improving operation and maintenance efficiency based on data and intelligent means 、 quality 、 cost .DevOps Stage , It mainly focuses on the implementation of single point intelligent capability , Further cascade multiple single point capabilities through data association , Achieve a high degree of automation for some scenarios .AIOps Stage , In improving quality 、 efficiency 、 In terms of cost, we will comprehensively implement intelligent means , Such as through AI Analysis and decision making 、 Unattended changes , And assist intelligent decision-making through data visualization analysis .
Each stage , We hope to increase the number of O & M servers per capita , The higher the stage , The more decision execution depends on the automation of the system 、 Intelligent , Less dependence on people .

If you pay attention to Gartner The annual change of AI maturity curve can be seen AIOps The development and changes of the platform , Has gone from 2017 The budding period of innovation in , Develop to 2021 In, it entered the bottom period before the maturity period .Gartner forecast 2-5 The annual meeting will enter a mature period . At the same time, we can see from the annual report G AIOPS Research direction of 2021 Compared to 2017 A more detailed landing scenario was given in , It can be seen from Deloitte's research report that AIOps Of Top5 The scene is mainly : Intelligent alarm 、 Root cause analysis 、 Anomaly detection 、 Capacity optimization and fault self-healing .

AIOps Landing strategy
about AIOps Landing strategy , Hua Weiyun mainly organizes 、 Data and platform .
· organization : By the user 、 The product team and technical team are composed of three parties AIOps Landing project team . Define clear project objectives for value scenarios , Develop feasible technical solutions ; Through the application and effect feedback of the current network , Continuous optimization iterations to achieve the ultimate business value .
· data : The data quality of application scenarios directly affects the final landing effect , So you need to surround the scene , Collect complete data ; Through business processes and cases, we can accumulate samples to meet the needs of algorithm research ; Through data governance , Standardize storage management data .
·AI platform : adopt AIOps Platform building MLOPS Ability , promote AIOps Efficiency of scene landing ; The supporting organization uses data to achieve AIOps The scene landed on the site , And continuously optimize, iterate and improve through the feedback of existing network service effect and model monitoring .
that , What scene is suitable for landing AIOps? What are the characteristics of these scenarios ? We summed up a few points :
· Solve the problem of human judgment accuracy based on data ;
· Mining hidden relationships between data based on known events ;
· Extrapolate current data based on historical data ;
· Automatic analysis and assistant decision-making based on data ;
· Forecast the future based on historical data and experience .
meanwhile , We divide the application process into three stages : First of all SRE Raise the pain points of business requirements , Quantitative analysis 、 Demand transformation , Determine the corresponding case data ; Then data scientists do data feature analysis , Develop algorithm model ; Finally, the production team will implement the algorithm model as a product .
We start from the value 、 scene 、 Technical solution 、 The five parts of platform algorithm and data make overall planning for intelligent operation and maintenance . Like fault finding 、 Fault location 、 Root cause analysis 、 Fault avoidance 、 Smart change 、 Intelligent customer service 、 Intelligent scheduling and other important scenarios , Most of the products have been put into production . It is worth mentioning that , Huawei cloud is based on ModelArts Build a platform for intelligent operation and maintenance scenarios on the upper layer AIOps platform , Accelerate scene development and landing speed through platform capabilities .

AIOps Capacity building
The following is an expansion to specifically describe the related problems in the fault life cycle AIOps Ability :
Anomaly detection
There are a lot of alarms 、 Low alarm accuracy , It has always been the biggest headache for the operation and maintenance personnel . We hope to achieve self adaptation through anomaly detection 、 Maintenance free , To solve the pain point that the traditional static threshold cannot accurately alarm .
Self adaptation refers to the automatic adaptation of different index features to the needs of detection , Automatic perception of cyclical indicators , So that the alarm is not disturbed by seasonal changes . Maintenance free means that there is no need for algorithmic personnel to manually adjust parameters and configuration parameters , Intelligent parameter adjustment solves model parameters that cannot be configured by operation and maintenance personnel . Besides , Algorithm model compression , It greatly reduces the resource cost of the model in training .

Intelligent alarm
How to realize alarm noise reduction ? First, classify the alarms , Use the algorithm for continuous alarm 、 Fluctuation alarm 、 Automatic clustering of cause and effect alarms , Then match different algorithm schemes for compression . frequently-used FP-growth, It can mine the frequent relationship of related alarms , Detect by pattern mining and sliding window , To achieve alarm noise reduction . If you want to achieve more accurate alarm compression , Also combine the topological spatial data , Further identify root cause alarms , Improve the efficiency of fault handling .
Intelligent fault location
The multi index location algorithm can accurately identify the correlation index that causes the fault ,SRE Through this index, the fault can be quickly defined , Achieve rapid fault recovery ; Log location first extracts the log template , Detect the abnormal template to identify the corresponding abnormal fault node related log error information , Reduce log analysis time ; Combined with indicators 、 journal , Call chain can realize root cause localization of multiple data sources , This method is through the requestor 、 The business scenario of link operation .

Intelligent fault self-healing
Fault self-healing means that no manual intervention is required , Automatically complete fault isolation and recovery . But this scenario has great limitations , Core competencies include the following :
· Automatic drive : Multi source fault process driven , Adapt to various fault scenarios .
· Intelligent diagnosis : -- diagnose the factors that may induce the fault , Determine the root cause of the failure .
· Rapid self-healing : According to the diagnosis results , Minute level automatic fault handling , Restore customer business .
· Safe and reliable : Provide flow control + Baseline scenario + The grayscale mechanism , Avalanche prevention .
Automatic hardware fault diagnosis & Take self-healing as an example ,AIOPS The system predicts that memory is about to cause host downtime , The self-healing platform will start the diagnosis mechanism after receiving the corresponding prediction alarm , Determine and execute the corresponding self-healing action . When the self-healing process time is short , The impact on customers is very small , It can even make customers feel nothing .
Self healing capability through current hardware failure , Can be realized 5 Minute level hardware fault diagnosis & Automated processing ( From alarm reporting to fault recovery, you only need 5 minute ), Greatly reduce the impact of failure on the customer's business . But self healing doesn't always work , Under the condition that all the logics from fault discovery to processing are satisfied , Will trigger the self-healing process .

The above is what Huawei cloud has done in the fault life cycle AIOps practice , In this process, we have summarized four main experiences :
·Data First: Data quality is AIOps The necessary conditions for successful landing , The quality and effect of the model are determined by the sample data and the current network feedback , Only complete data can find effective features in the feature engineering stage .
· Engineering is as important as algorithm : The difficulty and importance of the project cannot be underestimated , The problems that cannot be solved by the algorithm shall be compensated from the engineering scheme . For example, some unpredictable scenes in memory , Make up for by engineering means , At the same time, it is necessary to continuously monitor the performance of the operation algorithm model in practice , Find out the deterioration phenomenon in time and implement optimization .
· The availability of the existing network is more important than the technical index of the algorithm : We need to consider the overall availability after the integration of algorithms and products , Because the current network is not a laboratory , The product quality and stability after the product landing will affect AI Promotion and application of technology , Therefore, the availability of the current network is very important .
· The cost of algorithm implementation should be considered : It is necessary to fully evaluate the efficiency of the algorithm and the data scale of reasoning , Data scale and algorithm efficiency determine the cost of application resources .

Last , I hope our practical experience , It can be given to those who are landing or are about to land AIOps My friends help . Huawei will continue to be committed to bringing the digital world to everyone , Building an intelligent world of interconnection of all things .
边栏推荐
- [recruitment] [Beijing Zhongguancun / remote] [tensorbase][open source data warehouse] and other people do one thing
- 面试突击58:truncate、delete和drop的6大区别!
- Mybaits: common database operations (Neusoft operations)
- Missing value handling
- Cross platform brake browser
- Blazor University (31) form - Validation
- math_ Angular function & inverse trigonometric function
- Thoughts on joint primary key
- 0 basic how to get started software testing, can you succeed in changing careers?
- Application description of DAP fact table processing summary function
猜你喜欢

抢先报名丨新一代 HTAP 数据库如何在云上重塑?TiDB V6 线上发布会即将揭晓!
![[fpga+pwm] design and implementation of phase shift trigger circuit for three-phase PWM rectifier based on FPGA](/img/ad/c039932abe409d696380e8427173c0.png)
[fpga+pwm] design and implementation of phase shift trigger circuit for three-phase PWM rectifier based on FPGA

You call this crap high availability?

试用了多款报表工具,终于找到了基于.Net 6开发的一个了

快速掌握 ASP.NET 身份认证框架 Identity - 用户注册

内容推荐流程

Quartus prime 18.0 software installation package and installation tutorial

Noah fortune plans to land on the Hong Kong Stock Exchange: the performance fell sharply in the first quarter, and once stepped on the thunder "Chengxing case"

Description of new features and changes in ABP Framework version 5.3.0

Recommend 7 super easy-to-use terminal tools - ssh+ftp
随机推荐
Defaultifempty for C # -linq source code analysis
测试组的任务职责和测试的基本概念
Arrays Aslist uses bug
缺失值處理
推荐7款超级好用的终端工具 —— SSH+FTP
A new mode of enterprise software development: low code
Docker 之MySQL 重启,提示Error response from daemon: driver failed programming external connectivity on **
Thoughts on joint primary key
Xshell 7 (SSH Remote Terminal tool) v7.0.0109 official Chinese Version (with file + installation tutorial)
Graduation season · undergraduate graduation thoughts -- the self-help road of mechanical er
.NET 发布和支持计划介绍
Parallel通过XCM与Moonbeam集成,将PARA以及DeFi用例带入Moonbeam生态
C#-Linq源码解析之DefaultIfEmpty
Quartus Prime 18.0软件安装包和安装教程
Is flush easy to open an account? Is it safe to open an account online?
同花顺软件是什么?网上开户安全么?
Missing value handling
It may be the most comprehensive Matplotlib visualization tutorial in the whole network
STM32系列(HAL库)——F103C8T6硬件SPI点亮带字库OLED屏
Come to Xiamen! Online communication quota free registration