Exploring Intelligent Operations and Maintenance | Anomaly Detection Methods for Cloud Systems
2022-06-23 08:44:00 【Jiawei blue whale】
Cloud system anomaly detection background
With the rapid development of cloud technology, cloud systems keep growing in complexity and scale, which puts their stability under considerable pressure. To address operations and maintenance problems, operators rely on multi-dimensional information such as metrics and logs to understand the running state of a cloud system.
The method introduced in this article detects anomalies in a cloud system by analyzing system metrics (such as CPU usage, number of I/O requests, and network throughput).
For such metric data, researchers first proposed univariate time series anomaly detection methods. But as cloud systems become more complex and operators can collect more and more metrics, a univariate method often fails to reflect anomalies of the cloud system as a whole.
In response, researchers proposed multivariate time series anomaly detection. Although this approach considers multiple metrics in the cloud system, it does not take the organizational structure of the system into account, so its applicability is limited.
In a complex cloud system, the system topology can be used to obtain a graph-based representation of the system state, on which anomaly detection is then performed. With the rapid development of deep neural networks, researchers have proposed deep-learning-based anomaly detection methods that apply graph neural networks together with RNNs and CNNs to capture spatial and temporal relationships, modeling both the data and the topology of the cloud system.
01. Common anomaly detection methods
1. Traditional anomaly detection methods
● Static threshold: a metric value is flagged as abnormal if it exceeds a fixed threshold.
● 3-sigma: check whether the current value deviates from the historical mean by more than 3 standard deviations.
● Classification-based methods, for example support vector machines.
● Nearest-neighbor-based methods, for example the local outlier factor (LOF).
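The first two rules in the list above are simple enough to sketch directly. The following is a minimal illustration, not a production detector; the function names, the sample CPU values, and the limit of 90 are assumptions made up for the example.

```python
from statistics import mean, stdev

def static_threshold_anomalies(values, limit):
    """Flag every value that exceeds a fixed, hand-picked limit."""
    return [v > limit for v in values]

def three_sigma_anomalies(history, new_values, k=3.0):
    """Flag values that deviate from the historical mean by more than k standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the history
    return [abs(v - mu) > k * sigma for v in new_values]

# Hypothetical CPU usage history (percent) and two new observations.
cpu_history = [41.0, 39.5, 40.2, 40.8, 39.9, 40.4, 40.1, 39.7]
new_points = [40.3, 55.0]

static_flags = static_threshold_anomalies(new_points, limit=90.0)
sigma_flags = three_sigma_anomalies(cpu_history, new_points)
print(static_flags, sigma_flags)
```

With these made-up numbers the static threshold misses the jump to 55 because the limit was set too high, while the 3-sigma rule flags it, which shows how much a static threshold depends on choosing the limit well.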
2. Deep learning methods
These methods make full use of the temporal information in the metrics for anomaly detection.
● Prediction-based methods
● Reconstruction-based methods
Deep-learning-based methods fit a deep learning model to historical data and then predict or reconstruct the data under inspection. If the prediction or reconstruction error is large, the data is judged abnormal.
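The "predict, then compare the error" idea can be illustrated without a neural network. In the sketch below a moving average stands in for the deep learning predictor; that substitution, the window size, and the error limit of 10 are assumptions chosen only to keep the example short and runnable.

```python
def prediction_errors(series, window=3):
    """Predict each point as the mean of the previous `window` points
    and return the absolute error |actual - predicted| for each predicted point."""
    errors = []
    for t in range(window, len(series)):
        predicted = sum(series[t - window:t]) / window
        errors.append(abs(series[t] - predicted))
    return errors

# Hypothetical metric series with one spike at index 4.
series = [10.0, 10.0, 10.0, 10.0, 30.0, 10.0]
errors = prediction_errors(series)

# A point is judged abnormal when its prediction error is large.
error_limit = 10.0
flags = [e > error_limit for e in errors]
print(errors, flags)
```

Note that the point after the spike also gets a raised error, because the spike has entered the prediction window. A learned model faces the same issue when its input window contains abnormal values.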
02. Detailed analysis of TopoMAD's characteristics
1. TopoMAD introduces a graph neural network (GNN), which has the following advantages over a traditional DNN:
● The feature extractor of the graph neural network is shared among metrics of the same kind from different components, which helps capture similar patterns among metrics of the same type under unified feature learning.
● With a graph neural network, a component can be characterized by its connections to other components, which facilitates end-to-end learning across all components in the system.
2. TopoMAD introduces topology information:
● Topology information guides the model to focus on the interactions between components that are directly connected in reality, which helps prevent the model from overfitting.
3. Unlike other threshold selection approaches, TopoMAD generates its threshold in an unsupervised way. The threshold does not need to be tuned by hand, which makes the model easier to tune.
03. Introduction to the TopoMAD method
This article introduces an anomaly detector built on a variational autoencoder (VAE).
It is a topology-aware multivariate time series anomaly detector (TopoMAD) that combines a graph neural network (GNN), long short-term memory (LSTM), and a variational autoencoder (VAE) to perform unsupervised anomaly detection for cloud systems.
TopoMAD has the following main characteristics:
● TopoMAD is an unsupervised anomaly detection method that takes the topology of the cloud system into account. It combines this topology information with the metrics collected in the cloud system to construct a graph-based representation for anomaly detection.
● TopoMAD uses a graph neural network and an LSTM together as the basic building blocks of the VAE to detect anomalies in topological time series. The graph neural network extracts the spatial topology information of the cloud system, and the LSTM extracts temporal information from sliding windows.
● TopoMAD uses the stochastic model VAE to detect anomalies in the cloud system in a completely unsupervised way: the model is trained on data containing both normal and abnormal samples, and an unsupervised threshold selection method is proposed alongside it.
The overall TopoMAD process is as follows:
● Data integration and processing
Data collected from different nodes is transformed through standardization, yielding the metrics X collected from each node and an array E describing the system topology.
● Model training
The model is trained on historical data in offline batch mode. After training, a threshold is chosen according to the distribution of anomaly scores on the training data.
● Threshold selection
The threshold is selected in an unsupervised way so that it maximizes the distance between the normal and abnormal sets of data.
● Online anomaly detection
The trained model computes an anomaly score for each new observation. If the score is higher than the chosen threshold, an alarm is triggered.
① Data integration and preprocessing
During preprocessing, the different metrics collected from different nodes are standardized, and sliding windows of a fixed length are then taken from the integrated, processed data as model input.
There are two types of input data:
● X is the metric matrix. Its rows correspond to nodes and its columns to metrics, so each row holds the values of all metrics for one node.
● E is the system topology. Every pair of related nodes in the topology forms an edge, and each edge is one column vector of E.
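To make the shapes of X and E concrete, here is a small sketch of the preprocessing step. It assumes z-score standardization per metric series and a window length of 3; the node names, metric names, and values are invented for the example, and the source does not say which standardization is actually used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score one metric series; a constant series maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def sliding_windows(frames, length):
    """Return every run of `length` consecutive frames."""
    return [frames[i:i + length] for i in range(len(frames) - length + 1)]

def edge_array(edges):
    """Turn (source, target) pairs into a 2 x num_edges array: one column per edge."""
    return [[s for s, _ in edges], [t for _, t in edges]]

# Hypothetical raw metrics: raw[node][metric] is a time series of 4 samples.
nodes = ["web", "db"]
metrics = ["cpu", "net"]
raw = {
    "web": {"cpu": [10.0, 20.0, 30.0, 40.0], "net": [1.0, 1.0, 3.0, 3.0]},
    "db": {"cpu": [50.0, 50.0, 50.0, 50.0], "net": [2.0, 4.0, 6.0, 8.0]},
}
scaled = {n: {m: standardize(raw[n][m]) for m in metrics} for n in nodes}

# One frame X_t per time step: rows are nodes, columns are metrics.
steps = 4
frames = [[[scaled[n][m][t] for m in metrics] for n in nodes] for t in range(steps)]
windows = sliding_windows(frames, length=3)

# "web" (index 0) is connected to "db" (index 1).
E = edge_array([(0, 1)])
print(len(windows), E)
```

Each element of `windows` is a sequence of three node-by-metric matrices, and `E` stays the same for every window as long as the topology does not change.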
② Model design
The architecture of the TopoMAD model works as follows:
● The whole network is a stochastic seq2seq autoencoder. It takes the system topology E and the metric information X of each node as input, uses GraphLSTM to capture the topology information of the system, and outputs the reconstructed sequence after encoding and decoding.
● An anomaly score is then computed for X_t. When the score is above the threshold, an anomaly is reported.
● A threshold selection method is proposed: the threshold splits the training data set so that the distance between the normal and abnormal regions is maximized.
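The text does not spell out how the "distance between the normal and abnormal regions" is measured. As one possible reading, the sketch below scans every cut point in the sorted training scores and keeps the one that maximizes the between-group variance, as in Otsu's method. That criterion is a stand-in chosen for illustration, not necessarily the one TopoMAD uses, and the scores are made up.

```python
def select_threshold(scores):
    """Pick a cut that splits the scores into a low and a high group
    with maximal between-group variance. Returns None if all scores are equal."""
    ordered = sorted(scores)
    n = len(ordered)
    best_cut, best_value = None, -1.0
    for i in range(1, n):
        low, high = ordered[:i], ordered[i:]
        if low[-1] == high[0]:
            continue  # no threshold can separate two equal scores
        gap = sum(high) / len(high) - sum(low) / len(low)
        value = (len(low) / n) * (len(high) / n) * gap ** 2
        if value > best_value:
            best_value = value
            best_cut = (low[-1] + high[0]) / 2  # midpoint between the two groups
    return best_cut

# Hypothetical anomaly scores on unlabeled training data.
training_scores = [0.10, 0.20, 0.15, 0.12, 0.18, 2.0, 2.2]
threshold = select_threshold(training_scores)
print(threshold)
```

No labels are needed: the cut is derived from the score distribution alone, which is the property the article emphasizes.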
③ Basic unit: GraphLSTM
GraphLSTM is a component of both the encoder and the decoder and combines a graph neural network with an LSTM. It is obtained by replacing the fully connected layers of the LSTM with graph neural layers.
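As a rough illustration of what a graph layer does in place of a fully connected layer, the sketch below averages each node's features with those of its neighbors and then applies one weight matrix shared by all nodes. The mean aggregation, the undirected treatment of edges, and the name `graph_layer` are assumptions for this example; the source does not specify which graph layer is used, and a full GraphLSTM cell would apply such a layer inside each LSTM gate.

```python
def graph_layer(X, E, W):
    """One simplified graph layer.
    X: num_nodes x num_features matrix of node features.
    E: 2 x num_edges array, one edge per column (row 0 and row 1 are the endpoints).
    W: num_features x num_outputs weight matrix shared by every node."""
    n = len(X)
    num_features = len(X[0])
    # Every node aggregates over itself plus its neighbors.
    neighbors = {i: [i] for i in range(n)}
    for s, t in zip(E[0], E[1]):
        neighbors[s].append(t)  # edges are treated as undirected here (assumption)
        neighbors[t].append(s)
    output = []
    for i in range(n):
        group = neighbors[i]
        aggregated = [sum(X[j][f] for j in group) / len(group) for f in range(num_features)]
        output.append([
            sum(aggregated[f] * W[f][o] for f in range(num_features))
            for o in range(len(W[0]))
        ])
    return output

# Three nodes with two features each; only nodes 0 and 1 are connected.
X = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
E = [[0], [1]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so the output shows the aggregation alone
H = graph_layer(X, E, W)
print(H)
```

Because `W` is shared, the number of parameters does not grow with the number of nodes, and the isolated node 2 keeps its own features while nodes 0 and 1 are mixed.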
④ Online anomaly detection process
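In online detection, TopoMAD uses the trained model to compute an anomaly score for each newly observed window and triggers an alarm when the score exceeds the selected threshold.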
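The online loop can be sketched as follows. The mean squared reconstruction error is used as the anomaly score purely for illustration, and `reconstruct_stub` is a hypothetical stand-in for the trained model; neither is taken from the source.

```python
def anomaly_score(observed, reconstructed):
    """Mean squared error between an observed node-by-metric matrix and its reconstruction."""
    squared = [
        (o - r) ** 2
        for o_row, r_row in zip(observed, reconstructed)
        for o, r in zip(o_row, r_row)
    ]
    return sum(squared) / len(squared)

def detect(stream, reconstruct, threshold):
    """Score each new observation and collect (time step, score) for every alarm."""
    alarms = []
    for t, X_t in enumerate(stream):
        score = anomaly_score(X_t, reconstruct(X_t))
        if score > threshold:
            alarms.append((t, score))
    return alarms

def reconstruct_stub(X_t):
    """Hypothetical model stand-in: always reconstructs the standardized 'normal' level 0."""
    return [[0.0 for _ in row] for row in X_t]

# Two standardized observations; the second has a strongly deviating node.
stream = [
    [[0.1, -0.1], [0.0, 0.2]],
    [[3.0, 2.0], [0.1, 0.0]],
]
alarms = detect(stream, reconstruct_stub, threshold=1.0)
print(alarms)
```

In a real deployment `reconstruct` would be the trained model and `threshold` the value obtained from the training scores.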
04. Summary
Compared with existing methods, the method introduced in this article pays more attention to the topology of the system and integrates that topology into the detection process. Its main advantages are:
● It uses a graph neural network (GNN) inside a traditional LSTM to introduce topology information, jointly considers the spatial and temporal information of multivariate time series, and takes the connections between cloud system components into account, which helps capture similar patterns among metrics of the same type under unified feature learning.
● The VAE + Seq2Seq design increases the learning capacity of the model, which performs better than traditional methods.
● It is an unsupervised anomaly detection method: detection works without labeled data, and the threshold is also computed in an unsupervised way, so its requirements on samples are lower than those of supervised methods and the computation is simpler.