Exploring intelligent operations and maintenance | Anomaly detection methods in cloud systems

2022-06-23 08:44:00 Jiawei blue whale

Background of cloud system anomaly detection

With the rapid development of cloud technology, cloud systems keep growing in complexity and scale, and their stability is increasingly challenged. To address this operations and maintenance problem, operators rely on metrics, logs, and other multi-dimensional information to understand the running state of the cloud system.

The method introduced in this article performs anomaly detection on a cloud system by analyzing system metrics (such as CPU usage, number of I/O requests, and network throughput).

For this kind of metric data, researchers first proposed univariate time series anomaly detection methods. But as cloud systems become more complex, operations and maintenance staff can collect more and more metrics, and a univariate method often cannot reflect the abnormal state of the cloud system as a whole.

In response, researchers proposed multivariate time series anomaly detection. Although this approach considers multiple metrics in the cloud system, it does not take the organizational structure of the cloud system into account, so its applicability is limited.

In a complex cloud system, we can use the system topology to obtain a graph-based representation of the system state and then perform anomaly detection on it. With the rapid development of deep neural networks, researchers have proposed deep-learning-based anomaly detection methods that combine graph neural networks with RNNs and CNNs to capture both spatial and temporal relationships, modeling the data and the topology of the cloud system together.

01. Common anomaly detection methods

1. Traditional anomaly detection methods

Static threshold: a point is abnormal if the raw metric exceeds a fixed threshold.

3-sigma: check whether the current value deviates from the historical mean by more than three standard deviations.

Classification-based methods, for example support vector machines.

Nearest-neighbor-based methods, for example the local outlier factor (LOF).
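
As a minimal sketch of the first two traditional methods, the Python snippet below flags values with a static threshold and with the 3-sigma rule. The function names, the sample CPU readings, and the 90% threshold are illustrative assumptions, not part of the original article.

```python
import numpy as np

def static_threshold(values, threshold):
    """Flag every point whose raw value exceeds a fixed threshold."""
    return np.asarray(values, dtype=float) > threshold

def three_sigma(history, current):
    """Flag `current` if it is more than 3 standard deviations from the historical mean."""
    history = np.asarray(history, dtype=float)
    mean, std = history.mean(), history.std()
    return bool(abs(current - mean) > 3 * std)

# Illustrative CPU usage readings (percent); not real data.
cpu_history = [41.0, 39.5, 40.2, 40.8, 39.9, 40.1, 40.6, 39.7]

print(three_sigma(cpu_history, 40.5))   # close to the historical mean
print(three_sigma(cpu_history, 75.0))   # far outside the historical range
print(static_threshold([40.0, 95.0], threshold=90.0))
```

Both rules look at one metric at a time and ignore temporal order, which is the limitation the deep learning methods in the next subsection try to address.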

2. Deep learning method

These methods make full use of the temporal information in the metrics for anomaly detection.

Prediction-based approach:

Reconstruction-based approach:

Deep-learning-based methods fit a deep learning model to historical data and then predict or reconstruct the data to be checked. If the prediction or reconstruction error is large, the data is judged to be abnormal.
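
The error-based decision rule described above can be sketched as follows. To keep the example self-contained, a naive moving-average forecast stands in for the deep learning model; that substitution, the window length, and the error threshold of 10 are all assumptions made for illustration.

```python
import numpy as np

def prediction_errors(series, window=5):
    """Absolute error between each point and a naive forecast.

    The forecast is the mean of the previous `window` points. It is only a
    placeholder for the prediction a trained deep model would produce.
    """
    series = np.asarray(series, dtype=float)
    errors = np.zeros(len(series))
    for t in range(window, len(series)):
        forecast = series[t - window:t].mean()
        errors[t] = abs(series[t] - forecast)
    return errors

# Flat series with a single spike at index 10 (synthetic data).
series = [10.0] * 10 + [30.0] + [10.0] * 5
errors = prediction_errors(series, window=5)

# A point is judged abnormal when its error exceeds the chosen threshold.
anomalies = np.where(errors > 10.0)[0]
print(anomalies)
```

A reconstruction-based method follows the same pattern, except that the error compares the observed window with the model's reconstruction of that same window rather than with a forecast.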

02. Detailed analysis of TopoMAD's characteristics

1. TopoMAD introduces a graph neural network (GNN), which has the following advantages over a traditional DNN:

▲ Advantages of GNN compared with DNN

2. TopoMAD introduces topology information:

● The feature extractor of the graph neural network is shared among similar metrics from different components, which helps capture similar patterns among metrics of the same type under unified feature learning.

● With a graph neural network, a component can be characterized by its connections to other components, which facilitates end-to-end learning over all components in the system.

● Topology information guides the model to focus on the interactions between components that are directly connected in reality, which helps prevent the model from overfitting.

3. Compared with other approaches to threshold selection, TopoMAD introduces an unsupervised method for generating the threshold. The threshold does not need to be tuned by hand, which reduces the difficulty of tuning the model's parameters.

03. Introduction to the TopoMAD method

This article introduces an anomaly detector designed around a variational autoencoder (VAE).

It is a topology-aware multivariate time series anomaly detector (TopoMAD) that combines a graph neural network (GNN), long short-term memory (LSTM), and a variational autoencoder (VAE) to perform unsupervised anomaly detection for cloud systems.

The TopoMAD method has the following main characteristics:

● TopoMAD is an unsupervised anomaly detection method that considers the topology information of the cloud system. It combines this topology information with the metrics collected in the cloud system to construct a graph-based representation for anomaly detection.

● TopoMAD combines a graph neural network and an LSTM as the basic building block of the VAE to perform anomaly detection on topological time series. The graph neural network extracts the spatial topology information of the cloud system, and the LSTM extracts temporal information from sliding windows.

● TopoMAD uses the stochastic VAE model to detect anomalies in the cloud system in a fully unsupervised way: the model is trained on data that contains both normal and abnormal samples, and an unsupervised threshold selection method is proposed alongside it.

The overall TopoMAD process is as follows:

▲ Overall structure of TopoMAD

● Data integration and processing

The different data collected from different nodes are transformed through data standardization, yielding the metrics X collected from each node and an array E describing the system topology.

● Model training

The model is trained on historical data in offline batch mode. After training, a threshold is chosen according to the distribution of anomaly scores on the training data.

● Threshold selection

The threshold is selected by an unsupervised method, so that it maximizes the distance between the normal data set and the abnormal data set.

● Online anomaly detection

The trained model is used to compute the anomaly score of each new observation. If the anomaly score is higher than the chosen threshold, an alarm is triggered.

① Data integration and preprocessing

During data preprocessing, the different metrics collected from different nodes are converted through data standardization, and sliding windows of a fixed length are then taken from the integrated and processed data as model input.

There are two types of input data:

● X is the metric matrix. One axis indexes the nodes (Node) and the other indexes the metrics (Metric), so each row of the matrix holds the values of all metrics for one node.

● E is the system topology. Every pair of connected nodes in the topology forms an edge, and each edge forms one column vector of E.

▲ Sample input data
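
To make the two inputs concrete, here is a small NumPy sketch that builds X, E, and the sliding windows for a toy system. The system size (3 nodes, 2 metrics, 6 time steps), the random values, the z-score standardization, and the window length of 4 are assumptions chosen for illustration; the article does not specify them.

```python
import numpy as np

# Hypothetical toy system: 6 time steps, 3 nodes, 2 metrics per node.
rng = np.random.default_rng(0)
raw = rng.uniform(0, 100, size=(6, 3, 2))        # (time, node, metric)

# Standardize each (node, metric) series over time (z-score).
mean = raw.mean(axis=0, keepdims=True)
std = raw.std(axis=0, keepdims=True) + 1e-8      # avoid division by zero
X = (raw - mean) / std

# Topology: each column of E is one edge, given by its two node indices.
E = np.array([[0, 1],
              [1, 2]])                           # edges 0-1 and 1-2

def sliding_windows(data, length):
    """Stack every run of `length` consecutive time steps."""
    return np.stack([data[i:i + length] for i in range(len(data) - length + 1)])

windows = sliding_windows(X, length=4)
print(windows.shape)   # (3, 4, 3, 2): windows, time steps, nodes, metrics
print(E.shape)         # (2, 2): two endpoints per edge, two edges
```

At each time step, `X[t]` is the node-by-metric matrix described above, and the same `E` is reused for every window because the topology is assumed to be fixed here.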

② Model design

The model architecture in TopoMAD works as follows:

● The whole network is a stochastic seq2seq autoencoder. It takes the system topology E and the metric information X of each node as input, uses GraphLSTM to capture the topology information of the system, and outputs the reconstructed sequence after encoding and decoding.

● The anomaly score computed for X_t is then used to decide whether the observation is abnormal: when the anomaly score is above the threshold, an anomaly is reported.

● A threshold selection method is proposed: the training data set is cut by the threshold so that the distance between the normal region and the abnormal region is maximized.
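
The article does not give the exact separation criterion, so the sketch below substitutes a well-known one: an Otsu-style cut that maximizes the between-group variance of the training anomaly scores. It illustrates the idea of an unsupervised threshold and should not be read as TopoMAD's actual formula; the function name and the sample scores are hypothetical.

```python
import numpy as np

def select_threshold(scores):
    """Pick the cut that best separates low scores from high scores.

    Uses the Otsu-style between-group variance w_low * w_high * (mean gap)^2.
    This criterion is an assumption standing in for the original method.
    """
    s = np.sort(np.asarray(scores, dtype=float))
    best_cut, best_sep = s[-1], -1.0
    for i in range(1, len(s)):
        low, high = s[:i], s[i:]
        w_low, w_high = len(low) / len(s), len(high) / len(s)
        sep = w_low * w_high * (high.mean() - low.mean()) ** 2
        if sep > best_sep:
            best_sep = sep
            best_cut = (s[i - 1] + s[i]) / 2   # midpoint between the two groups
    return best_cut

# Synthetic training scores: mostly small, with two clearly larger values.
train_scores = [0.10, 0.12, 0.09, 0.11, 0.13, 0.95, 1.05]
threshold = select_threshold(train_scores)
print(threshold)
```

No labels are needed: the cut is derived from the score distribution alone, which is what allows the training data to contain unlabeled anomalies.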

③ Basic unit: GraphLSTM

GraphLSTM is the building block of both the encoder and the decoder, and combines a graph neural network with an LSTM. It is obtained by replacing the fully connected layers of the LSTM with graph neural network layers. Its structure is as follows:

▲ Overall structure of GraphLSTM
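
The idea of swapping the LSTM's fully connected layers for graph layers can be sketched in NumPy as below. The graph layer used here is a simple mean aggregation over each node and its neighbors; the specific layer, the treatment of edges as undirected, the class name `GraphLSTMCell`, and all dimensions are assumptions, since the article does not specify them.

```python
import numpy as np

def normalized_adjacency(E, num_nodes):
    """Row-normalized adjacency with self-loops from an edge array E of shape (2, num_edges)."""
    A = np.eye(num_nodes)
    A[E[0], E[1]] = 1.0
    A[E[1], E[0]] = 1.0          # edges treated as undirected (an assumption)
    return A / A.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GraphLSTMCell:
    """LSTM cell whose gate pre-activations come from a graph layer."""

    def __init__(self, in_dim, hidden_dim, rng):
        # One weight matrix produces all four gates; weights are shared across nodes.
        self.W = rng.normal(0.0, 0.1, size=(in_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, A_hat, x, h, c):
        z = np.concatenate([x, h], axis=1)       # (nodes, in_dim + hidden_dim)
        gates = A_hat @ z @ self.W + self.b      # neighbor averaging, then linear map
        i, f, o, g = np.split(gates, 4, axis=1)  # input, forget, output, candidate
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

# Toy run: 3 nodes, 2 metrics per node, hidden size 4, 5 time steps.
rng = np.random.default_rng(0)
E = np.array([[0, 1], [1, 2]])
A_hat = normalized_adjacency(E, num_nodes=3)
cell = GraphLSTMCell(in_dim=2, hidden_dim=4, rng=rng)
h = c = np.zeros((3, 4))
for x_t in rng.normal(size=(5, 3, 2)):
    h, c = cell.step(A_hat, x_t, h, c)
print(h.shape)   # (3, 4): one hidden vector per node
```

Because `A_hat` mixes each node's features with those of its direct neighbors before the gates are computed, the hidden state of a node depends on both its own history and the recent state of connected components.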

④ Online anomaly detection process

The TopoMAD online anomaly detection process is as follows:

▲ TopoMAD online anomaly detection process
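
The online loop can be sketched as: score each new observation against the model's reconstruction and raise an alarm when the score exceeds the threshold. In this sketch the score is a mean squared reconstruction error and `reconstruct` is a hypothetical placeholder for the trained model; both are assumptions, as the article does not define the score function.

```python
import numpy as np

def anomaly_score(observed, reconstructed):
    """Mean squared reconstruction error over all nodes and metrics."""
    return float(np.mean((observed - reconstructed) ** 2))

def detect(stream, reconstruct, threshold):
    """Yield (step, score, alarm) for each new observation in the stream."""
    for t, x_t in enumerate(stream):
        score = anomaly_score(x_t, reconstruct(x_t))
        yield t, score, score > threshold

# Placeholder "model": reconstructs every observation as the standardized
# normal profile (all zeros). A trained model would be used here instead.
reconstruct = lambda x: np.zeros_like(x)

# Three observations for 3 nodes x 2 metrics; the last one deviates strongly.
stream = [np.zeros((3, 2)), np.full((3, 2), 0.1), np.full((3, 2), 2.0)]

results = list(detect(stream, reconstruct, threshold=0.5))
alarms = [alarm for _, _, alarm in results]
print(alarms)
```

Writing `detect` as a generator keeps the loop usable on an unbounded metric stream, since each observation is scored as soon as it arrives.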

04. Summary

Compared with existing methods, the method introduced in this article pays more attention to the topology information of the system and integrates it into the detection process. It has the following main advantages:

● By using a graph neural network (GNN) within a traditional LSTM, the method introduces topology information, jointly considers the spatial and temporal information of multivariate time series, and takes the connections between cloud system components into account. This helps capture similar patterns among metrics of the same type under unified feature learning.

● The VAE + Seq2Seq design increases the learning capacity of the model, so the model performs better than traditional methods.

● The method is unsupervised: anomaly detection can be carried out without labeled data, and the threshold is also computed in an unsupervised manner. Its sample requirements are lower than those of supervised methods, and the computation is simpler.

Copyright notice
This article was written by [Jiawei blue whale]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/01/202201101446315261.html