In earlier Erda Show sessions we shared topics such as monitoring and logging for microservices. Many attendees wanted to dig deeper into log analysis afterwards. The Erda team has been working on log analysis for some time, so this time I would like to share what we are doing, and I hope you find it helpful.
The log analysis platform is a functional module of Erda's microservice governance sub-platform, so today I will cover three aspects:
- the necessity of a log analysis platform;
- the architecture design of a log analysis platform;
- how Erda does it today, what has been done, and the future direction.
The necessity of a log analysis platform
The concept of "microservices" emerged around 2013. Since then, most applications have moved to distributed, containerized deployment architectures, or at least multi-service architectures. A service is rarely a single point; instead, each service is deployed with multiple instances for high availability.
In this situation, there are several log-related problems we need to solve.
Problems to be solved
1. When an interface reports an error, how do we quickly find the detailed exception log among the application's many containers?
The first problem is how efficiently we can locate exception logs. For example, the front end requests a page, the interface returns an error, and the issue is reported to the developers. A well-behaved interface usually does not expose detailed exception information to users; it returns only a status code or a brief error summary. The developers then have to find the concrete exception information (such as the stack trace) in the logs.
The interface path usually tells us which service reported the error, but how do we determine which instance of that service it was? With the primitive approach, developers may have to inspect the logs of each instance (container) one by one, which directly hurts development efficiency.
2. How do we conveniently view the logs of application containers that have already gone down?
Another problem is the persistence of log storage. For example, on Kubernetes, a pod of an application service dies and is rescheduled to another node, or a locally stored container log is lost to rotation over time. If we then want to go back and read the earlier logs, they are hard to find on the machine.
3. How do we discover an exception in an application container in time?
The first two problems are reactive: some problem has already surfaced in the front-end business, and we go back to find the detailed log records. Because the trigger is user feedback, the fault-handling chain is long. So how do we shorten the time to resolution?
The natural answer is proactive alerting. Before a broken front-end interface is noticed by a large number of users, the system should raise an alarm as soon as an exception occurs in the back end and notify the relevant people to handle it promptly, reducing downtime. For that we need a system that continuously watches the logs of all containers, helps developers find exceptions, and alerts proactively.
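As a minimal sketch of such rule-based log monitoring (the rule and channel shapes here are assumptions, not Erda's actual API):

```python
import re
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    pattern: re.Pattern  # compiled regex describing a known exception signature
    channel: str         # preconfigured alert channel, e.g. "email", "dingtalk"

def check_log_line(line: str, rules: list[AlertRule]) -> list[tuple[str, str]]:
    """Return (rule name, channel) for every rule the log line matches."""
    return [(r.name, r.channel) for r in rules if r.pattern.search(line)]

rules = [
    AlertRule("npe", re.compile(r"java\.lang\.NullPointerException"), "email"),
    AlertRule("oom", re.compile(r"OutOfMemoryError"), "dingtalk"),
]

hits = check_log_line("Caused by: java.lang.NullPointerException at Foo.bar", rules)
# each hit would then be forwarded to its configured alert channel
```

A real system would run this matching continuously over the incoming log stream rather than line by line in process.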
Without a log analysis platform, the three problems above are not unsolvable, but solving them is far less efficient. A log analysis platform that provides centralized query, persistent storage, and proactive alerting lets us solve all three quickly and efficiently.
What a log analysis platform can provide
Having argued for its necessity, we should be clear about what services the platform can provide. Let's look at them in detail:
1. Centralized, one-stop query
For querying, the platform should offer a centralized, one-stop experience. There is no need to log in to different machines or containers to inspect logs manually and inefficiently; you simply enter a query on a unified page and conveniently search the logs of all containers.
2. Persistence, so history can be traced back
For storage, the platform can be provisioned with a dedicated storage quota, so that logs survive downtime, upgrades, rescheduling, and other situations where logs cross nodes, keeping historical log queries simple.
3. Intelligence, finding problems proactively
Intelligent alerting is usually a required feature, and "intelligent" here has two meanings:
- You can actively configure rules. From the code, or from exceptions that have already occurred, you know what a particular exception looks like, and you turn that into a rule. The system then continuously runs rule matching over the incoming logs; when a match is found, it notifies the right person through a preconfigured alert channel.
- The system automatically discovers "anomalies". This is closer to today's machine learning and deep learning: even if you configure no rules at all, the system can monitor and learn from the log stream, find anomalous logs, and bring them to your attention. This is the more intelligent part.
Architecture design of the log analysis platform
We now know that a log analysis platform brings convenience and efficiency. If we want to build such a platform, how should we design its architecture?
To do architecture design, we first need to understand the business scenarios and requirements, then combine them with the characteristics of the data being processed; only then can we infer what capabilities the platform architecture must have. After that, we search for and design candidate schemes, and from those pick the ones that can actually be implemented.
Data characteristics
1. Time-series data
Logs are time-series data: they are append-only and never deleted in place. A log entry has several key fields: timestamp, tags, and fields.
- timestamp: the time field is the key field for comparing and analyzing time-series data;
- tags: a set of fields that, for time-series data, are generally searchable; that is, they need to be indexed. Examples: service name, container name, container IP;
- fields: also a set of fields, which differ from tags in that they usually store content that does not need to be searched. For example, if you do not plan to search the concrete log content, you can store it in a fields-type field.
The time-series nature of log data suggests considering a time-series database, such as Cassandra, to store logs.
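As a sketch, a single log entry under this timestamp/tags/fields model might look like the following (the concrete field names are illustrative assumptions):

```python
# One log entry in the timestamp / tags / fields model.
log_entry = {
    "timestamp": 1700000000000,              # ms since epoch; the primary sort key
    "tags": {                                # indexed, searchable dimensions
        "service_name": "order-service",
        "container_name": "order-service-7d9f",
        "container_ip": "10.0.3.12",
    },
    "fields": {                              # stored, but not indexed for search
        "content": "Caused by: java.lang.NullPointerException ...",
    },
}
```

Keeping the bulky `content` out of the indexed tags is what later makes cheaper storage options possible.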
2. Strong timeliness
For log data, we usually only care about recent data. Data from long ago, say one, two, or even six months back, we basically ignore. When a fault occurs we usually need to inspect the specific log information right away; a fault is unlikely to sit for a long time before being investigated and resolved.
3. Large data volume
"Large volume" has two meanings. First, a single log entry may be large, such as the exception stack of a Java application, especially one with several nested Caused by layers; a single entry can run to hundreds of lines. Second, there are simply many logs: as the business and applications grow, and some applications may enable DEBUG-level logging, the overall log volume becomes large, with possible short-term peaks.
Those are the characteristics of log data. Now let's look, from the perspective of the log analysis platform, at what we require from the system.
Business requirements and characteristics
1. Fast queries, second-level response
First, we want fast queries: enter a query term and get results back within seconds.
2. Time-range queries
Queries are usually run over an explicit time range. This has a benefit: it leaves more choices for the back-end storage.
3. High-cardinality point queries
What is high cardinality? Data such as user IP or Trace ID takes a different value for almost every request; that is high cardinality. Point queries on this kind of data are a hard requirement. For example, when a front-end web interface errors out and the response includes a Trace ID field, you can use that Trace ID to view the exception logs or key logs recorded along the whole request path. This is a common need.
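A toy in-memory illustration of such a point query: almost every request carries a distinct trace_id, and we want exactly the lines for one of them (field names are assumptions):

```python
# In a real platform this filter runs against the storage engine's index,
# not an in-memory list; the shape of the query is the same.
logs = [
    {"trace_id": "a1", "content": "request received"},
    {"trace_id": "b2", "content": "request received"},
    {"trace_id": "a1", "content": "NullPointerException in handler"},
]

def query_by_trace_id(logs, trace_id):
    """Return the content of every log line belonging to one request."""
    return [l["content"] for l in logs if l["trace_id"] == trace_id]
```

The difficulty is not the filter itself but making it fast when `trace_id` has billions of distinct values, which is what demands a secondary index later in this article.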
4. Tag queries
Tag queries cover fields such as service name and container IP. This is also one of the hard requirements.
5. Full-text search
Whether full-text search is a hard requirement is actually worth weighing. If the collection side has done some preprocessing, for example splitting the whole content line into single-value fields such as time, level, and exception type at collection time, then full-text search may not be a hard requirement, and the range of candidate storage solutions becomes larger. One reminder: no full-text search support does not mean no fuzzy search. With the high compression ratio and efficient I/O of columnar storage, fuzzy filtering in memory also works very well.
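A sketch of the collection-side preprocessing mentioned above, assuming a hypothetical log line layout (real collectors would carry per-application parsing configuration):

```python
import re

# Assumed line shape: "2024-01-01 12:00:00 ERROR java.lang.NullPointerException: msg"
LINE_RE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def parse_line(line: str) -> dict:
    """Split one raw content line into searchable single-value fields,
    so the storage layer can index level/time without full-text search."""
    m = LINE_RE.match(line)
    if m is None:
        return {"message": line}  # fall back to storing the raw content
    return m.groupdict()

parsed = parse_line("2024-01-01 12:00:00 ERROR java.lang.NullPointerException: boom")
```

Once `level` and `time` are their own fields, exact and range queries replace most full-text searches.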
6. Aggregate statistics
The simplest aggregation is a Count; more complex is support for aggregation charts across more field dimensions. Some products provide these functions, but whether this is a hard requirement depends on your specific needs.
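The simplest Count aggregation, grouped by a tag dimension, can be sketched like this (service/level field names are assumptions):

```python
from collections import Counter

logs = [
    {"service": "order", "level": "ERROR"},
    {"service": "order", "level": "INFO"},
    {"service": "user",  "level": "ERROR"},
    {"service": "order", "level": "ERROR"},
]

# Count of ERROR-level logs per service, i.e. a "group by tag, count" query.
errors_per_service = Counter(
    l["service"] for l in logs if l["level"] == "ERROR"
)
```

A columnar store executes exactly this kind of query efficiently because it only needs to read the `service` and `level` columns.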
7. Proactive alerting
Proactive alerting means the system not only supports passive queries, but can also find problems and raise alarms in time, which shortens failure time.
With these 7 business requirements and characteristics laid out, we can quickly derive what the architecture design needs to take into account.
Architecture requirements
1. Hardware and software costs
Cost always matters; no matter what the design is, it must be considered, and it includes both hardware and software.
Hardware cost means machine resources: CPU, memory, disk, and so on. There is a subtlety here. Logs have a large total volume and individual entries can also be large. If we determine that full-text retrieval is not required, or that only a few key fields need to be retrievable while the longer fields only need to be displayed alongside the search results, then for the data that does not need indexing we can consider cheaper storage (such as OSS) to reduce the overall storage cost.
On the other hand, software cost must also be considered. Taking the OSS example just mentioned: if we store the indexed data in a highly efficient way and put the non-queried fields in OSS, the architecture becomes more complex, development complexity and difficulty rise, the human investment grows, and the overall software cost increases accordingly.
2. Storage needs an expiration mechanism
The strong timeliness of the data also places requirements on the storage mechanism: for data expiration, you need to consider how to bound the performance cost of expiry deletion so that it does not noticeably affect the throughput of the whole system.
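One way to bound the cost of expiry, sketched here with an in-memory list standing in for the real storage engine (batch size and store shape are assumptions):

```python
import time

def expire_old_logs(store, ttl_seconds, batch_size=1000, now=None):
    """Delete expired entries in small batches so a single expiry pass
    cannot monopolize I/O and hurt write throughput.
    `store` is a list of dicts sorted ascending by "timestamp" (seconds)."""
    now = time.time() if now is None else now
    cutoff = now - ttl_seconds
    removed = 0
    while store and store[0]["timestamp"] < cutoff and removed < batch_size:
        store.pop(0)
        removed += 1
    return removed  # caller reschedules another pass until 0 is returned
```

Real engines achieve the same effect more cheaply by dropping whole time-partitioned segments (e.g. daily indices) instead of individual rows.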
3. Asynchronous processing , The throughput should be big , Can't be overwhelmed by business traffic
stay Large amount of data In the scene of , The log system is required to accept extreme data , Asynchronous processing and other means need to be considered , Ensure that you cannot be overwhelmed by business traffic . If there is a problem with the business system , The logging system also has problems , As a result, the log system cannot be used to query 、 Check business problems , Then the significance of this platform will be challenged . Generally speaking, we use MQ This middleware is used to do asynchronous peak shaving and valley filling processing .
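A toy sketch of this peak shaving, using a bounded in-process queue as a stand-in for the MQ (queue size, batch size, and the drop policy are assumptions; a real broker such as Kafka would buffer durably instead of dropping):

```python
import queue

buf = queue.Queue(maxsize=10000)  # bounded buffer between collectors and storage
dropped = 0

def collect(log_line: str) -> bool:
    """Producer side: never block the business application."""
    global dropped
    try:
        buf.put_nowait(log_line)
        return True
    except queue.Full:
        dropped += 1  # with a real MQ the broker absorbs the spike instead
        return False

def drain(batch_size=500):
    """Consumer side: pull logs at the storage layer's own pace, in batches."""
    batch = []
    while not buf.empty() and len(batch) < batch_size:
        batch.append(buf.get_nowait())
    return batch
```

The key property is the decoupling: a write spike fills the buffer rather than stalling producers, and the consumer writes to storage in efficient batches.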
4. Ad hoc query capability (memory, cache, parallelism, efficient filtering, and other mechanisms)
Given the query-speed requirement, storage schemes that can respond within seconds are preferred.
5. A storage structure friendly to time-range queries
Time-range queries are among the most frequent scenarios. For these, we can consider a time-series database, since it is cost-optimized for time-based queries and is also an efficient option.
6. Secondary index capability
High-cardinality point queries require indexing capability, which limits our choice of time-series databases. A time-series database like Prometheus has problems with high-cardinality point queries, due to the characteristics of its storage. Generally speaking, supporting high-cardinality point queries requires a database with secondary index capability.
7. Full-text retrieval capability
How to support fuzzy queries is also a factor to consider. If you want complete full-text retrieval support, such as tokenization, relevance scoring, and ranking, the options narrow further.
8. A storage structure friendly to aggregation (such as columnar storage)
For aggregate statistics, columnar storage offers high aggregation performance: its high compression ratio lets many useful values be read in one pass, making the overall I/O very efficient.
9. Alerting module
The last point is how to position the alerting module in the architecture.
The above are the architecture requirements derived from the data characteristics and business requirements. As you can see, the core trade-off lies in storage. The following figure shows the processing flow and key components of the whole architecture; the next section covers the selection of the storage part.
Storage scheme selection
For the storage part in the figure above, several open-source storage middleware options are available: Cassandra, HBase, Elasticsearch, ClickHouse, Grafana Loki, and so on. The following figure compares the advantages and disadvantages of these options:
The selection chart above describes each scheme in relative detail, so I won't repeat it here. Next, let's look at how Erda does it.
Erda's log analysis platform in practice
Erda currently uses a common implementation: Elasticsearch as the underlying storage. There are historical reasons for this. Although Erda has not been open source for long, the project itself has existed for quite a while, and at the time Elasticsearch was a reasonable choice. Of course, the status quo is not the end; we will keep exploring better solutions in terms of cost and efficiency.
If you choose a scheme like ES, how should you proceed?
The selection chart above lists the general core ideas of using Elasticsearch; let's take a closer look at how it is done.
As mentioned before, the Elasticsearch-based scheme is full-featured and works out of the box. The figure above lists some of the key Elasticsearch capabilities used in Erda to implement the key functions Erda wants to provide, as shown in the figure below.
Overall, Erda's current scheme is still conventional, with some small optimizations on top. The code is also abstracted by a layer rather than being tied to Elasticsearch, so that other alternative schemes can be supported in the future.
The future direction of Erda's log analysis platform
For Erda's log analysis platform, we have several directions for the future:
1. More efficient, scalable storage
The first is storage. As mentioned above, Erda's storage plan is currently based mainly on Elasticsearch, but Elasticsearch has a drawback that cannot be ignored: its overall resource cost is relatively high. Alternatives such as ClickHouse and Grafana Loki do have strengths worth learning from. So Erda, as an open-source product, may support more underlying storage options later, letting users choose among these schemes according to their own needs and costs.
In addition, self-developed storage will be one of our investment directions, because in the monitoring field there are not only logs but also metrics and trace data. Whether these data types can share a unified storage kernel to reduce system complexity, while allowing specialized optimizations per data type to balance cost and performance, is the starting point for considering self-developed storage.
2. More convenient, more intelligent alerting
Currently on the Erda platform, the path from log analysis to creating an alert rule is still a little long, so in the future we will optimize the product and functional experience along this path.
Another direction is intelligence: automatic anomaly detection based on logs. This was briefly mentioned above: even if the user configures no rules explicitly, the system can still surface unexpected anomalies. An "anomaly" here is not necessarily an error stack thrown by a business application; it is defined relative to "normal", meaning a highly unusual data point suddenly appears in an otherwise normal data stream. Detecting this may require machine learning models.
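As a very simple stand-in for such models, an anomaly can be defined statistically: flag the per-minute count of a log pattern when it deviates strongly from its own history (the z-score threshold is an assumption, not Erda's actual method):

```python
import statistics

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` (e.g. this minute's count of a log pattern) as
    anomalous when it deviates from its history by more than `threshold`
    standard deviations. A toy baseline for the ML models mentioned above."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold
```

A sudden burst of a rarely seen log template would trip this check without anyone having written a rule for it, which is exactly the "discover the unexpected" behavior described above.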
That is what I wanted to share about log analysis architecture. Going forward, we will stay true to our original goal, continuously optimizing the user experience and product features. We look forward to more interested developers joining us to build Erda together; we will listen carefully to every suggestion and hope more voices help us become better!
For more technical content, follow the [Erda] official account and grow together with many open-source enthusiasts~







