
Cloud-native observability: tracing technology through the eyes of a Baidu engineer

2022-07-23 15:04:00 【Baidu geek said】

author | daydreamer

One. Introduction to the concepts

In the cloud-native domain, observability means inferring and measuring a system's internal state from its external outputs; it describes how well we can understand what is happening inside the system. The three common pillars of observability are Metrics, Tracing, and Logging:

Metrics: the defining characteristic of metrics is that they are aggregatable. They are the atoms that compose into a single logical gauge, counter, or histogram over a period of time. For example, the number of incoming HTTP requests can be modeled as a counter, whose updates aggregate by simple addition.
Tracing: its defining characteristic is that it deals with request-scoped information, that is, any data or metadata that can be bound to the lifecycle of a single transactional object in the system. For example, the actual text of a SQL query sent to a database.
Logging: the defining characteristic of a log is that it deals with discrete events. For example, application debug or error messages written to a rotated log file and shipped to a cluster for centralized processing.

Of the three, metrics tend to require the fewest resources to manage, because by nature they "compress" very well. Logging, by contrast, tends to be overwhelming and often exceeds the production traffic it reports on. In terms of data volume and cost, on a scale from metrics (low) to logging (high), tracing probably sits somewhere in the middle.
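To make the difference in data volume concrete, here is a minimal Python sketch. The names `RequestCounter`, `merge`, and `simulate` are hypothetical and do not come from any monitoring library. It models incoming HTTP requests as a counter whose aggregation is simple addition, and contrasts it with a log that grows by one record per request.

```python
from dataclasses import dataclass


@dataclass
class RequestCounter:
    """A counter metric: a single number, however many requests it has seen."""
    value: int = 0

    def inc(self, n: int = 1) -> None:
        self.value += n

    def merge(self, other: "RequestCounter") -> "RequestCounter":
        # Aggregating two counters is simple addition.
        return RequestCounter(self.value + other.value)


def simulate(requests_per_instance: list[int]) -> tuple[RequestCounter, list[str]]:
    """Feed the same traffic to one counter per instance and to a shared log."""
    total = RequestCounter()
    log_lines: list[str] = []
    for instance, n in enumerate(requests_per_instance):
        local = RequestCounter()
        for i in range(n):
            local.inc()
            log_lines.append(f"instance={instance} request={i} status=200")
        total = total.merge(local)
    return total, log_lines


if __name__ == "__main__":
    total, log_lines = simulate([3, 5, 2])
    # The metric stays one integer; the log holds one line per request.
    print(total.value, len(log_lines))
```

The counter keeps the same size no matter how much traffic arrives, which is the "compression" the paragraph above refers to, while the log grows linearly with the number of requests.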

Of course, the point of observability is to better serve business systems, and in practice implementations are tailored to business needs. Viewed abstractly, a single observability system may cover several of these areas. For example, the open-source Prometheus project started out as a monitoring system; over time it may move toward tracing, and so into request-scoped monitoring, but it probably will not go deep into the logging space.

Two. The tracing data model (using the OpenTelemetry standard as an example)

We use the OpenTelemetry standard [1] as an example to briefly introduce the general tracing data model.

The classic model of tracing data originally comes from Google's Dapper paper [2]. Building on that model, the standard defines a set of general data-reporting interfaces that every distributed tracing system is expected to implement. Instrumentation written against these interfaces works with any tracing system that complies with the standard, so developers can switch between distributed tracing systems as business needs dictate.

In the OpenTelemetry standard, a trace is defined implicitly by its Spans. In particular, a Trace can be thought of as a directed acyclic graph (DAG) of Spans, where the edges between Spans are called References (parent/child relationships).

Every Span encapsulates the following state:

Name
Start and End Timestamps
Span Context
A Span Context uses two identifiers to provide context about the trace and the span: the Trace ID and the Span ID. Every Span is identified by an ID that is unique within its Trace, called the Span ID. A Span uses the Trace ID to mark which trace it belongs to. The Span Context is what describes relationships across service and process boundaries.
Attributes
Key-value pairs containing metadata. You can use them to annotate a Span with information about the operation it is tracking.
Span Events
Can be thought of as structured log messages (or annotations) on a Span, typically used to mark a meaningful, single point in time during the Span's duration.
Span Links
Links associate one span with one or more other spans, describing an upstream/downstream relationship in execution. For example, suppose we have a distributed system in which, in response to some operation (call it operation a), an additional operation (call it operation b) is queued for execution, and operation b runs asynchronously. We want to associate operation b with operation a, but we cannot predict when operation b will start. In this case we link the last span of operation a to the first span of operation b, which describes their upstream/downstream relationship.
Span Status
The status code of the Span.
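The span fields listed above can be sketched as plain Python data classes. This is a simplified illustration under stated assumptions, not the OpenTelemetry SDK: the classes `SpanContext` and `Span`, the helper `start_span`, and the status strings are all hypothetical. It reproduces the operation a / operation b example, where b runs in its own trace and is linked back to a.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # shared by every span in one trace
    span_id: str   # unique within the trace


@dataclass
class Span:
    name: str
    context: SpanContext
    parent_span_id: Optional[str] = None
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (timestamp, message) pairs
    links: list = field(default_factory=list)   # SpanContext of related spans
    status: str = "UNSET"

    def add_event(self, message: str) -> None:
        self.events.append((time.time(), message))

    def end(self, status: str = "OK") -> None:
        self.end_time = time.time()
        self.status = status


def start_span(name: str, parent: Optional[Span] = None, links=()) -> Span:
    # A child inherits its parent's trace ID; a root span starts a new trace.
    trace_id = parent.context.trace_id if parent else secrets.token_hex(16)
    ctx = SpanContext(trace_id, secrets.token_hex(8))
    parent_id = parent.context.span_id if parent else None
    return Span(name, ctx, parent_id, links=list(links))


if __name__ == "__main__":
    a = start_span("operation a")
    a.attributes["http.method"] = "GET"
    a.add_event("enqueued operation b")
    a.end()
    # Operation b runs later, in its own trace, linked back to operation a.
    b = start_span("operation b", links=[a.context])
    b.end()
    print(b.links[0] == a.context)
```

Only the `SpanContext` needs to travel between processes; the rest of the span stays local until it is reported, which is why the context is kept as a separate, immutable object here.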

Three. Tracing implementations in industry

Uber Jaeger

In 2016 Uber open-sourced Jaeger [3], an excellent tracing platform in the cloud-native field. The background of the project was that Uber's business was growing exponentially and the number of microservices was rising with it, while a mature tracing platform for observing a large-scale distributed microservice architecture was still missing. After it was open-sourced the project was widely embraced by the industry; it was accepted by the CNCF in 2017 and later became a graduated project. The platform moved from a design rooted in the monolithic API era to a distributed design, implemented unified context propagation, and shifted sampling-strategy decisions to the tracing backend, which allows the backend to adjust the sampling rate dynamically. The platform fully supports the OpenTelemetry standard and has long been a de facto standard product in the cloud-native field.
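Context propagation, mentioned above, means carrying the trace ID and span ID along with each outgoing request so that the receiving service can continue the same trace. The sketch below is an illustration only. It assumes the W3C Trace Context `traceparent` header layout (version, trace ID, span ID, flags); Jaeger clients have also used their own header format, and the function names `inject` and `extract` are hypothetical helpers rather than any library's API. Validation is simplified.

```python
import re
from typing import Optional, Tuple

# version-traceid-spanid-flags, following the W3C Trace Context layout.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool) -> None:
    """Write the current span context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None  # no usable context: the receiver starts a new trace
    trace_id, span_id, flags = match.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


if __name__ == "__main__":
    outgoing = {}
    inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
    print(outgoing["traceparent"])
    print(extract(outgoing))
```

Because the sampled flag travels with the context, a sampling decision made at the entry point of a request is honored by every downstream service.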

Jaeger's overall architecture is as follows:

Here, jaeger-client is the client-side collection component; it supports dynamic traffic simulation and is aware of storage pressure. jaeger-agent is responsible for sampling-related strategies. jaeger-collector is responsible for collecting, organizing, and forwarding tracing data. jaeger-ui and jaeger-query handle the platform's UI interaction. Integration is supported through middleware instrumentation, over HTTP and other protocols, and the underlying storage uses open-source storage platforms such as Cassandra and Elasticsearch.
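The idea of letting the backend decide the sampling rate can be sketched as follows. This is a minimal illustration, not Jaeger's implementation: the class `RemoteControlledSampler` and its methods are hypothetical, and how the new rate reaches the client (polling, push, and so on) is left out. The sampler makes a deterministic decision from the trace ID, so every service that sees the same trace reaches the same verdict.

```python
import random


class RemoteControlledSampler:
    """Probabilistic sampler whose rate is set by the tracing backend."""

    _MAX = 1 << 64  # trace IDs are reduced to [0, 2^64) and compared to a threshold

    def __init__(self, rate: float = 0.001) -> None:
        self.update_rate(rate)

    def update_rate(self, rate: float) -> None:
        # Called whenever a new sampling strategy arrives from the backend.
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sampling rate must be within [0, 1]")
        self.rate = rate
        self._threshold = int(rate * self._MAX)

    def should_sample(self, trace_id: int) -> bool:
        # Deterministic per trace ID: the same trace is kept or dropped everywhere.
        return (trace_id % self._MAX) < self._threshold


if __name__ == "__main__":
    sampler = RemoteControlledSampler(rate=0.1)
    ids = [random.getrandbits(128) for _ in range(100_000)]
    print(sum(sampler.should_sample(i) for i in ids) / len(ids))  # close to 0.1
    sampler.update_rate(0.5)  # the backend raises the rate
    print(sum(sampler.should_sample(i) for i in ids) / len(ids))  # close to 0.5
```

Keeping the decision in the backend means the rate can be lowered when storage is under pressure and raised again later, without redeploying the instrumented services.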

For details about Jaeger, see the official website: https://www.jaegertracing.io/

Alibaba EagleEye platform

EagleEye is a new-generation, log-based distributed call tracing system that Alibaba built for high-traffic events such as Double 11. It addresses online problems such as difficult fault localization, difficult capacity estimation, wasted resources, and call links that are hard to sort out. The platform has the following characteristics:

The architecture has become lighter with each iteration and data is presented closer to real time, moving from batch to stream processing
Monitoring is visualized end to end, integration costs are reduced, and the building and design of views is handed over to users
Data is sampled according to the analysis scenario; for example, analyzing link topology does not require the full data set

The platform adopts a unified log format and measurement standard. It can find hot spots and estimate capacity, and it supports stress-test traffic and handling of illegal traffic. It provides global call statistics, trace queries, real-time monitoring, and other functions. It supports protocols such as HTTP and TCP; integration is through middleware instrumentation, bytecode enhancement, and similar methods; and the underlying storage uses HDFS/HBASE/HSTORE/MPP and other databases. The following figure shows the global call topology of the platform.

For an introduction to the platform, see the Alibaba Cloud article "Best practice in building a three-dimensional monitoring system".

Four. Difficulties encountered in practice and their solutions

In Baidu's production systems we also use tracing widely, for real-time monitoring of the full call link, traffic and performance statistics, per-request trace queries, and case troubleshooting. Along the way we have gathered some practical experience:

1. The volume of tracing data is large. The main points are:

Collection pressure is high, so the SDK must be high-performance; reporting and transmission are spread across requests, so they are under less pressure
Optimize and implement a reasonable sampling strategy
Deeply optimize the encoding and mapping algorithms
Classify data by type and usage scenario, and choose different underlying storage for each

2. Integration cost matters a great deal and must be kept very low. The main points are:

Developers are reluctant to integrate non-business code, so the interfaces must be simple and easy to use; most instrumentation points should come from the underlying framework, and custom instrumentation points should be easy to add
SDK documentation should be concise and accurate, with good existing practice scenarios that can be reused directly
The return on investment should be high, and users should be able to see the practical problems the system solves

3. Stability requirements are high. The main lessons are:

Use local persistence as a buffer
Combine streams with tasks, for example by inserting trick tasks
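Using local persistence as a buffer, the first item above, can be sketched like this. It is a minimal illustration and not Baidu's implementation: the class `DiskBuffer`, the JSON-lines file layout, and the `send` callback are all assumptions. Spans are appended to a local file first and only removed once the backend has accepted them, so a backend outage or a process restart does not lose data.

```python
import json
import os
import tempfile


class DiskBuffer:
    """Append spans to a local file; drain them once the backend is reachable."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, span: dict) -> None:
        # One JSON document per line; flush and fsync so a crash loses little.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(span) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def drain(self, send) -> int:
        """Send every buffered span; keep the ones that failed for a retry."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path, encoding="utf-8") as f:
            spans = [json.loads(line) for line in f if line.strip()]
        failed = [s for s in spans if not send(s)]
        with open(self.path, "w", encoding="utf-8") as f:
            for s in failed:
                f.write(json.dumps(s) + "\n")
        return len(spans) - len(failed)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buf = DiskBuffer(os.path.join(tmp, "spans.jsonl"))
        buf.append({"name": "operation a", "duration_ms": 12})
        buf.append({"name": "operation b", "duration_ms": 40})
        # `send` stands in for the real reporting call and returns True on success.
        print(buf.drain(lambda span: True))
```

A production buffer would also need a size cap and rotation so that a long outage cannot fill the disk; those are omitted here to keep the sketch short.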

4. Some advanced features need to be implemented, for example:

Confidence analysis of metrics, and data science
Real-time analysis over multiple aggregation windows, covering both short-term timeliness and long-term trend analysis
More intuitive presentation, that is, how to show as much information as possible with as few metrics as possible
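The multi-window item in the list above can be sketched as one stream of samples feeding several trailing windows at once. This is a simplified illustration: the class `MultiWindowAverage`, the window sizes, and the choice of mean latency as the aggregate are assumptions, and a real system would use incremental aggregates rather than rescanning the samples.

```python
from collections import deque


class MultiWindowAverage:
    """Track the mean latency over several trailing time windows at once."""

    def __init__(self, windows_seconds=(60, 3600)) -> None:
        self.windows = tuple(sorted(windows_seconds))
        self._samples = deque()  # (timestamp, latency_ms), oldest first

    def observe(self, timestamp: float, latency_ms: float) -> None:
        self._samples.append((timestamp, latency_ms))
        # Drop samples that are older than the longest window.
        horizon = timestamp - self.windows[-1]
        while self._samples and self._samples[0][0] <= horizon:
            self._samples.popleft()

    def averages(self, now: float) -> dict:
        result = {}
        for window in self.windows:
            values = [v for t, v in self._samples if now - window < t <= now]
            result[window] = sum(values) / len(values) if values else None
        return result


if __name__ == "__main__":
    agg = MultiWindowAverage((60, 3600))
    for ts, latency in [(0, 100), (3000, 100), (3590, 400), (3599, 600)]:
        agg.observe(ts, latency)
    # The short window reacts to the recent spike; the long one shows the trend.
    print(agg.averages(now=3599))  # {60: 500.0, 3600: 300.0}
```

Reading both windows side by side is what gives short-term timeliness and long-term trend analysis from the same data.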

In general, as the OpenTelemetry standard has taken hold, cloud-native tracing technology has kept developing and has been widely adopted in production, where it strongly supports the stability, performance, and efficiency of large-scale distributed microservice systems. The view of a single Baidu engineer shows only a small part of the whole, but it offers a first glimpse of the breadth and depth of observability.

---------- END ----------

References:

[1] OpenTelemetry: https://opentelemetry.io/docs/concepts/signals/traces/#spans-in-opentelemetry

[2] Benjamin H. Sigelman, Luiz André Barroso, et al. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.

[3] Uber Jaeger: https://eng.uber.com/distributed-tracing/
