Get to know Prometheus

2022-06-25 10:29:00 yeedomliu

Author: kevinkrcai (Caikeren)

Mind map

Introduction

Official tagline: "From metrics to insight".

  1. Prometheus is a monitoring solution that lets us see the running state of a system at any time, so problems can be located and resolved quickly.
  2. Development was completed in 2012, and in 2016 it joined the CNCF as the second hosted project after Kubernetes. At the time of writing it has about 42k stars on GitHub and ships roughly one minor release per month.

Ecosystem overview

Prometheus provides a series of components covering metric exposure, metric scraping, storage, visualization, and finally monitoring and alerting.

Metric exposure

  1. Every service monitored by Prometheus is a Job.
  2. Prometheus provides official SDKs for these Jobs; with an SDK you can define and export your own business metrics.
  3. You can also use the Exporters that Prometheus officially provides for common components and middleware (for example MySQL and Consul).
  4. Short-lived script tasks may finish before Prometheus can pull their metrics directly. For these, Prometheus provides the PushGateway: the task pushes its metrics to the gateway, and Prometheus then pulls them from the gateway.

Metric scraping

Pull model

The monitoring service actively pulls metrics from the monitored service. The monitored service typically exposes a metrics port itself or exposes metrics through an Exporter, and the monitoring service relies on a service discovery module to find the monitored services and scrape them periodically.

Push model

The monitored service actively pushes metrics to the monitoring service. This may require protocol adaptation, because the pushed metrics must conform to the format the monitoring service expects.

Prometheus scrapes metrics using the Pull model. If scrape_interval is not set in the prometheus.yml configuration file, metrics are pulled once a minute by default. From the outside Prometheus always pulls: either the metrics exposed by an Exporter, or the metrics exposed by the PushGateway.

Metric storage and query

Scraped metrics are stored in the built-in time series database. Prometheus also provides the PromQL query language: we can query and visualize metrics with PromQL in the Prometheus web UI, and third-party visualization tools such as Grafana are easy to connect.

Monitoring and alerting

Prometheus provides Alertmanager for alerting based on PromQL. When the value returned by a PromQL expression crosses the threshold we defined, Prometheus sends an alert to Alertmanager, which delivers it to the configured email address or WeChat.

How it works

The path from registering a monitored service, through scraping its metrics, to querying them is divided into five steps:

Service registration

A monitored service exists in Prometheus as a Job, and each instance of that service exists as a target. Registering a monitored service therefore means registering a Job and all of its targets in Prometheus.

Static registration

With static registration, the IP address and scrape port of each service are written directly under scrape_configs in the Prometheus YAML file:

scrape_configs:
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]

Dynamic registration

With dynamic registration, the address of a service discovery system and a service name are configured under scrape_configs in the Prometheus YAML file. Prometheus queries that address and dynamically discovers the list of instances for the given service name. Prometheus supports Consul, DNS, files, Kubernetes, and other service discovery mechanisms. Consul-based service discovery:

scrape_configs:
  - job_name: "node_export_consul"
    metrics_path: /node_metrics
    scheme: http
    consul_sd_configs:
      - server: localhost:8500
        services:
          - node_exporter

Here the Consul address is localhost:8500, the service name is node_exporter, and one exporter instance is registered under this service: localhost:9600.

Note: with dynamic registration it is best to set the two options below. metrics_path defaults to /metrics, so if the service exposes its metrics on a different path, or is registered dynamically, set them explicitly. Otherwise the scrape fails with the error "INVALID" is not a valid start token. I hit this in the demo, and a quick search suggests the cause is a response that is not in the expected metrics format.

metrics_path: /node_metrics
scheme: http

Finally, the discovered instances can be viewed in the web UI.

Prometheus currently supports more than 20 service discovery mechanisms:

<azure_sd_config>
<consul_sd_config>
<digitalocean_sd_config>
<docker_sd_config>
<dockerswarm_sd_config>
<dns_sd_config>
<ec2_sd_config>
<openstack_sd_config>
<file_sd_config>
<gce_sd_config>
<hetzner_sd_config>
<http_sd_config>
<kubernetes_sd_config>
<kuma_sd_config>
<lightsail_sd_config>
<linode_sd_config>
<marathon_sd_config>
<nerve_sd_config>
<serverset_sd_config>
<triton_sd_config>
<eureka_sd_config>
<scaleway_sd_config>
<static_config>

Configuration update

After changing the Prometheus configuration file, the new configuration has to be loaded into the running process. There are two ways to do this. The first is simple and crude: restart Prometheus. The second is to reload the configuration dynamically, as follows.

Step 1: start Prometheus with the --web.enable-lifecycle flag:

prometheus --config.file=/usr/local/etc/prometheus.yml --web.enable-lifecycle

Step 2: update the Prometheus configuration file.

Step 3: send a POST request to reload the configuration dynamically:

curl -v --request POST 'http://localhost:9090/-/reload'

How it works: in its web module, Prometheus registers the following handlers:

if o.EnableLifecycle {
   router.Post("/-/quit", h.quit)
   router.Put("/-/quit", h.quit)
   router.Post("/-/reload", h.reload)  // reload the configuration
   router.Put("/-/reload", h.reload)   
}

The h.reload handler does the work. It sends a signal on a channel:

func (h *Handler) reload(w http.ResponseWriter, r *http.Request) {
   rc := make(chan error)
   h.reloadCh <- rc    // send the reply channel to whoever listens for reload requests
   if err := <-rc; err != nil {
      http.Error(w, fmt.Sprintf("failed to reload config: %s", err), http.StatusInternalServerError)
   }
}

The main function listens on this channel. Whenever a signal arrives, it reloads the configuration, loading the new configuration into memory:

case rc := <-webHandler.Reload():
   if err := reloadConfig(cfg.configFile, cfg.enableExpandExternalLabels, cfg.tsdb.EnableExemplarStorage, logger, noStepSubqueryInterval, reloaders...); err != nil {
      level.Error(logger).Log("msg", "Error reloading config", "err", err)
      rc <- err
   } else {
      rc <- nil
   }

Metric scraping and storage

Prometheus scrapes metrics by actively pulling: it periodically requests the metrics endpoint exposed by the monitored service or by the PushGateway. The configuration below, as in the sample configuration file, sets the scrape interval to 15s:

global:
 scrape_interval: 15s

Scraped metrics are held in memory as time series and flushed to disk periodically, every two hours by default. So that data can be recovered if Prometheus crashes or restarts, Prometheus also keeps a write-ahead log, similar to the binlog in MySQL. After a crash and restart it reads the write-ahead log to recover the data.

Metrics

Data model

All metrics collected by Prometheus are stored as time series. Each sample in a time series consists of three parts:

  1. Metric name and label set: metric_name{<label1=v1>,<label2=v2>....}. The metric name states what is being measured; for example, http_request_total is the number of requests. Labels describe the dimensions of the metric; for example, http_request_total has a status code label code = 200/400/500 and a request method label method = get/post. The metric name is itself stored as a label, named __name__, that is: __name__=<metric name>.
  2. Timestamp: the time of the sample, in milliseconds.
  3. Sample value: the value of the metric at that time; for http_request_total it is the number of requests.

All exposed metrics can be viewed through Prometheus's metrics endpoint.

Every metric is exposed in the following format:

# HELP    // HELP: describes the metric: what it is and what it measures
# TYPE    // TYPE: the type of the metric
<metric name>{<label name>=<label value>, ...}  value    // the sample itself: <metric name>{label set}  value
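To make the layout concrete, here is a small Go sketch that prints one metric in this shape. It is only an illustration: formatSample is a hypothetical helper, not part of any Prometheus library, and its escaping of label values (Go's %q) is a simplification of the real format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample (hypothetical helper) renders one sample as
// <metric name>{<label name>="<label value>",...} value
func formatSample(name string, labels map[string]string, value float64) string {
	if len(labels) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println("# HELP http_request_total Number of requests.")
	fmt.Println("# TYPE http_request_total counter")
	// prints http_request_total{code="200",method="get"} 42
	fmt.Println(formatSample("http_request_total",
		map[string]string{"code": "200", "method": "get"}, 42))
}
```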

Metric types

The underlying storage does not distinguish between kinds of metrics; everything is stored as time series. To make metrics easier to use and to understand, Prometheus defines 4 metric types:

  1. Counter
  2. Gauge
  3. Histogram
  4. Summary

Counter

A Counter behaves like the increment command in Redis: it only increases and never decreases. Counters are suited to monotonically increasing data such as the number of HTTP requests, the number of request errors, or the number of API calls. Combined with functions such as increase and rate, they can be used to compute rates of change; these built-in functions are covered later.

Gauge

Unlike a Counter, a Gauge can go up or down. It reflects dynamic data that can rise or fall, such as current memory usage, CPU utilization, or GC counts. A Gauge shows how the data changes directly, without the help of built-in functions. One example is the allocatable space of the heap.

Histogram

Histogram and Summary are both statistical metrics that describe the distribution of data. With a Histogram we can observe how observations are distributed across ranges; for example, how request durations are distributed across buckets.

A Histogram is cumulative: each bucket has only an upper bound. For example, the number of requests taking at most 0.1 milliseconds (le="0.1") is 18173, and the number taking at most 0.2 milliseconds (le="0.2") is 18182. The le="0.2" bucket includes the data of the le="0.1" bucket, so the number of requests between 0.1 and 0.2 milliseconds is the difference between the two buckets.
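The subtraction can be written out with the two counts quoted above. This is a small sketch; perBucketCounts is a hypothetical helper name.

```go
package main

import "fmt"

// perBucketCounts converts cumulative bucket counts (ordered by ascending
// le) into the number of observations that fell into each single bucket.
func perBucketCounts(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// le="0.1" -> 18173, le="0.2" -> 18182 (values from the text)
	fmt.Println(perBucketCounts([]uint64{18173, 18182})) // prints [18173 9]
}
```

So only 9 requests fell between 0.1 and 0.2.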

On a histogram, the histogram_quantile function can also be used to compute percentiles such as P50, P90, and P99.

Summary

A Summary is also used for statistical analysis. The difference from a Histogram is that a Summary stores the percentiles directly, so the median, P90, and P99 of the samples can be read off immediately.

The percentiles of a Summary are computed by the client and simply scraped by Prometheus, so the server does no computation. The percentiles of a Histogram are computed on the Prometheus server by the built-in function histogram_quantile.

Exporting metrics

There are two ways to export metrics. One is to use an Exporter provided by the Prometheus community for a specific component such as MySQL or Kafka. The other is to use a community client library to export custom metrics.

github.com/prometheus/client_golang/prometheus/promhttp

A custom Prometheus exporter:

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main()  {
   http.Handle("/metrics", promhttp.Handler())
   http.ListenAndServe(":8080", nil)
}

Visit http://localhost:8080/metrics to see the exported metrics. No custom metrics are defined yet, but some built-in Go runtime metrics and promhttp metrics are already there, because the client exposes them by default:

go_*: metrics prefixed with go_ describe the Go runtime, such as garbage collection time and the number of goroutines. They are specific to the Go client library; client libraries for other languages may expose runtime metrics for their own languages.

promhttp_*: metrics from the promhttp package, used to track the handling of metrics requests.

Add a custom metric:

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus"
   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {

   // 1. Define the metric (type, name, help text)
   myCounter := prometheus.NewCounter(prometheus.CounterOpts{
      Name: "my_counter_total",
      Help: "Custom counter",
   })
   // 2. Register the metric
   prometheus.MustRegister(myCounter)
   // 3. Set the metric value
   myCounter.Add(23)

   http.Handle("/metrics", promhttp.Handler())
   http.ListenAndServe(":8080", nil)
}

Run it to see the new metric.

Next, simulate reporting the request count of an API endpoint in a business service:

package main

import (
   "fmt"
   "net/http"

   "github.com/prometheus/client_golang/prometheus"
)

var (
   MyCounter prometheus.Counter
)

// init registers the metric
func init() {
   // 1. Define the metric (type, name, help text)
   MyCounter = prometheus.NewCounter(prometheus.CounterOpts{
      Name: "my_counter_total",
      Help: "Custom counter",
   })
   // 2. Register the metric
   prometheus.MustRegister(MyCounter)
}

// Sayhello handles /counter and counts every request
func Sayhello(w http.ResponseWriter, r *http.Request) {
   // Increment the request counter
   MyCounter.Inc()
   fmt.Fprintf(w, "Hello World!")
}

main.go:

package main

import (
   "net/http"

   "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {

   http.Handle("/metrics", promhttp.Handler())
   http.HandleFunc("/counter",Sayhello)
   http.ListenAndServe(":8080", nil)
}

At startup, the counter is 0.

After calling the /counter endpoint the value changes, which gives a simple count of requests to the endpoint.

Other metric types are defined in the same way:

var (
   MyCounter prometheus.Counter
   MyGauge prometheus.Gauge
   MyHistogram prometheus.Histogram
   MySummary prometheus.Summary
)

// init registers the metrics
func init() {
   // 1. Define the metrics (type, name, help text)
   MyCounter = prometheus.NewCounter(prometheus.CounterOpts{
      Name: "my_counter_total",
      Help: "Custom counter",
   })
   // Define a gauge
   MyGauge = prometheus.NewGauge(prometheus.GaugeOpts{
      Name: "my_gauge_num",
      Help: "Custom gauge",
   })
   // Define a histogram
   MyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
      Name:    "my_histogram_bucket",
      Help:    "Custom histogram",
      Buckets: []float64{0.1, 0.2, 0.3, 0.4, 0.5}, // upper bounds of the buckets
   })
   // Define a summary
   MySummary = prometheus.NewSummary(prometheus.SummaryOpts{
      Name: "my_summary_bucket",
      Help: "Custom summary",
      // Quantiles to track, each mapped to its allowed error
      Objectives: map[float64]float64{
         0.5:  0.05,
         0.9:  0.01,
         0.99: 0.001,
      },
   })

   // 2. Register the metrics
   prometheus.MustRegister(MyCounter)
   prometheus.MustRegister(MyGauge)
   prometheus.MustRegister(MyHistogram)
   prometheus.MustRegister(MySummary)
}

The metrics above carry no labels, but metrics usually do. How are labels set? To define a labeled counter, replace NewCounter with NewCounterVec and pass in the list of label names:

MyCounter *prometheus.CounterVec
// 1. Define the metric (type, name, help text)
MyCounter = prometheus.NewCounterVec(
   prometheus.CounterOpts{
      Name: "my_counter_total",
      Help: "Custom counter",
   },
   // Label names
   []string{"label1", "label2"},
)
// Increment the series identified by these label values
MyCounter.With(prometheus.Labels{"label1": "1", "label2": "2"}).Inc()

PromQL

We have covered the metric types in Prometheus and how to export our own metrics. Once the metrics are in Prometheus, they can be queried with PromQL, the functional query language Prometheus provides. An expression evaluates to one of four types:

  1. String: only used as an argument to some built-in functions
  2. Scalar: a single numeric value; it can be a function argument or a function's return value
  3. Instant vector: time series data at a single point in time
  4. Range vector: a set of time series data over a time range

Instant queries

You can query directly by metric name; the result is the most recent sample of each matching time series. For example, the cumulative number of GC cycles observed (the _count series of go_gc_duration_seconds):

go_gc_duration_seconds_count

The result contains several series with the same metric name. Use {} to filter by label; for example, to see only the series of a specific instance:

go_gc_duration_seconds_count{instance="127.0.0.1:9600"}

Regular expressions are also supported through =~. The following queries all series whose instance starts with localhost:

go_gc_duration_seconds_count{instance=~"localhost.*"}

Range queries

The result of a range query is a range vector. Use [] to specify the time range.

The GC cycle count over the last 5 minutes:

go_gc_duration_seconds_count{}[5m]

Note: the first point of a range query is not necessarily the sample from exactly 5 minutes ago. The query takes the 5-minute window and returns the samples from the first to the last one that fall inside it.
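A small Go sketch shows why the first returned sample is usually younger than 5 minutes. The scrape times are made up, selectRange is a hypothetical helper, and treating the window as open on the left is an assumption of this sketch.

```go
package main

import "fmt"

// selectRange returns the timestamps (in seconds) that fall inside the
// window (now-rangeSeconds, now]. Timestamps are assumed to be sorted.
func selectRange(timestamps []int64, now, rangeSeconds int64) []int64 {
	var out []int64
	for _, t := range timestamps {
		if t > now-rangeSeconds && t <= now {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Hypothetical scrape times, 60 s apart, not aligned with the query time.
	scrapes := []int64{5, 65, 125, 185, 245, 305, 365}
	// Query at t=370 with a 5-minute (300 s) range: the window is (70, 370].
	fmt.Println(selectRange(scrapes, 370, 300)) // prints [125 185 245 305 365]
}
```

The first sample in the result was scraped 245 seconds before the query time, not 300.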

Time units:

d: day, h: hour, m: minute, ms: millisecond, s: second, w: week, y: year

An offset modifier, similar to offset in SQL, is also supported. The following returns the 5-minute window of samples as it was one day ago:

go_gc_duration_seconds_count{}[5m] offset 1d

Built-in functions

Prometheus has many built-in functions. Here we cover a few commonly used ones.

rate and irate: the rate function computes the average rate of change of a metric:

rate = (value of the last sample in the range - value of the first sample in the range) / length of the time range

rate is typically used to compute the request rate over a time range, which is what we usually call QPS.

However, rate only computes the average over the range, so it cannot reflect sudden changes. Suppose that in a one-minute range the request count sits between 0 and 10 per second for the first 50 seconds, then jumps above 100 per second in the last 10 seconds. The average hides that peak. The irate function addresses this by giving the instantaneous rate of change:

irate = (difference between the last two samples in the range) / (time between the last two samples)
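The two formulas can be compared on a made-up counter that is quiet for 50 seconds and then bursts. This sketch follows the simplified formulas in the text only; it does not model the additional handling the real PromQL functions perform (for example counter resets). The type and function names are hypothetical.

```go
package main

import "fmt"

// sample is one (timestamp, value) point of a counter series.
type sample struct {
	t float64 // seconds
	v float64 // counter value
}

// simpleRate follows the text: (last - first) / length of the time range.
func simpleRate(s []sample, rangeSeconds float64) float64 {
	if len(s) < 2 || rangeSeconds <= 0 {
		return 0
	}
	return (s[len(s)-1].v - s[0].v) / rangeSeconds
}

// simpleIrate follows the text: difference of the last two samples
// divided by the time between them.
func simpleIrate(s []sample) float64 {
	if len(s) < 2 {
		return 0
	}
	a, b := s[len(s)-2], s[len(s)-1]
	if b.t == a.t {
		return 0
	}
	return (b.v - a.v) / (b.t - a.t)
}

func main() {
	// 1 request/s for the first 50 s, then 100 requests/s in the last 10 s.
	s := []sample{{0, 0}, {10, 10}, {20, 20}, {30, 30}, {40, 40}, {50, 50}, {60, 1050}}
	fmt.Println(simpleRate(s, 60)) // prints 17.5
	fmt.Println(simpleIrate(s))    // prints 100
}
```

The average of 17.5 requests per second hides the burst, while the instantaneous value of 100 exposes it.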

Plotting the two shows the difference: the irate graph has sharp peaks, while the rate graph changes smoothly.

(Figures: rate function; irate function.)

Aggregation: sum() with by() and without(). Continuing the example above, when we query the QPS of a given endpoint there may be results for several instances and several endpoints; here the query returns three QPS series:

rate(demo_api_request_duration_seconds_count{job="demo", method="GET", status="200"}[5m])

The sum function aggregates the three QPS series into the QPS of the whole service; sum simply adds the values together.

Adding everything up is often too coarse. sum can be combined with by and without to group by certain labels, similar to GROUP BY in SQL. For example, grouping by the request path label gives the QPS of each endpoint:

sum(rate(demo_api_request_duration_seconds_count{job="demo", method="GET", status="200"}[5m])) by(path)

You can also aggregate away the path label and keep all the others, using without:

sum(rate(demo_api_request_duration_seconds_count{job="demo", method="GET", status="200"}[5m])) without(path)
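The grouping behaviour of by and without can be imitated in a few lines of Go. This is an illustrative sketch with made-up series and hypothetical helper names; it shows only which labels form the group key, not how PromQL is implemented.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// series is one labeled value, e.g. the QPS of one instance and path.
type series struct {
	labels map[string]string
	value  float64
}

// groupKey builds a stable key from the labels that survive aggregation.
func groupKey(labels map[string]string, keep func(name string) bool) string {
	var parts []string
	for k, v := range labels {
		if keep(k) {
			parts = append(parts, k+"="+v)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// sumBy imitates sum(...) by(label): only the named label is kept.
func sumBy(in []series, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		out[groupKey(s.labels, func(n string) bool { return n == label })] += s.value
	}
	return out
}

// sumWithout imitates sum(...) without(label): every other label is kept.
func sumWithout(in []series, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range in {
		out[groupKey(s.labels, func(n string) bool { return n != label })] += s.value
	}
	return out
}

func main() {
	in := []series{
		{map[string]string{"instance": "a", "path": "/login"}, 1},
		{map[string]string{"instance": "b", "path": "/login"}, 2},
		{map[string]string{"instance": "a", "path": "/order"}, 4},
	}
	fmt.Println(sumBy(in, "path"))      // prints map[path=/login:3 path=/order:4]
	fmt.Println(sumWithout(in, "path")) // prints map[instance=a:5 instance=b:2]
}
```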

The histogram_quantile function computes percentiles. The first argument is the quantile and the second is a histogram metric. The following computes the median, P50:

histogram_quantile(0.5,go_gc_pauses_seconds_total_bucket)

Here is a pitfall a colleague and I ran into. Add a few histogram observations to the custom exporter written earlier:

MyHistogram.Observe(0.3)
MyHistogram.Observe(0.4)
MyHistogram.Observe(0.5)

With the histogram buckets set as follows:

MyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
   Name: "my_histogram_bucket",
   Help: " Customize histogram",
   Buckets: []float64{0,2.5,5,7.5,10},    //  Bucket needs to be specified 
})

With these buckets every observation lands in the first bucket, 0 to 2.5. Mathematically the median must be between 0 and 2.5, and more precisely between 0.3 and 0.5. But histogram_quantile returns 1.25, which is clearly wrong:

histogram_quantile(0.5,my_histogram_bucket_bucket) 

Computing P99 gives 2.475:

histogram_quantile(0.99,my_histogram_bucket_bucket)

None of my observations is greater than 1, so why are P50 and P99 so far off? Prometheus does not store the individual observed values; it only counts them into buckets. The quantile is therefore an estimate. How is it estimated? Prometheus assumes the samples in each bucket are uniformly (linearly) distributed. P50 is the value at the 50% position. Because all the data fell into the first bucket, the estimate assumes the 50% value sits at the midpoint of that bucket: 0.5 * 2.5 = 1.25. P99 is the 99% position of the first bucket: 0.99 * 2.5 = 2.475. The error comes from an unreasonable choice of buckets. Redefine the buckets:

//  Definition histogram
MyHistogram = prometheus.NewHistogram(prometheus.HistogramOpts{
   Name: "my_histogram_bucket",
   Help: " Customize histogram",
   Buckets: []float64{0.1,0.2,0.3,0.4,0.5},   //  Bucket needs to be specified 
})

Report the data:

MyHistogram.Observe(0.1)
MyHistogram.Observe(0.3)
MyHistogram.Observe(0.4)

Recalculate P50 and P99.

The better the buckets match the data, the smaller the estimation error.
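The linear interpolation described above can be reproduced with a short Go sketch. It is a simplified model of the estimate, not the Prometheus implementation: estimateQuantile is a hypothetical function, the lowest bucket is assumed to start at 0, and the +Inf bucket is left out.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // the le value
	count      float64 // cumulative number of observations <= upperBound
}

// estimateQuantile finds the bucket that contains the rank q*total and
// interpolates linearly inside it, assuming the observations in a bucket
// are spread uniformly. Buckets must be sorted by upperBound.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return 0
	}
	rank := q * total
	lower, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank && b.count > prevCount {
			return lower + (b.upperBound-lower)*(rank-prevCount)/(b.count-prevCount)
		}
		lower, prevCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Buckets 0, 2.5, 5, 7.5, 10 after observing 0.3, 0.4 and 0.5:
	// all three observations are counted in the le="2.5" bucket.
	coarse := []bucket{{0, 0}, {2.5, 3}, {5, 3}, {7.5, 3}, {10, 3}}
	fmt.Printf("%.3f\n", estimateQuantile(0.5, coarse))  // prints 1.250
	fmt.Printf("%.3f\n", estimateQuantile(0.99, coarse)) // prints 2.475
}
```

With the coarse buckets the sketch returns the same 1.25 and 2.475 as in the text, even though every observation is at most 0.5.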

Grafana visualization

Besides the web UI that Prometheus provides, Grafana can be used to visualize metrics. Step 1: connect the data source.

Configure the Prometheus address.

Step 2: create a dashboard.

Edit the dashboard.

Write a PromQL query in the metrics field to complete the query and visualization.

Once the dashboard is finished, it can be exported as a JSON file so the same dashboard can be imported later.

The above is a dashboard I built earlier.

Monitoring and alerting

Alertmanager is the alert distribution component provided by Prometheus. It handles grouping, sending, and silencing of alerts. Once configured, the alerting rules are visible in the web UI. Alert rules are also defined with PromQL. The following rule fires when the http_srv service is down, Prometheus cannot scrape its metrics, and the condition lasts for 1 minute:

groups:
- name: simulator-alert-rule
  rules:
  - alert: HttpSimulatorDown
    expr: sum(up{job="http_srv"}) == 0 
    for: 1m
    labels:
      severity: critical

In prometheus.yml, configure the Alertmanager address and the path of the alert rule file:

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "alert_rules.yml"
    #- "first_rules.yml"

Alert delivery settings, such as the destination address, the message template, and the grouping strategy, are configured in the Alertmanager configuration file:

global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxx'
  smtp_require_tls: false

route:
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'mail-receiver'

#  group_by             // which labels to group alerts by
#  group_wait           // how long to wait before sending the first notification for a new group, so that other alerts in the same group can be sent together
#  group_interval       // how long to wait before notifying about new alerts added to a group that has already been notified
#  repeat_interval      // how long to wait before re-sending a notification for alerts that are still firing; a larger value reduces notification frequency
#  receiver             // who receives the notification
#  routes               // child route configuration
receivers:
- name: 'mail-receiver'
  email_configs:
    - to: '[email protected]'

When I kill the process:

Prometheus detects the alert condition:

After 1 minute, if the condition still holds, the state changes from pending to FIRING and an email is sent to my mailbox.

At this point my mailbox received the alert email.

Alertmanager also supports silencing alerts, which can be configured in the Alertmanager web UI.

After 4 minutes no alert had been received, so the silence was in effect.

Copyright notice
This article was created by [yeedomliu]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/176/202206250956388676.html