Bosun query

2022-06-24 15:20:00 Wang Lei -ai Foundation

Background

Bosun is an open-source monitoring and alerting system from Stack Exchange; the closest counterpart is Prometheus's Alertmanager. Bosun is designed as an alerting system that can be configured on top of a variety of TSDBs, and it also provides a DSL for querying and evaluating metrics. That makes Bosun a TSDB-agnostic metric query language as well (it currently supports OpenTSDB, Prometheus, InfluxDB, Elasticsearch and several other backends). To understand how Bosun generates an alert, or simply to use its query capability together with a monitoring front end such as Grafana to display metrics, you need to understand this language.

Bosun is not a very popular project: it has about 3.1k stars, and there is little material about it, most of which is a literal translation of the official documentation. This article introduces how to query with Bosun (mainly against an OpenTSDB backend) and shares some query tips.

Concepts

First, some of the types used in Bosun queries:

  1. Scalar: just a number.
  2. NumberSet: essentially a Scalar, but with a group attached, that is, a set of tags; the empty {} is also a group.
  3. SeriesSet: the most common format for raw metrics. Unlike a NumberSet, its value is not a single number but a set of timestamped values, for example 3.14 at time 100 and 3.28 at time 200.
  4. Results: not a concept the documentation introduces, but the type you meet most often in practice. It is the usual result of a query: a collection of SeriesSets or NumberSets with different tags. The documentation also calls each distinct tag combination a group.
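To make these types concrete, here is a minimal Python model of them. It is only a sketch for illustration: the names Result, Group and group_of are hypothetical and are not part of Bosun, which implements these types in Go.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple, Union

# A group is a set of tag key/value pairs; the empty set plays the role of {}.
Group = FrozenSet[Tuple[str, str]]

Scalar = float                      # just a number
Series = Dict[int, float]           # unix timestamp -> value


@dataclass
class Result:
    """One entry of a Results value: a group plus a number or a series."""
    group: Group
    value: Union[float, Series]


def group_of(tags: str) -> Group:
    """Parse "k1=v1,k2=v2" into a group; "" gives the empty group {}."""
    return frozenset(tuple(kv.split("=", 1)) for kv in tags.split(",") if kv)


# A seriesSet with two groups ...
series_set: List[Result] = [
    Result(group_of("host=a"), {100: 3.14, 200: 3.28}),
    Result(group_of("host=b"), {100: 1.0, 200: 3.0}),
]
# ... and the numberSet obtained by averaging each group on its own.
number_set = [Result(r.group, sum(r.value.values()) / len(r.value)) for r in series_set]
```

Reducing each series to one number while keeping its group is what turns a seriesSet into a numberSet in this model.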

Queries

For OpenTSDB, Bosun provides several query functions.

q

q(query string, startDuration string, endDuration string) seriesSet

This is the most commonly used query function, and most alerts are built on it. It is simple: query is an OpenTSDB query string, and startDuration and endDuration are the start and end of the query window, expressed as durations before now. For example, q("sum:rate{counter}:sys.cpu.user", "5m", "1m") returns sum:rate of the sys.cpu.user metric from 5 minutes ago to 1 minute ago. Because metric collection lags, endDuration is generally best set to at least 1m ago.

#  Example  
q("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "1m")

group	result	computations
{ }	
{
  "1620196950": 2.016666666666667,
  "1620196980": 2.3666666666666667,
  "1620197010": 1.0999999999999999,
  "1620197040": 1.8333333333333335,
  "1620197070": 2.7333333333333343,
  "1620197100": 2.5,
  "1620197130": 1.7000000000000002,
  "1620197160": 0.9666666666666666
}

bandQuery/overQuery

bandQuery(query string, duration string, period string, eduration string, num scalar) seriesSet
band(query string, duration string, period string, num scalar) seriesSet

  • bandQuery runs the query num times. Each run covers duration, successive runs are period apart, and the most recent run ends eduration ago.
  • band is a special form of bandQuery with eduration = period. For example, band("avg:os.cpu", "1h", "1d", 3) is equivalent to running the three queries q("avg:os.cpu", "25h", "1d"), q("avg:os.cpu", "49h", "2d") and q("avg:os.cpu", "73h", "3d") and concatenating the results. Because eduration = period, the most recent window is (period+duration, period).
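The window arithmetic described above can be sketched in Python. This is an illustrative model, not Bosun's code: the helper names seconds, band_query_windows and band_windows are hypothetical, and the duration parser only handles a single number followed by s, m, h, d or w.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def seconds(duration: str) -> int:
    """Convert a simple duration such as "5m" or "1d" to seconds ("" means 0)."""
    if duration == "":
        return 0
    match = re.fullmatch(r"(\d+)([smhdw])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def band_query_windows(duration, period, eduration, num):
    """Return (start, end) offsets in seconds before now, newest window first."""
    d, p, e = seconds(duration), seconds(period), seconds(eduration)
    return [(e + i * p + d, e + i * p) for i in range(num)]


def band_windows(duration, period, num):
    """band is bandQuery with eduration set to period."""
    return band_query_windows(duration, period, period, num)


# band("avg:os.cpu", "1h", "1d", 3) -> windows (25h, 1d), (49h, 2d), (73h, 3d)
hours = [(start // 3600, end // 3600) for start, end in band_windows("1h", "1d", 3)]
```

Each pair corresponds to the startDuration and endDuration of one q call, so hours holds the three windows (25, 24), (49, 48) and (73, 72) from the band example.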

overQuery(query string, duration string, period string, eduration string, num scalar) seriesSet
over(query string, duration string, period string, num scalar) seriesSet
shiftBand(query string, duration string, period string, num scalar) seriesSet

  • overQuery is the general form of over and shiftBand. Unlike bandQuery, it shifts each period's timestamps forward so that the periods line up, and it tags each result with its query offset in a shift tag (for example shift=1h1m0s).
  • over and shiftBand are special forms of overQuery: over leaves eduration empty (the current time), and shiftBand sets eduration to period.

# Example: the other functions are special forms of bandQuery and overQuery, so only these two are shown here

> bandQuery("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "60m", "1m", 2)


group	result	computations
{ }	
{
  "1620195120": 69.96666666666665,
  "1620195150": 5.816666666666666,
  "1620195180": 5.766666666666667,
  "1620195210": 4.3,
  "1620195240": 5.7666666666666675,
  "1620195270": 3.7666666666666675,
  "1620195300": 4.4,
  "1620195330": 4.933333333333334,
  "1620195360": 4.033333333333334,
  "1620195390": 1.7000000000000002,
  "1620198720": 69.93333333333334,
  "1620198750": 11.7,
  "1620198780": 1.2999999999999998,
  "1620198810": 1.8500000000000008,
  "1620198840": 2.766666666666667,
  "1620198870": 4.633333333333333,
  "1620198900": 4.833333333333334,
  "1620198930": 2.366666666666667,
  "1620198960": 2.366666666666667,
  "1620198990": 2.2666666666666666
}



> overQuery("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "60m", "1m", 2)

group	result	computations
{ shift=1m0s }	
{
  "1620198780": 69.93333333333334,
  "1620198810": 11.7,
  "1620198840": 1.2999999999999998,
  "1620198870": 1.8500000000000008,
  "1620198900": 2.766666666666667,
  "1620198930": 4.633333333333333,
  "1620198960": 4.833333333333334,
  "1620198990": 2.366666666666667,
  "1620199020": 2.366666666666667,
  "1620199050": 2.2666666666666666
}
{ shift=1h1m0s }	
{
  "1620198780": 69.96666666666665,
  "1620198810": 5.816666666666666,
  "1620198840": 5.766666666666667,
  "1620198870": 4.3,
  "1620198900": 5.7666666666666675,
  "1620198930": 3.7666666666666675,
  "1620198960": 4.4,
  "1620198990": 4.933333333333334,
  "1620199020": 4.033333333333334,
  "1620199050": 1.7000000000000002
}

bandQuery and overQuery are very useful for comparing a metric over the same time window in each period (for example, this time of day on previous days). Interestingly, bandQuery does not produce unjoined groups; the tips below explain this further.
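The relation between the two sample outputs above can be modelled in a few lines of Python: overQuery takes the same raw windows as bandQuery, moves each one forward by its offset, and keeps it as a separate group. This is a sketch inferred from the sample output; the function names are hypothetical, and the shift tag is kept as a number of seconds rather than a string such as 1h1m0s.

```python
def shift_to_recent(series, shift_seconds):
    """Move every timestamp forward by shift_seconds so that periods line up."""
    return {ts + shift_seconds: value for ts, value in series.items()}


def over_query_results(windows, eduration_s, period_s):
    """windows[i] is the raw series of the i-th period, newest first.

    Returns {shift_in_seconds: shifted_series}, one group per period.
    """
    results = {}
    for i, series in enumerate(windows):
        shift = eduration_s + i * period_s      # value of the "shift" tag
        results[shift] = shift_to_recent(series, shift)
    return results


# First two points of each window from the bandQuery sample output above.
newest = {1620198720: 69.93333333333334, 1620198750: 11.7}
older = {1620195120: 69.96666666666665, 1620195150: 5.816666666666666}
shifted = over_query_results([newest, older], eduration_s=60, period_s=3600)
```

After shifting, both groups share the timestamps 1620198780 and 1620198810, which is what makes the periods directly comparable on one chart.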

window

window(query string, duration string, period string, num scalar, funcName string) seriesSet

Compared with bandQuery and overQuery, window is more useful for queries meant for display. window reduces the result of each query with the reduction function named by funcName, and the reduced values, with their timestamps, form a new time series. For example, to see the number of requests per hour over the past 6 hours, you can use:

> window("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "60m", "60m", 6, "sum")

group	result	computations
{ }	
{
  "1620175620": 356260.0166666666,
  "1620179220": 370473.99999999965,
  "1620182820": 391460.0166666665,
  "1620186420": 405893.36666666664,
  "1620190020": 364280.9166666666,
  "1620193620": 380179.3833333336
}

Combined with Grafana, this can be drawn as a line chart or a bar chart.
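The reduction step that window performs can be sketched as follows. This is a simplified Python model, not Bosun's implementation: REDUCERS lists only a few reductions for illustration, and which timestamp Bosun attaches to each reduced point is treated here as an input (stamp) rather than asserted.

```python
REDUCERS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
}


def window(windows, func_name):
    """windows: list of (stamp, {timestamp: value}) pairs, one per query.

    Each query result is reduced to a single number, and the numbers are
    collected into a new series keyed by the stamp chosen for each window.
    """
    reduce = REDUCERS[func_name]
    return {stamp: reduce(list(series.values())) for stamp, series in windows}


hourly = window(
    [(3600, {0: 1.0, 1800: 2.0}), (7200, {3600: 4.0, 5400: 0.5})],
    "sum",
)
```

Here hourly contains one point per window, so six one-hour windows would give the six points shown in the sample output.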

count/change

count returns the number of groups in the Results returned by a query, and change returns how much a metric changed over the query window. For example, change("avg:rate:net.bytes", "60m", "") = avg(q("avg:rate:net.bytes", "60m", "")) * 60 * 60
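The identity for change is just "average rate times window length in seconds". A minimal Python sketch, assuming the series holds per-second rates and using hypothetical helper names:

```python
def avg(series):
    """Mean of the values in a {timestamp: value} series."""
    return sum(series.values()) / len(series)


def change(rate_series, start_seconds, end_seconds=0):
    """Total change over the window: average per-second rate * window length."""
    return avg(rate_series) * (start_seconds - end_seconds)


# An average rate of 3 bytes/s over 60 minutes is 3 * 60 * 60 bytes in total.
total_bytes = change({0: 2.0, 1800: 4.0}, start_seconds=3600)
```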

Calculation

How Bosun performs calculations is probably the most confusing part. To understand it, first grasp a few key points, building on the concepts from the first section:

  1. Most queries return a Results value, that is, a set of SeriesSets or NumberSets. For example, a query such as avg:rate:net.bytes{host=*} automatically produces a SeriesSet with one group per host. (If you only want to filter without grouping, write it as avg:rate:net.bytes{}{host=1.2.3.4}.)
  2. Most functions in the Bosun documentation are described for a SeriesSet with a single group. When a function is applied to a query result, it is applied to each group separately. For example, in avg(q("avg:rate:net.bytes{host=*}", "60m", "")) the query returns {host=a}, {host=b} and so on, and avg is applied to each group on its own.
  3. When two Results are combined by an operator such as +, the operator is applied to each pair of groups, one from each side. Not every pair can be combined, though: only groups that are equal, or where one is a subset of the other. A group that takes part in no calculation produces an unjoined group. This is somewhat abstract, so the examples below should help. Try to guess each result before looking at it to check your understanding.
# How an operator combines two Results
for g1 in Result1:
    for g2 in Result2:
        if g1 == g2 or g1 is a subset of g2 or g2 is a subset of g1:
            apply the operator to (g1, g2)

for g1 in Result1:
    if g1 took part in no calculation:
        emit an unjoined group for g1
for g2 in Result2:
    if g2 took part in no calculation:
        emit an unjoined group for g2

Example 1

$a = series("X=a1,Y=b1", 100, 1, 200, 2)
$b = series("X=a2,Y=b2", 100, 2, 200, 3)


$x = series("X=a1", 100, 2, 200, 1)
$y = series("X=a1,Y=b2", 100, 3, 200, 5)
$z = series("X=a2,Y=b2", 100, 3, 200, 2)

# {X=a1,Y=b1} {X=a2,Y=b2}
$ab = merge($a, $b)

# {X=a1} {X=a1,Y=b2} {X=a2,Y=b2}
$xyz = merge($x, $y, $z)


# The pairs that can be combined here are ({X=a1,Y=b1}, {X=a1}) and ({X=a2,Y=b2}, {X=a2,Y=b2}). {X=a1,Y=b2} joins nothing, so it produces an unjoined group
$ab+$xyz


-----------------------------------
group	result	computations
{ X=a1, Y=b1 }	
{
  "100": 3,
  "200": 3
}
{ X=a2, Y=b2 }	
{
  "100": 5,
  "200": 5
}
{ X=a1, Y=b2 }	
{
  "100": "NaN",
  "200": "NaN"
}
merge(series("X=a1,Y=b1", 100, 1, 200, 2), series("X=a2,Y=b2", 100, 2, 200, 3)) + merge(series("X=a1", 100, 2, 200, 1), series("X=a1,Y=b2", 100, 3, 200, 5), series("X=a2,Y=b2", 100, 3, 200, 2))	unjoined group (NaN)
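The join rule can be checked against Example 1 with a small Python model. It follows the pseudocode in this article rather than Bosun's source, so treat it as a sketch: parse_group and add_results are hypothetical names, and keeping the more specific group as the result group is inferred from the output above.

```python
from math import nan, isnan


def parse_group(tags):
    """Parse "k1=v1,k2=v2" into a frozenset of (key, value) pairs."""
    return frozenset(tuple(kv.split("=", 1)) for kv in tags.split(",") if kv)


def add_results(left, right):
    """left/right: lists of (group, series). Returns a list of (group, series)."""
    out, used_left, used_right = [], set(), set()
    for i, (g1, s1) in enumerate(left):
        for j, (g2, s2) in enumerate(right):
            if g1 <= g2 or g2 <= g1:            # equal, or one is a subset
                group = g1 if len(g1) >= len(g2) else g2   # more specific group
                out.append((group, {ts: s1[ts] + s2[ts] for ts in s1 if ts in s2}))
                used_left.add(i)
                used_right.add(j)
    # Anything that never joined becomes an unjoined group with NaN values.
    for i, (g, s) in enumerate(left):
        if i not in used_left:
            out.append((g, {ts: nan for ts in s}))
    for j, (g, s) in enumerate(right):
        if j not in used_right:
            out.append((g, {ts: nan for ts in s}))
    return out


ab = [(parse_group("X=a1,Y=b1"), {100: 1, 200: 2}),
      (parse_group("X=a2,Y=b2"), {100: 2, 200: 3})]
xyz = [(parse_group("X=a1"), {100: 2, 200: 1}),
       (parse_group("X=a1,Y=b2"), {100: 3, 200: 5}),
       (parse_group("X=a2,Y=b2"), {100: 3, 200: 2})]
result = add_results(ab, xyz)
```

result holds the same three groups as the table above: two joined groups with summed values, followed by {X=a1,Y=b2} with NaN values.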

Example 2

$a = series("Y=b2", 100, 1, 200, 1)
$b = series("X=a1,Y=b1", 100, 3, 200, 5)
$c = series("X=a2,Y=b2", 100, 3, 200, 2)


$x = series("X=a2", 100, 2, 200, 1)
$y = series("X=a1,Y=b2", 100, 3, 200, 5)
$z = series("X=a2,Y=b2", 100, 3, 200, 2)

# {X=a1,Y=b1} {X=a2,Y=b2} {Y=b2}
$abc = merge($b, $c, $a)

# {X=a2,Y=b2} {X=a1,Y=b2} {X=a2}. {X=a2} is placed last here because an error is reported if the first combination cannot be calculated
$xyz = merge($z, $y, $x)

$abc + $xyz 


-----------------------------------
group	result	computations
{ X=a2, Y=b2 }	
{
  "100": 6,
  "200": 4
}
{ X=a2, Y=b2 }	
{
  "100": 4,
  "200": 3
}
{ X=a1, Y=b2 }	
{
  "100": 4,
  "200": 6
}
{ X=a2, Y=b2 }	
{
  "100": 5,
  "200": 3
}
{ X=a1, Y=b1 }		
merge(series("X=a2,Y=b2", 100, 3, 200, 2), series("X=a1,Y=b2", 100, 3, 200, 5), series("X=a2", 100, 2, 200, 1)) + merge(series("X=a1,Y=b1", 100, 3, 200, 5), series("X=a2,Y=b2", 100, 3, 200, 2), series("Y=b2", 100, 1, 200, 1))

More examples

$aa=series("tagA=a", 0, 2, 60, 2)
$ab=series("tagA=a,tagB=b", 0, 2, 60, 1)
$ac=series("tagA=a,tagC=c", 0, 2, 60, 3)
$bb=series("tagB=b", 0, 2, 60, 2)
$cc=series("tagC=c", 0, 2, 60, 2)

# {tagA=a} {tagB=b} {tagC=c}
$abc=merge($aa,$bb,$cc)
# {tagA=a} {tagA=a,tagB=b}
$aab = merge($aa, $ab)  
# {tagA=a} {tagA=a,tagC=c}
$aac = merge($aa, $ac)

# The pairs that can be combined here are ({tagA=a}, {tagA=a}), ({tagA=a}, {tagA=a,tagC=c}) and ({tagA=a,tagB=b}, {tagA=a})
# $aab+$aac


#  {tagA=a} {tagB=b}
$aabb = merge($aa, $bb)  
#  {tagA=a} {tagC=c}
$aacc = merge($aa, $cc)


# $aacc+$aabb

#  {tagA=a} {tagC=c} + {tagA=a} {tagA=a,tagC=c}
# $aacc+$aac


#  {tagA=a} {tagC=c} +  {tagA=a} {tagB=b} {tagC=c}
# $aacc+$abc

Tips

Avoiding unjoined groups

A common approach is to use the group-related functions. For example, avoid producing groups at query time by using a filter, as in avg(q("sum:rate:metrics.notexist{}{status=500}", "1m", "0m")), or process the tags after the query with functions such as addtags and remove so that the groups become compatible. There is also a neater trick that makes Bosun ignore unjoined groups: query with bandQuery.

For example, to calculate the request error rate:

$key_err = "sum:rate{counter}:${service}.rpc.calledby.error.throughput{method=*}"
$key_succ = "sum:rate{counter}:${service}.rpc.calledby.success.throughput{method=*}"

$err_now = avg(q($key_err, "5m", "1m"))
$succ_now = avg(q($key_succ, "5m", "1m"))

$rate_now = $err_now / ($err_now +$succ_now)
$rate_now

The query above produces a large number of unjoined groups, because the rpc.calledby.error.throughput metric has far fewer tag values than the success metric, yet we still want the results to carry the method grouping tag. With band, the query looks like this:

$key_err = "sum:rate{counter}:${service}.rpc.calledby.error.throughput{method=*}"
$key_succ = "sum:rate{counter}:${service}.rpc.calledby.success.throughput{method=*}"

$err_now = avg(band($key_err, "4m", "1m", 1))
$succ_now = avg(band($key_succ, "4m", "1m", 1))

$rate_now = $err_now / ($err_now +$succ_now)
$rate_now

Queries made with band do not produce unjoined groups: when Results are combined, the step that generates unjoined groups is skipped, so those results are simply dropped.
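The effect on the error-rate calculation can be illustrated with a small Python model in which the join takes an ignore_unjoined flag. This is only a sketch of the behaviour described above: the join function, its flag and the sample numbers are all hypothetical, and the model keys results by group, which is enough for single-tag groups like these.

```python
from math import isnan
from operator import add, truediv


def join(left, right, op, ignore_unjoined=False):
    """Combine two {group: number} maps; groups are frozensets of tag pairs.

    Simplification: results are keyed by group, so this model only suits
    inputs where each joined group is produced once.
    """
    out, joined = {}, set()
    for g1, v1 in left.items():
        for g2, v2 in right.items():
            if g1 <= g2 or g2 <= g1:
                out[max(g1, g2, key=len)] = op(v1, v2)
                joined.add(("left", g1))
                joined.add(("right", g2))
    if not ignore_unjoined:
        for side, values in (("left", left), ("right", right)):
            for g in values:
                if (side, g) not in joined:
                    out[g] = float("nan")
    return out


get = frozenset({("method", "get")})
put = frozenset({("method", "put")})
err = {get: 1.0}                 # no errors were ever recorded for "put"
succ = {get: 9.0, put: 5.0}

# err / (err + succ), with and without unjoined groups.
noisy = join(err, join(err, succ, add), truediv)
quiet = join(err, join(err, succ, add, True), truediv, True)
```

noisy contains a NaN entry for the put method, while quiet only contains the get method with an error rate of 0.1.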

Grafana Bosun plugin

The Grafana Bosun plugin has two built-in variables:

  • $ds: the suggested downsampling interval. This variable is very useful: in a query such as q("avg:$ds-avg:os.disk.fs.space_free{disk=*,host=backup}", "$start", ""), it keeps the query efficient when the user selects a large time range.
  • $start: the start time selected by the user.

Using the t function

There are several functions that operate on groups. This section introduces t, which merges the values of multiple groups into the series of a single group so that they can be fed to other calculation functions. For example, to calculate the request-weighted latency of an API over 60 minutes:

$latency=avg(q("avg:${service}.calledby.success.latency.us.pct99{handle_method=*}", "60m", ""))
$count=sum(q("sum:rate{counter,,,diff}:${service}.calledby.success.throughput{handle_method=*}", "60m", ""))
$total=sum(q("sum:rate{counter,,,diff}:${service}.calledby.success.throughput{}", "60m", ""))


sum(t($latency*($count/$total), ""))
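The arithmetic behind this expression is a weighted average: each method's latency is scaled by its share of the total request count, and the shares are then summed. A minimal Python sketch, assuming the ungrouped $total equals the sum of the per-method counts (the function name and sample numbers are hypothetical):

```python
def weighted_latency(latency, count):
    """latency/count: {method: number}. Weight each latency by request share."""
    total = sum(count.values())
    return sum(latency[m] * count[m] / total for m in latency)


# Method "a" serves 3 of every 4 requests, so it dominates the result.
result = weighted_latency({"a": 100.0, "b": 200.0}, {"a": 3, "b": 1})
```

In the Bosun expression, t(..., "") plays the role of collecting the per-method products into one series so that the outer sum can add them up.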


Copyright notice
This article was written by [Wang Lei -ai Foundation]. Please include the original link when reposting. Thank you.
https://yzsam.com/2021/05/20210516143937704L.html