
Bosun query

2022-06-24 15:20:00 【Wang Lei -ai Foundation】

Background

Bosun is an open-source monitoring and alerting system from Stack Exchange; the closest comparable tool is Prometheus's Alertmanager. Bosun was designed as an alerting system that can be configured on top of a variety of TSDBs, but it also provides a DSL for querying and evaluating metrics, which makes it a TSDB-agnostic metric query language as well (it currently supports OpenTSDB, Prometheus, InfluxDB, Elasticsearch, and other backends). To understand how Bosun generates alerts, or just to use its query capability together with a monitoring front end such as Grafana, you need to understand this language.

Bosun is not a very popular project: it has about 3.1k stars, and there is little material on it beyond literal translations of the official documentation. This article introduces how to query with Bosun (mainly against an OpenTSDB backend) and shares some query tips.

Concepts

First, some of the types used in Bosun queries:

Scalar: just a number.
NumberSet: essentially the same as a Scalar, but with a group (a set of tags) attached; the empty {} also counts as a group.
SeriesSet: the most common format for raw metrics. Unlike a NumberSet, its value is not a single number but a set of timestamped values, for example 3.14 at time 100 and 3.28 at time 200.
Results: not a concept introduced in the documentation, but the most common type in practice. It represents the typical result of a query: a collection of SeriesSets or NumberSets with different tags. The documentation calls each distinct combination of tags a group.
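
As a rough mental model only, the four types above can be sketched as plain Python data classes. The class and field names here are illustrative assumptions that follow this article's simplified description; they are not Bosun's internal Go types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A group is a set of tag key/value pairs; {} is the empty group.
Tags = Dict[str, str]


@dataclass
class NumberSet:
    group: Tags
    value: float  # one number per group


@dataclass
class SeriesSet:
    group: Tags
    points: Dict[int, float]  # Unix timestamp -> value


@dataclass
class Results:
    # What a query typically returns: one entry per group.
    items: List[Union[NumberSet, SeriesSet]] = field(default_factory=list)


scalar = 3.14  # a Scalar is just a number
number = NumberSet(group={}, value=3.14)
series = SeriesSet(group={"host": "a"}, points={100: 3.14, 200: 3.28})
results = Results(items=[series, SeriesSet({"host": "b"}, {100: 1.0})])
```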

Queries

For OpenTSDB, Bosun provides several query functions.

q

q(query string, startDuration string, endDuration string) seriesSet

This is the most commonly used query function, and most alerts are built on it. It is simple: query is an OpenTSDB query string, and startDuration and endDuration give the start and end of the query range as offsets back from the current time. For example, q("sum:rate{counter}:sys.cpu.user", "5m", "1m") queries sum:rate of the sys.cpu.user metric from 5 minutes ago to 1 minute ago. Because metric collection is delayed, it is generally recommended to set endDuration to at least 1m.

#  Example  
q("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "1m")

group	result	computations
{ }	
{
  "1620196950": 2.016666666666667,
  "1620196980": 2.3666666666666667,
  "1620197010": 1.0999999999999999,
  "1620197040": 1.8333333333333335,
  "1620197070": 2.7333333333333343,
  "1620197100": 2.5,
  "1620197130": 1.7000000000000002,
  "1620197160": 0.9666666666666666
}
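
To make the relative time range concrete, here is a minimal sketch of how the two duration arguments could be turned into absolute timestamps. The helper names, the supported unit letters, and the value of `now` are assumptions for illustration; `now` was chosen so that the window lines up with the sample output above.

```python
import re

# Seconds per unit; only a subset of duration units is handled here.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> int:
    """Convert a duration such as "5m" or "1d" to seconds; "" means now (0)."""
    if text == "":
        return 0
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def query_window(now: int, start_duration: str, end_duration: str) -> tuple:
    """Return the absolute (start, end) Unix timestamps of a q() call."""
    return now - parse_duration(start_duration), now - parse_duration(end_duration)


# Hypothetical evaluation time for q(..., "5m", "1m").
start, end = query_window(1620197220, "5m", "1m")
```

With this `now`, the window is 1620196920 to 1620197160, which covers every timestamp in the result above.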

bandQuery/overQuery

bandQuery(query string, duration string, period string, eduration string, num scalar) seriesSet
band(query string, duration string, period string, num scalar) seriesSet

bandQuery runs the query num times; the time range of each run is determined by duration and period.
band is a special case of bandQuery, equivalent to setting eduration = period. For example, band("avg:os.cpu", "1h", "1d", 3) is equivalent to running the three queries q("avg:os.cpu", "25h", "1d"), q("avg:os.cpu", "49h", "2d"), and q("avg:os.cpu", "73h", "3d"). Because eduration = period, the most recent window runs from period + duration ago to period ago.
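
The window arithmetic can be sketched as follows. This is a model inferred from the equivalences and sample outputs in this article, not Bosun's actual implementation; `band_windows` is a hypothetical helper, and all values are offsets in seconds back from the current time.

```python
def band_windows(duration: int, period: int, eduration: int, num: int) -> list:
    """Return (start_ago, end_ago) pairs in seconds, most recent window first."""
    windows = []
    for i in range(num):
        end_ago = eduration + i * period  # each window ends one period earlier
        windows.append((end_ago + duration, end_ago))
    return windows


MINUTE, HOUR, DAY = 60, 3600, 86400

# band("avg:os.cpu", "1h", "1d", 3), modeled as eduration = period
band = band_windows(HOUR, DAY, DAY, 3)

# bandQuery(..., "5m", "60m", "1m", 2)
band_query = band_windows(5 * MINUTE, 60 * MINUTE, 1 * MINUTE, 2)
```

Under this model, `band` yields the windows (25h, 1d), (49h, 2d), and (73h, 3d), and `band_query` yields (6m, 1m) and (66m, 61m).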

overQuery(query string, duration string, period string, eduration string, num scalar) seriesSet
over(query string, duration string, period string, num scalar) seriesSet
shiftBand(query string, duration string, period string, num scalar) seriesSet

overQuery is the general form of over and shiftBand. Unlike bandQuery, each period's result is tagged with the offset by which it was shifted (the shift tag in the example output below).
over and shiftBand are just special cases of overQuery: they correspond to setting overQuery's eduration to period or to the current time (that is, leaving it empty).

#  Example: the other functions are just special cases of bandQuery and overQuery, so only these two queries are shown here

> bandQuery("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "60m", "1m", 2)


group	result	computations
{ }	
{
  "1620195120": 69.96666666666665,
  "1620195150": 5.816666666666666,
  "1620195180": 5.766666666666667,
  "1620195210": 4.3,
  "1620195240": 5.7666666666666675,
  "1620195270": 3.7666666666666675,
  "1620195300": 4.4,
  "1620195330": 4.933333333333334,
  "1620195360": 4.033333333333334,
  "1620195390": 1.7000000000000002,
  "1620198720": 69.93333333333334,
  "1620198750": 11.7,
  "1620198780": 1.2999999999999998,
  "1620198810": 1.8500000000000008,
  "1620198840": 2.766666666666667,
  "1620198870": 4.633333333333333,
  "1620198900": 4.833333333333334,
  "1620198930": 2.366666666666667,
  "1620198960": 2.366666666666667,
  "1620198990": 2.2666666666666666
}



> overQuery("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "5m", "60m", "1m", 2)

group	result	computations
{ shift=1m0s }	
{
  "1620198780": 69.93333333333334,
  "1620198810": 11.7,
  "1620198840": 1.2999999999999998,
  "1620198870": 1.8500000000000008,
  "1620198900": 2.766666666666667,
  "1620198930": 4.633333333333333,
  "1620198960": 4.833333333333334,
  "1620198990": 2.366666666666667,
  "1620199020": 2.366666666666667,
  "1620199050": 2.2666666666666666
}
{ shift=1h1m0s }	
{
  "1620198780": 69.96666666666665,
  "1620198810": 5.816666666666666,
  "1620198840": 5.766666666666667,
  "1620198870": 4.3,
  "1620198900": 5.7666666666666675,
  "1620198930": 3.7666666666666675,
  "1620198960": 4.4,
  "1620198990": 4.933333333333334,
  "1620199020": 4.033333333333334,
  "1620199050": 1.7000000000000002
}
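
The difference between the two outputs above is only a time shift plus a tag. The sketch below reproduces that step on two points from the bandQuery output; `go_duration` and `shift_series` are hypothetical helpers, and the tag format simply imitates the strings shown above.

```python
def go_duration(seconds: int) -> str:
    """Format seconds like the shift tags above, e.g. 3660 -> "1h1m0s"."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}h{minutes}m{secs}s"
    if minutes:
        return f"{minutes}m{secs}s"
    return f"{secs}s"


def shift_series(points: dict, shift_seconds: int) -> tuple:
    """Move every timestamp forward by the offset and tag the series with it."""
    tags = {"shift": go_duration(shift_seconds)}
    shifted = {ts + shift_seconds: value for ts, value in points.items()}
    return tags, shifted


# First two points of the older period in the bandQuery output.
older = {1620195120: 69.96666666666665, 1620195150: 5.816666666666666}
tags, shifted = shift_series(older, 3660)
```

The shifted timestamps are 1620198780 and 1620198810, matching the first two points of the shift=1h1m0s group in the overQuery output.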

bandQuery and overQuery are very useful for looking at a metric over the same time window of each period (for example, the same time every day). Interestingly, bandQuery does not produce unjoined groups; this is explained further in the tips below.

window

window(query string, duration string, period string, num scalar, funcName string) seriesSet

Compared with bandQuery and overQuery, window is more useful for display-oriented queries. The result of each query is reduced with the function named by funcName, and the reduced values and their timestamps form a new time series. For example, to see the number of requests in each of the past 6 hours, you can use:

> window("sum:rate{counter}:${service}.rpc.calledby.success.throughput", "60m", "60m", 6, "sum")

group	result	computations
{ }	
{
  "1620175620": 356260.0166666666,
  "1620179220": 370473.99999999965,
  "1620182820": 391460.0166666665,
  "1620186420": 405893.36666666664,
  "1620190020": 364280.9166666666,
  "1620193620": 380179.3833333336
}

Together with Grafana, this can be plotted as a line chart or a bar chart.
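
A minimal in-memory sketch of the reduction that window performs is shown below. It works on a plain dict instead of a TSDB query, and two details are assumptions: that the most recent window ends at the current time, and that each reduced value is stamped with the start of its window.

```python
def window(points: dict, now: int, duration: int, period: int, num: int, reducer) -> dict:
    """Reduce each of `num` windows to one value; returns {timestamp: value}."""
    out = {}
    for i in range(num):
        end = now - i * period
        start = end - duration
        values = [v for ts, v in points.items() if start <= ts < end]
        if values:
            out[start] = reducer(values)  # assumption: stamp with window start
    return dict(sorted(out.items()))


# Hypothetical samples, two windows of 60 seconds each.
samples = {0: 1, 10: 2, 60: 3, 70: 5, 120: 7}
reduced = window(samples, now=120, duration=60, period=60, num=2, reducer=sum)
```

Here `reduced` contains one point per window: the sum of the samples in [0, 60) and the sum of the samples in [60, 120).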

count/change

count returns the length of the Results the query returns, and change represents the amount of change over the query range: change("avg:rate:net.bytes", "60m", "") = avg(q("avg:rate:net.bytes", "60m", "")) * 60 * 60
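
The identity above amounts to "average rate times window length in seconds". A small sketch with hypothetical per-second rate samples:

```python
def count(results: list) -> int:
    """Number of groups in a query result."""
    return len(results)


def change(values: list, duration_seconds: int) -> float:
    """Average per-second rate multiplied by the window length in seconds."""
    return sum(values) / len(values) * duration_seconds


rates = [100.0, 200.0, 300.0]  # hypothetical bytes-per-second samples
total = change(rates, 60 * 60)  # a "60m" window
```

An average rate of 200 bytes per second over 3600 seconds corresponds to a change of 720000 bytes.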

Calculation

How Bosun performs calculations is probably the most confusing part. To understand it, keep the following core points in mind, together with the concepts from the first section:

Most queries return a set of SeriesSets or NumberSets, that is, a Results. For example, a query such as avg:rate:net.bytes{host=*} automatically produces one SeriesSet per group. (If you only want to filter without grouping, write it as avg:rate:net.bytes{}{host=1.2.3.4}.)
Most functions in the Bosun documentation operate on the SeriesSet of a single group, so when a function is applied to a query result, it is applied to each group separately. For example, the query inside avg(q("avg:rate:net.bytes{host=*}", "60m", "")) returns {host=a}, {host=b}, and so on, and avg is applied to each of these groups on its own.
When two Results are combined with an operator such as +, the operator is applied to pairs of groups, one from each side. Not every pair can be combined: only groups that are equal, or where one is a subset of the other, are computed together. A group that takes part in no computation becomes an unjoined group. This is somewhat abstract, so the examples below should help. Try to guess each result before looking at it to check your understanding.

#  How two Results are combined
for g1 in Result1:
    for g2 in Result2:
        if g1 == g2 || g1 is subset of g2 || g2 is subset of g1:
            compute(g1, g2)

for g1 in Result1:
    if g1 took part in no computation:
        emit an unjoined group
for g2 in Result2:
    if g2 took part in no computation:
        emit an unjoined group
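
The two loops above can be turned into a small runnable model. This is a sketch of the rule as described in this article, not Bosun's implementation; the function names are made up, and keeping the larger of the two groups as the result group is an assumption based on the outputs shown in the examples. The sample data mirrors Example 1.

```python
def is_subset(small: dict, big: dict) -> bool:
    """True if every tag in `small` appears in `big` with the same value."""
    return all(big.get(k) == v for k, v in small.items())


def add_results(left: list, right: list) -> tuple:
    """left/right are lists of (group, points). Returns (joined, unjoined groups)."""
    joined, used_left, used_right = [], set(), set()
    for i, (g1, s1) in enumerate(left):
        for j, (g2, s2) in enumerate(right):
            # Equal groups pass this check too, since each is a subset of the other.
            if is_subset(g1, g2) or is_subset(g2, g1):
                group = g1 if len(g1) >= len(g2) else g2  # keep the larger group
                points = {ts: s1[ts] + s2[ts] for ts in s1 if ts in s2}
                joined.append((group, points))
                used_left.add(i)
                used_right.add(j)
    unjoined = [g for i, (g, _) in enumerate(left) if i not in used_left]
    unjoined += [g for j, (g, _) in enumerate(right) if j not in used_right]
    return joined, unjoined


ab = [({"X": "a1", "Y": "b1"}, {100: 1, 200: 2}),
      ({"X": "a2", "Y": "b2"}, {100: 2, 200: 3})]
xyz = [({"X": "a1"}, {100: 2, 200: 1}),
       ({"X": "a1", "Y": "b2"}, {100: 3, 200: 5}),
       ({"X": "a2", "Y": "b2"}, {100: 3, 200: 2})]
joined, unjoined = add_results(ab, xyz)
```

For this data, `joined` holds the sums for {X=a1,Y=b1} and {X=a2,Y=b2}, and `unjoined` holds {X=a1,Y=b2}.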

Example 1

$a = series("X=a1,Y=b1", 100, 1, 200, 2)
$b = series("X=a2,Y=b2", 100, 2, 200, 3)


$x = series("X=a1", 100, 2, 200, 1)
$y = series("X=a1,Y=b2", 100, 3, 200, 5)
$z = series("X=a2,Y=b2", 100, 3, 200, 2)

# {X=a1,Y=b1} {X=a2,Y=b2}
$ab = merge($a, $b)

# {X=a1} {X=a1,Y=b2} {X=a2,Y=b2}
$xyz = merge($x, $y, $z)


#  The pairs that can be computed here are ({X=a1,Y=b1}, {X=a1}) and ({X=a2,Y=b2}, {X=a2,Y=b2}). {X=a1,Y=b2} takes part in no computation, so it produces an unjoined group
$ab+$xyz


-----------------------------------
group	result	computations
{ X=a1, Y=b1 }	
{
  "100": 3,
  "200": 3
}
{ X=a2, Y=b2 }	
{
  "100": 5,
  "200": 5
}
{ X=a1, Y=b2 }	
{
  "100": "NaN",
  "200": "NaN"
}
merge(series("X=a1,Y=b1", 100, 1, 200, 2), series("X=a2,Y=b2", 100, 2, 200, 3)) + merge(series("X=a1", 100, 2, 200, 1), series("X=a1,Y=b2", 100, 3, 200, 5), series("X=a2,Y=b2", 100, 3, 200, 2))	unjoined group (NaN)

Example 2

$a = series("Y=b2", 100, 1, 200, 1)
$b = series("X=a1,Y=b1", 100, 3, 200, 5)
$c = series("X=a2,Y=b2", 100, 3, 200, 2)


$x = series("X=a2", 100, 2, 200, 1)
$y = series("X=a1,Y=b2", 100, 3, 200, 5)
$z = series("X=a2,Y=b2", 100, 3, 200, 2)

# {X=a1,Y=b1} {X=a2,Y=b2} {Y=b2}
$abc = merge($b, $c, $a)

# {X=a2,Y=b2} {X=a1,Y=b2} {X=a2}. {X=a2} is placed last because an error is reported if the first pair cannot be computed
$xyz = merge($z, $y, $x)

$abc + $xyz 


-----------------------------------
group	result	computations
{ X=a2, Y=b2 }	
{
  "100": 6,
  "200": 4
}
{ X=a2, Y=b2 }	
{
  "100": 4,
  "200": 3
}
{ X=a1, Y=b2 }	
{
  "100": 4,
  "200": 6
}
{ X=a2, Y=b2 }	
{
  "100": 5,
  "200": 3
}
{ X=a1, Y=b1 }		
merge(series("X=a2,Y=b2", 100, 3, 200, 2), series("X=a1,Y=b2", 100, 3, 200, 5), series("X=a2", 100, 2, 200, 1)) + merge(series("X=a1,Y=b1", 100, 3, 200, 5), series("X=a2,Y=b2", 100, 3, 200, 2), series("Y=b2", 100, 1, 200, 1))

More examples

$aa=series("tagA=a", 0, 2, 60, 2)
$ab=series("tagA=a,tagB=b", 0, 2, 60, 1)
$ac=series("tagA=a,tagC=c", 0, 2, 60, 3)
$bb=series("tagB=b", 0, 2, 60, 2)
$cc=series("tagC=c", 0, 2, 60, 2)

# {tagA=a} {tagB=b} {tagC=c}
$abc=merge($aa,$bb,$cc)
# {tagA=a} {tagA=a,tagB=b}
$aab = merge($aa, $ab)  
# {tagA=a} {tagA=a,tagC=c}
$aac = merge($aa, $ac)

#  The combinations that can participate in the calculation here are  ({tagA=a}, {tagA=a})  ({tagA=a},{tagA=a,tagC=c}) ({tagA=a,tagB=b},{tagA=a}) 
# $aab+$aac


#  {tagA=a} {tagB=b}
$aabb = merge($aa, $bb)  
#  {tagA=a} {tagC=c}
$aacc = merge($aa, $cc)


# $aacc+$aabb

#  {tagA=a} {tagC=c} + {tagA=a} {tagA=a,tagC=c}
# $aacc+$aac


#  {tagA=a} {tagC=c} +  {tagA=a} {tagB=b} {tagC=c}
# $aacc+$abc

Tips

Avoiding unjoined groups

A common approach is to use the group-related functions. For example, avoid generating groups at query time by using a filter-only query, such as avg(q("sum:rate:metrics.notexist{}{status=500}", "1m", "0m")), or adjust the tags after the query with functions such as addtags and remove so that the groups become compatible. There is another, more ingenious approach that lets unjoined groups be ignored altogether: query with bandQuery.

For example, to calculate the request error rate:

$key_err = "sum:rate{counter}:${service}.rpc.calledby.error.throughput{method=*}"
$key_succ = "sum:rate{counter}:${service}.rpc.calledby.success.throughput{method=*}"

$err_now = avg(q($key_err, "5m", "1m"))
$succ_now = avg(q($key_succ, "5m", "1m"))

$rate_now = $err_now / ($err_now +$succ_now)
$rate_now

This query produces a large number of unjoined groups, because the rpc.calledby.error.throughput metric has far fewer tag combinations than the success metric, yet we still want the result to carry the method grouping tag. The same calculation using band looks like this:

$key_err = "sum:rate{counter}:${service}.rpc.calledby.error.throughput{method=*}"
$key_succ = "sum:rate{counter}:${service}.rpc.calledby.success.throughput{method=*}"

$err_now = avg(band($key_err, "4m", "1m", 1))
$succ_now = avg(band($key_succ, "4m", "1m", 1))

$rate_now = $err_now / ($err_now +$succ_now)
$rate_now

With band, the query produces no unjoined groups: when the Results are combined, the step that generates unjoined groups is skipped, so those groups are simply ignored.
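
The net effect can be pictured as an error rate that is computed only for methods present on both sides. The sketch below models that outcome with plain dicts keyed by method; the method names and numbers are hypothetical, and it says nothing about how Bosun implements the behavior internally.

```python
def error_rate(err: dict, succ: dict) -> dict:
    """Per-method error rate, skipping methods that appear on one side only."""
    rates = {}
    for method in err.keys() & succ.keys():
        total = err[method] + succ[method]
        if total > 0:
            rates[method] = err[method] / total
    return rates


err = {"Get": 1.0}                # only Get has recorded errors
succ = {"Get": 3.0, "List": 5.0}  # List has no matching error group
rates = error_rate(err, succ)
```

Only Get appears in `rates`; List has no matching error group and is dropped instead of showing up as an unjoined group.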

Grafana Bosun plugin

The Grafana Bosun plugin has two built-in variables:

$ds: the suggested downsampling interval. This variable is very useful: in a query such as q("avg:$ds-avg:os.disk.fs.space_free{disk=*,host=backup}", "$start", ""), it keeps the query efficient when the user selects a large time range.
$start: the start time selected by the user.

Using the t function

There are several group functions; this section introduces t. It joins the results of multiple groups into the seriesSet of a single group, so that they can be passed to reduction functions. For example, to calculate the 60-minute weighted latency of an API:

$latency=avg(q("avg:${service}.calledby.success.latency.us.pct99{handle_method=*}", "60m", ""))
$count=sum(q("sum:rate{counter,,,diff}:${service}.calledby.success.throughput{handle_method=*}", "60m", ""))
$total=sum(q("sum:rate{counter,,,diff}:${service}.calledby.success.throughput{}", "60m", ""))


sum(t($latency*($count/$total), ""))
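
The expression above is a weighted average: each method's latency is weighted by its share of the total request count, and the per-method terms are then summed. A small sketch of the same arithmetic with hypothetical numbers, assuming the total equals the sum of the per-method counts:

```python
def weighted_latency(latency: dict, count: dict) -> float:
    """Sum over methods of latency * (count / total), keyed by method name."""
    total = sum(count.values())
    return sum(latency[m] * count[m] / total for m in latency)


latency = {"Get": 100.0, "List": 300.0}  # hypothetical p99 latency per method
count = {"Get": 30.0, "List": 10.0}      # hypothetical request counts
result = weighted_latency(latency, count)
```

Get contributes 100 * 30/40 = 75 and List contributes 300 * 10/40 = 75, so the weighted latency is 150.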