
Cloud Native Monitoring: configuring a self-built Alertmanager for alerting

2022-06-24 17:15:00 Nieweixing

At present, the main monitoring software for Kubernetes is Prometheus. To make it easier to monitor TKE clusters, Tencent Cloud has also launched a Prometheus-based service called Cloud Native Monitoring. Cloud Native Monitoring can monitor our TKE clusters and of course also supports configuring alerts. Its alerting likewise uses Alertmanager, and both self-built and default configurations are supported: if you do not deploy your own Alertmanager, Cloud Native Monitoring deploys one in the background to handle configuration and generate alerts. That default Alertmanager, however, is adapted to Tencent Cloud and currently only supports Tencent Cloud's message channels and webhooks.

But sometimes we need to send alerts to our own chat software, such as Slack, WeChat Work (Enterprise WeChat), or email. For that we need a self-built Alertmanager. This article shows how to configure a self-built Alertmanager in Cloud Native Monitoring so that alerts are delivered to WeChat Work.

1. Deploy alertmanager

First, deploy an Alertmanager in the cluster, then expose it through an internal LoadBalancer-type Service so that the Cloud Native Monitoring instance can call it.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: alertmanager
      qcloud-app: alertmanager
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: alertmanager
        qcloud-app: alertmanager
    spec:
      containers:
      - args:
        - --config.file=/etc/alertmanager/config.yml   # config.yml comes from the alertmanager ConfigMap mounted at /etc/alertmanager
        - --storage.path=/alertmanager/data
        image: prom/alertmanager:v0.15.3
        imagePullPolicy: Always
        name: alertmanager
        resources:
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 256Mi
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/alertmanager
          name: alertcfg
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: qcloudregistrykey
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 511
          name: alertmanager
        name: alertcfg

You also need to deploy the corresponding ConfigMap for Alertmanager. This is where the WeChat Work channel for receiving alert messages is configured. You can look up online how to register a WeChat Work application; the comments below show where to find the application's credentials. Here I registered a personal WeChat Work account to test alert delivery.

apiVersion: v1
data:
  config.yml: |
    global:
      resolve_timeout: 5m

    route:
      group_by: ['alertname']
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

    receivers:
    - name: default-receiver
      wechat_configs:
      - corp_id: 'ww0c31105f29c8'         # WeChat Work "My Company" -> "CorpID" (at the bottom of the page)
        to_user: '@all'                   # '@all' for everyone, or a specific user
        agent_id: '100002'                # WeChat Work "Apps" -> custom app "Prometheus" -> "AgentId"
        api_secret: 'BXllYvWYXBy4HH9itlPzd9T-e2JfWP9E'   # WeChat Work "Apps" -> custom app "Prometheus" -> "Secret"
        send_resolved: true               # also send a notification when the alert is resolved
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: alertmanager
  namespace: monitor

Attached below is the configuration for sending alerts to a 163 mailbox. If you want to receive alerts by email instead, configure this ConfigMap.

apiVersion: v1
data:
  config.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'HYLVOJCTU' # this is the mailbox authorization code, which you can generate in the mailbox settings
      smtp_require_tls: false


    route:
      group_by: ['alertname']
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m

    receivers:
    - name: default-receiver
      email_configs:
      - to: "[email protected]"
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: alertmanager
  namespace: monitor

Next, deploy a Service for Alertmanager to give the Cloud Native Monitoring instance access to it. After the Service is deployed, the Alertmanager access endpoint is 10.0.0.143:9093.

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.cloud.tencent.com/direct-access: "true"
    service.kubernetes.io/loadbalance-id: lb-n1jjuq
    service.kubernetes.io/qcloud-loadbalancer-clusterid: cls-b3mg1p92
    service.kubernetes.io/qcloud-loadbalancer-internal-subnetid: subnet-ktam6hp8
  name: alertmanager
  namespace: monitor
spec:
  clusterIP: 172.16.56.208
  externalTrafficPolicy: Cluster
  ports:
  - name: 9093-9093-tcp
    nodePort: 32552
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    k8s-app: alertmanager
    qcloud-app: alertmanager
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 10.0.0.143

At this point our self-built Alertmanager is fully deployed. Next, let's create the corresponding Cloud Native Monitoring instance.

2. Create a cloud native monitoring instance

In the container service console, click Cloud Native Monitoring to create an instance. Open Advanced Settings, click Add Alertmanager, and enter the access endpoint of the Alertmanager Service you deployed, 10.0.0.143:9093.

Note that if you choose the default Alertmanager when creating the Cloud Native Monitoring instance, switching to a self-built Alertmanager in the console is not yet supported; switching requires submitting a ticket so an engineer can do it for you. It is therefore recommended to choose a self-built Alertmanager at creation time.

After the instance is created, its basic information page shows the configured self-built Alertmanager, the Prometheus endpoints, and other details.

3. Associate the TKE cluster

After the Cloud Native Monitoring instance is created, Prometheus is not yet monitoring any Kubernetes cluster. We need to add our TKE cluster to Cloud Native Monitoring so that data can be collected: under Associated Clusters, simply associate our TKE cluster.

Once the cluster is associated, its information appears on the console, and you can click targets to check whether the collection status is healthy.

We can also open the Prometheus query UI and run a query to check whether monitoring data from the TKE cluster has been collected into Prometheus.

Click data query; if results are returned, Prometheus is successfully collecting the TKE cluster's monitoring data.
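For example, since the alert rule in the next section relies on node-exporter memory metrics, one quick sanity check (a minimal example, not a required step) is to query one of those metrics directly; if each node returns a sample, collection is working:

node_memory_MemTotal_bytes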

4. Configure alerts

Now let's write and configure an alert rule. We will test an alert on node memory utilization, and to make it easier to trigger, the alert fires when a node's memory utilization exceeds 10%. First, write the PromQL for the alert rule in the Prometheus UI.

100 - (node_memory_MemFree_bytes{endpoint!="target"} + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
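To see what the expression computes, here is a quick check with made-up numbers:

MemTotal = 16 GiB, MemFree = 2 GiB, Cached = 4 GiB, Buffers = 2 GiB
utilization = 100 - (2 + 4 + 2) / 16 * 100 = 100 - 50 = 50   -> 50 > 10, so the rule fires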

The query above shows the nodes whose memory utilization is greater than 10%. Next, go to the alert configuration page of the Cloud Native Monitoring console and configure the alert with the following fields:

  • Rule name: the name of the alert rule, no more than 40 characters.
  • PromQL: the alert rule expression.
  • Duration: how long the condition described by the expression must hold; the alert fires once this duration is reached.
  • Label: Prometheus labels added to each rule.
  • Alert content: the body of the notification sent by email or SMS after the alert fires; the content configured here is shown below.
Node {{$labels.instance}} in {{$labels.cluster}} has exceeded the memory alarm threshold of 10%; current memory usage is {{$value}}. Please handle it promptly!
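For reference, here is a rough sketch of the same rule expressed in standard Prometheus rule-file format. The console stores rules in its own format; the group and alert names and the notification label are taken from the alert output in the next section, while the duration (for:) is an assumption.

groups:
- name: node.rules
  rules:
  - alert: NodeMemoryUsage
    expr: 100 - (node_memory_MemFree_bytes{endpoint!="target"} + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 10
    for: 1m                              # assumed duration; use whatever duration you set in the console
    labels:
      notification: alert-7pjasfmm       # taken from the alert output below
    annotations:
      describe: test TKE cluster node memory alert
      content: 'Node {{ $labels.instance }} in {{ $labels.cluster }} has exceeded the memory alarm threshold of 10%; current memory usage is {{ $value }}. Please handle it promptly!'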

5. Viewing the alert in WeChat Work

[FIRING:3] NodeMemoryUsage (node.rules cls-b3mg1p92 tke node-exporter kube-system mem alert-7pjasfmm tke-node-exporter)
node.rules  test TKE cluster node memory alert  alert-7pjasfmm
Alerts Firing:
Labels:
 - alertname = NodeMemoryUsage
 - alertName = node.rules
 - cluster = cls-b3mg1p92
 - cluster_type = tke
 - instance = 10.0.0.10
 - job = node-exporter
 - namespace = kube-system
 - node = mem
 - notification = alert-7pjasfmm
 - pod = tke-node-exporter-xnfvb
 - service = tke-node-exporter
Annotations:
 - alertName = node.rules
 - content = Node 10.0.0.10 in cls-b3mg1p92 has exceeded the memory alarm threshold of 10%; current memory usage is 52.578219305872885. Please handle it promptly!

 - describe = test TKE cluster node memory alert
 - notification = alert-7pjasfmm
Source: /graph?g0.expr=100+-+%28node_memory_MemFree_bytes%7Bendpoint%21%3D%22target%22%7D+%2B+node_memory_Cached_bytes+%2B+node_memory_Buffers_bytes%29+%2F+node_memory_MemTotal_bytes+%2A+100+%3E+10&g0.tab=1
Labels:
 - alertname = NodeMemoryUsage
 - alertName = node.rules
 - cluster = cls-b3mg1p92
 - cluster_type = tke
 - instance = 10.0.0.157
 - job = node-exporter
 - namespace = kube-system
 - node = mem
 - notification = alert-7pjasfmm
 - pod = tke-node-exporter-vcnjl
 - service = tke-node-exporter
Annotations:
 - alertName = node.rules
 - content = Node 10.0.0.157 in cls-b3mg1p92 has exceeded the memory alarm threshold of 10%; current memory usage is 34.298334259939. Please handle it promptly!

 - describe = test TKE cluster node memory alert
 - notification = alert-7pjasfmm
Source: /graph?g0.expr=100+-+%28node_memory_MemFree_bytes%7Bendpoint%21%3D%22target%22%7D+%2B+node_memory_Cached_bytes+%2B+node_memory_Buffers_bytes%29+%2F+node_memory_MemTotal_bytes+%2A+100+%3E+10&g0.tab=1
Labels:
 - alertname = NodeMemoryUsage
 - alertName = node.rules
 - cluster = cls-b3mg1p92
 - cluster_type = tke
 - instance = 10.0.0.3
 - job = node-exporter
 - namespace = kube-system
 - node = mem
 - notification = alert-7pjasfmm
 - pod = tke-node-exporter-vpcmf
 - service = tke-node-exporter
Annotations:
 - alertName = node.rules
 - content = Node 10.0.0.3 in cls-b3mg1p92 has exceeded the memory alarm threshold of 10%; current memory usage is 31.307402547932455. Please handle it promptly!

Open the Prometheus application in WeChat Work and check whether the alert messages arrive. If they do, it shows that delivering alerts to WeChat Work through the self-built Alertmanager works.

6. Receiving alerts by email

Next, we raise the alert threshold to memory utilization above 50% and change the cluster's Alertmanager ConfigMap to the email configuration shown earlier (restart or reload the Alertmanager Pod after updating the ConfigMap so it picks up the new configuration), then see what receiving an alert by email looks like. With this threshold only one node should trigger the alert; let's test it.
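The modified rule expression only changes the threshold at the end:

100 - (node_memory_MemFree_bytes{endpoint!="target"} + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 50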

Judging from the PromQL query results above, only 10.0.0.10 has memory usage above 50%, so our mailbox received an alert email only for node 10.0.0.10. This shows that sending alerts to our mailbox through Alertmanager also works.
