
One way to stop alarm bombing

2022-06-23 03:27:00 fjywan

Background

Monitoring and alarms are the eyes of a service: they are the window through which we observe its health and notice anomalies in time. Anomalies are surfaced as alarms. Whether an alarm arrives by WeChat, email, or SMS, its purpose is to warn that the service "may" have a problem.

By content, alarms fall into two categories:

  • Metric-based alarms
  • Log-based alarms

Metrics: usually aggregated from logs, for example the average latency or the number of 500 responses. An alarm triggered when a metric crosses a threshold is a metric-based alarm.

Logs: the service's stream of behavior, and the most detailed record of it. An alarm triggered when an error-level log appears is a log-based alarm.

From these definitions it is easy to see that log-based alarms are the most likely to turn into alarm bombing. For example:

  1. Something goes wrong at one node of a call chain, which often causes exceptions in every downstream node. The resulting series of error logs leads to alarm bombing.
  2. Log levels are assigned unreasonably. For example, illegal user input is recorded with console.error, which abuses alarms as reminders.
  3. "Anomalies in a normal situation". For example, an online alarm turns out to need no action because the validation in a dependent service has changed, yet we cannot change code and release just to silence one alarm.

The more invalid alarms are mixed in, the harder it is to spot real problems. If this is allowed to spread, alarms eventually stop serving their purpose of surfacing anomalies in time.

Problem analysis

A careful look at the alarms that cause interference shows they fall into these groups:

  1. Alarms that do indicate a service anomaly:
    1. But fire too frequently.
    2. Or concern a problem that is already confirmed and is being fixed and released, so they drown out other alarms.
  2. Alarms that do not indicate a service anomaly. These should be muted and no longer pushed.

Whatever the kind of interfering alarm, the root cause is the same: there is no alarm feedback mechanism.

An alarm system should not only push alarms. It should also know whether developers have handled them.

Only when the alarm system knows how developers handled an alarm (rejected it, accepted it, or ignored it) can it adjust pushing according to that feedback.

The analysis makes the fix clear: to get rid of invalid alarms, add a feedback mechanism to the alarm system.

Solution design

The core of the solution is how to design the push strategy around developer feedback.

Push strategy

For each alarm, a developer has three options:

  1. Ignore
  2. Reject
  3. Accept

The push strategy for each option:

  1. Ignore: if the same alarm goes unhandled (neither rejected nor accepted) three times in a row, stop pushing it for one day.
  2. Reject: stop pushing the same alarm for three days.
  3. Accept: stop pushing the same alarm and create a Bug ticket. Resume the alarm once the Bug ticket's status changes to fixed.
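The three options above can be written down as a small lookup. This is only a sketch: the names Feedback, DAY_SECONDS, and muteSeconds are made up for illustration and do not appear in the article's code. It returns the mute duration in seconds because that is the unit Redis expiry commands such as SETEX expect.

```typescript
type Feedback = 'ignore' | 'reject' | 'accept';

const DAY_SECONDS = 24 * 60 * 60;

// Returns how long (in seconds) the same alarm should be muted,
// or null when the mute has no expiry and is lifted by the Bug ticket instead.
function muteSeconds(feedback: Feedback): number | null {
  switch (feedback) {
    case 'ignore':
      return 1 * DAY_SECONDS; // applies after three unhandled pushes
    case 'reject':
      return 3 * DAY_SECONDS;
    case 'accept':
      return null; // muted until the Bug ticket is fixed
  }
}
```

Keeping the durations in one place makes it harder for the one-day and three-day values to drift apart between the push code and the feedback handler.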

Two points in this strategy need further detail:

  1. How to decide that two alarms are the same, that is, how to compute an identifier for an alarm message.
  2. How alarms are linked to Bug tickets, and how a Bug ticket's status flows.

Alarm identification

An alarm message is usually backed by structured data containing traceid, message, error stack, and so on.

If two alarms have the same message, they mean the same thing and can be treated as the same alarm. The alarm identifier can therefore be taken as the first 100 bytes of message.
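The article calls a getMsgId helper later but does not show it. A minimal sketch of "the first 100 bytes of message" might look like the following, assuming the bytes are counted in UTF-8 and that a multi-byte character should never be cut in half. Both assumptions, and the helper utf8Length, are mine rather than the author's.

```typescript
// Number of bytes a single code point occupies in UTF-8.
function utf8Length(codePoint: number): number {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Keeps whole characters from the start of msg until adding the next one
// would exceed maxBytes UTF-8 bytes.
function getMsgId(msg: string, maxBytes = 100): string {
  let bytes = 0;
  let id = '';
  for (const ch of msg) { // for...of iterates by code point, not by UTF-16 unit
    const size = utf8Length(ch.codePointAt(0) as number);
    if (bytes + size > maxBytes) break;
    bytes += size;
    id += ch;
  }
  return id;
}
```

Messages that embed variable data near the start (ids, timestamps) will produce different identifiers for what is really the same problem, so this scheme works best when the message begins with a stable description.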

Bug tickets and status flow

A Bug ticket should record at least the following attributes:

  • msgid: the alarm message identifier
  • trace: the trace id of the alarm, used to look it up in the log system
  • assign: the person handling the ticket
  • status: the status of the Bug ticket

A Bug ticket's status flows from created, to processing, to done.
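The status flow can be expressed as a tiny state machine. The sketch below reuses the BugStatus values that appear in the handler code later in this article (Created = 1, Processing = 2, Done = 3). The helpers nextStatus and canTransition are hypothetical and only illustrate that a ticket moves forward one step at a time.

```typescript
enum BugStatus {
  Created = 1,
  Processing = 2,
  Done = 3,
}

// Returns the only status a ticket may move to next,
// or null when the ticket is already done.
function nextStatus(current: BugStatus): BugStatus | null {
  if (current === BugStatus.Created) return BugStatus.Processing;
  if (current === BugStatus.Processing) return BugStatus.Done;
  return null;
}

// True when moving from one status to another is a legal single step.
function canTransition(from: BugStatus, to: BugStatus): boolean {
  return nextStatus(from) === to;
}
```

A check like canTransition could guard the status update so that a stale button click cannot move a finished ticket backwards.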

Implementation

The Enterprise WeChat robot is used as the alarm tool (see the developer documentation for how to use the robot).

Implementing push

1. Get the callback address of the Enterprise WeChat robot

This is the Webhook address, which is shown when you create a new robot.

2. Output logs to the robot

Use log4js as the logging library.

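import log4js, { levels } from 'log4js';
import type { LoggingEvent } from 'log4js';

Write a custom appender that forwards logs to the robot:

// Assumption: projectName, ctxStrLimit, and the serialization of the log
// context are not shown in the original; the values here are placeholders.
const projectName = 'my-service';
const ctxStrLimit = 1000;

type Layout = (event: LoggingEvent, timezoneOffset?: number) => string;

function robotAppender(layout: Layout, timezoneOffset?: number) {
  return (loggingEvent: LoggingEvent) => {
    // Only logs at ERROR level or above trigger an alarm
    if (!loggingEvent.level.isGreaterThanOrEqualTo(levels.ERROR)) {
      return;
    }
    const logCtx = loggingEvent.context || {};
    const ctxStr = JSON.stringify(logCtx);
    // Send the alarm through the robot
    sendAlert(`[${loggingEvent.level.levelStr}]${projectName}`, {
      path: logCtx.path || '', // request path
      ctx: ctxStr.slice(0, ctxStrLimit), // log context, truncated
      msg: String(layout(loggingEvent, timezoneOffset)).slice(0, ctxStrLimit), // log content
      trace: logCtx.trace || '', // trace_id
    }).catch((e) => console.error('sendAlert failed', e));
  };
}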

export function wxConfigure(config: any, layouts: any) {
  let layout = layouts.colouredLayout;
  if (config.layout) {
    layout = layouts.layout(config.layout.type, config.layout);
  }
  return robotAppender(layout, config.timezoneOffset);
}

// Register the appenders with log4js
log4js.configure({
  appenders: {
    console: {
      type: 'console',
    },
    // Enterprise WeChat robot notification
    wx: {
      type: { configure: wxConfigure },
      layout: { type: 'basic' },
    },
  },
  categories: {
    default: { appenders: ['console', 'wx'], level: 'debug' },
  },
});

3. Encapsulate the alarm function sendAlert

Apply the push strategy inside the alarm function:

  • For an alarm judged invalid, set a lock in redis to prevent it from being sent again.
  • For every alarm sent, keep a count in redis. Once the same alarm has gone unhandled three times, lock it.

Pay special attention here: the counter keys in redis must have an expiration time, such as 1h or 1d. Log volume is often very large, and without expiration the keys will eventually exhaust redis memory.

import axios from 'axios';

// defaultRedisClient (an ioredis client), env, and url (the robot's Webhook
// address) are defined elsewhere; the redisClient wrapper is omitted here.
async function sendAlert(title: string, data: Record<string, any>, chatid?: string) {
    // Compute the alarm identifier: the first 100 bytes of msg
    const msgId = getMsgId(data.msg);

    // Check for a lock first
    const lockKey = `${msgId}_lock`;
    const lock = await defaultRedisClient.get(lockKey);
    if (lock) {
      console.log('lock exists, skip alert', title, data);
      return;
    }

    // Count
    let rawCounter = await defaultRedisClient.get(msgId);
    // Not sent before: initialize
    if (!rawCounter) {
      rawCounter = '0';
    }
    const counter = parseInt(rawCounter, 10);
    // Already sent 3 times or more: lock it and skip this push
    if (counter > 2) {
      // Remove the counter first. If the removal fails, the alarm simply
      // goes through the count check again next time.
      await defaultRedisClient.del(msgId);
      // Add the lock. setex takes its TTL in seconds, so one day is 24 * 60 * 60.
      // The value must be non-empty, otherwise the lock check above treats it as absent.
      await defaultRedisClient.setex(lockKey, 24 * 60 * 60, data?.trace || '1');
      // Optionally push a notice such as:
      // `Alarm unhandled three times: ${msgId}
      //  Pushing has been stopped and will resume after 24h!`
      return;
    }
    // Otherwise just increment the count; note the expiration time (in seconds)
    await defaultRedisClient.setex(msgId, 24 * 60 * 60, String(counter + 1));

    const copyedData = {
      env,
      ...data,
    };
    let content = `### ${title} \n`;
    Object.keys(copyedData).forEach((key) => {
      content += `> **${key}**: <font color="comment">${copyedData[key]}</font> \n\n\n`;
    });
    const msgObj = {
      chatid,
      msgtype: 'markdown',
      markdown: {
        content,
        // Note: these buttons collect the feedback
        attachments: [{
          callback_id: 'alert_feedback',
          actions: [{
            name: `reject_${data?.trace}`,
            text: 'Reject',
            type: 'button',
            // The value is the message identifier: the first 100 bytes of msg
            value: msgId,
            replace_text: 'Rejected',
            border_color: '2EAB49',
            text_color: '2EAB49',
          },
          {
            name: `accept_${data?.trace}`,
            text: 'Accept',
            type: 'button',
            value: msgId,
            replace_text: 'Accepted',
            border_color: '2EAB49',
            text_color: '2EAB49',
          },
          ],
        },
        ],
      },
    };

    // url is the robot's callback (Webhook) address
    return axios.post(url, msgObj, {
      headers: {
        'Content-Type': 'application/json',
      },
    });
}
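sendAlert reads the counter and then writes it back, so two service instances handling the same alarm at the same moment can lose an increment. A possible variation, sketched below, counts with the atomic Redis INCR command instead. The CounterStore interface, shouldPush, and MemoryStore are all illustrative names; an ioredis client is assumed to satisfy the interface because it exposes get, incr, expire, setex, and del, and MemoryStore only exists so the logic can be tried without a Redis server.

```typescript
// Minimal subset of Redis commands used below.
interface CounterStore {
  get(key: string): Promise<string | null>;
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<unknown>;
  setex(key: string, seconds: number, value: string): Promise<unknown>;
  del(key: string): Promise<unknown>;
}

const DAY_SECONDS = 24 * 60 * 60;

// Returns true when the alarm should be pushed, false when it is muted.
async function shouldPush(store: CounterStore, msgId: string, limit = 3): Promise<boolean> {
  const lockKey = `${msgId}_lock`;
  if (await store.get(lockKey)) return false;

  // INCR is atomic, so concurrent instances never lose an update.
  const count = await store.incr(msgId);
  if (count === 1) {
    // First push in this window: start the expiry clock on the counter.
    await store.expire(msgId, DAY_SECONDS);
  }
  if (count > limit) {
    // Pushed `limit` times without feedback: drop the counter and mute for a day.
    await store.del(msgId);
    await store.setex(lockKey, DAY_SECONDS, '1');
    return false;
  }
  return true;
}

// In-memory stand-in for Redis; expiry times are ignored.
class MemoryStore implements CounterStore {
  private data = new Map<string, string>();
  async get(key: string) { return this.data.get(key) ?? null; }
  async incr(key: string) {
    const next = Number(this.data.get(key) ?? '0') + 1;
    this.data.set(key, String(next));
    return next;
  }
  async expire(_key: string, _seconds: number) { return 1; }
  async setex(key: string, _seconds: number, value: string) { this.data.set(key, value); return 'OK'; }
  async del(key: string) { return this.data.delete(key) ? 1 : 0; }
}
```

With the default limit, the first three calls for the same msgId return true and later calls return false until the lock expires, which matches the counting behavior of sendAlert.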

Pay special attention to the attachments passed to the robot interface: they add feedback buttons to each alarm.

An easily overlooked point is how to set each button's name and value.

From the code above:

{
  name: `accept_${data?.trace}`,
  value: msgId,
},

When the user clicks a button, these two fields are sent back to us unchanged, so use them to carry data:

  • msgid: required for locking, and also a required field of the Bug ticket.
  • trace: the full-link trace id, needed only when creating the Bug ticket, for tracing back into the log system.
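Because the button name packs the action and the trace id into one string, it needs to be unpacked carefully on the way back. The helpers below are a hypothetical sketch (encodeButtonName and parseButtonName are not in the original code) that splits on the first underscore only, so a trace id that itself contains underscores survives the round trip.

```typescript
type Action = 'accept' | 'reject';

// Packs the action and trace id the same way the alarm buttons do.
function encodeButtonName(action: Action, trace: string): string {
  return `${action}_${trace}`;
}

// Splits on the first underscore only; everything after it is the trace id.
function parseButtonName(name: string): { action: string; trace: string } {
  const i = name.indexOf('_');
  if (i === -1) return { action: name, trace: '' };
  return { action: name.slice(0, i), trace: name.slice(i + 1) };
}
```

For example, parseButtonName(encodeButtonName('accept', 'a_b_c')) yields the action 'accept' and the trace 'a_b_c'.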

Handling button clicks

When a developer clicks an alarm button, the push strategy should be adjusted. Specifically, the corresponding message is locked to prevent further pushes.

This requires an HTTP server that correctly handles Enterprise WeChat's verification requests. (That part deserves a separate article.)

The focus here is what happens after the click: when a developer clicks a button, Enterprise WeChat sends an HTTP request to our server. After the request data is decrypted, it looks similar to this:

{
  From: { 
    UserId: 'xxxxxxx', 
    Name: 'fjywan', 
    Alias: 'fjywan' 
  },
  WebhookUrl: 'http://in.qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxx', 
  ChatId: 'xxxx', 
  GetChatInfoUrl: 'http://in.qyapi.weixin.qq.com/cgi-bin/webhook/get_chat_info?code=xxxxx',
  MsgId: 'xxxxx',
  ChatType: 'group', 
  MsgType: 'attachment',
  Attachment: {
    CallbackId: 'alert_feedback',
    Actions: { 
      Name: 'accept_traceidxxx',
      Value: 'msgidxxxx', 
      Type: 'button' 
    } 
  },
  TriggerId: 'xxxx',
}
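The handler code in this article refers to an AttachmentMsg type without showing it. A type inferred purely from the sample payload above might look like this. The field list mirrors that sample only, so real callbacks may carry more or different fields, and the isAttachmentMsg guard is a hypothetical addition for rejecting unexpected payloads early.

```typescript
interface AttachmentMsg {
  From: { UserId: string; Name: string; Alias: string };
  WebhookUrl: string;
  ChatId: string;
  GetChatInfoUrl: string;
  MsgId: string;
  ChatType: string;
  MsgType: 'attachment';
  Attachment: {
    CallbackId: string;
    Actions: { Name: string; Value: string; Type: string };
  };
  TriggerId: string;
}

// Checks only the fields the feedback handlers actually read.
function isAttachmentMsg(x: unknown): x is AttachmentMsg {
  const m = x as Partial<AttachmentMsg> | null;
  return !!m
    && m.MsgType === 'attachment'
    && typeof m.From?.Alias === 'string'
    && typeof m.Attachment?.CallbackId === 'string'
    && typeof m.Attachment?.Actions?.Name === 'string'
    && typeof m.Attachment?.Actions?.Value === 'string';
}
```

The CallbackId field ('alert_feedback' for alarm buttons, 'bug_status_change' for Bug ticket buttons) is what lets the server route a click to the right handler.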

Handle this message as follows:

function getLockKey(msgId: string) {
  return `${msgId}_lock`;
}
enum BugStatus {
  Created = 1,
  Processing = 2,
  Done = 3
}
export async function alertFeedBack(payload: AttachmentMsg) {
  const {
    From: {
      Alias,
    },
    Attachment: {
      Actions: {
        Name,
        Value,
      },
    } } = payload;
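  const lockKey = getLockKey(Value);
  // Name is `${action}_${trace}`. Split off the action only, in case the
  // trace id itself contains underscores.
  const [actualName, ...rest] = Name.split('_');
  const trace = rest.join('_');
  // Remove the counter first, if there is one
  await defaultRedisClient.del(Value);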

  try {
    // The alarm is accepted
    if (actualName === 'accept') {
      // Add a lock with no expiry; it is removed through the Bug ticket.
      // set (rather than setnx) also overwrites an existing lock that would expire.
      await defaultRedisClient.set(lockKey, Name);
      const now = Date.now();
      // Use the prisma ORM to insert a Bug record into MySQL
      await prisma.bug_list.create({
        data: {
          assign: Alias,
          trace,
          msgId: Value,
          status: BugStatus.Created,
          updatedAt: now,
          createdAt: now,
        },
      });
    } else {
      // The alarm is rejected: lock it in redis for 3 days, with no pushes in between.
      // (An alarm pushed three times in a row without feedback is locked for one day by sendAlert.)
      // setex takes its TTL in seconds.
      await defaultRedisClient.setex(lockKey, 3 * 24 * 60 * 60, Name);
    }
  } catch (e) {
    console.error('Failed to handle alarm feedback', e);
  }
}

Recording Bug tickets

Create a table with the following structure to record Bugs and drive the status flow:

CREATE TABLE `bug_list` (     
  `id` INTEGER NOT NULL AUTO_INCREMENT,     
  `msgId` VARCHAR(191) NOT NULL,     
  `trace` VARCHAR(60) NOT NULL,     
  `assign` VARCHAR(30) NOT NULL,     
  `status` TINYINT(2) NOT NULL,     
  `remark` LONGTEXT,     
  `updatedAt` BIGINT(20) NOT NULL,    
  `createdAt` BIGINT(20) NOT NULL,     
   PRIMARY KEY (`id`),     
   unique key (msgId),     
   unique key (trace) 
) 

Querying Bug tickets

When the robot is @-mentioned, it should return the Bug tickets the current user is handling, along with buttons for changing their status.

The @ callback handler:

// Return the current developer's Bug list
export async function buglist(payload: WxMsg) {
  const { From: { Alias }, ChatId } = payload;
  const title = `To: ${Alias}`;
  const result = await prisma.bug_list.findMany({
    where: {
      assign: Alias,
      status: {
        // Only tickets that are not done yet
        in: [BugStatus.Created, BugStatus.Processing],
      },
    },
  });
  if (!result.length) {
    // Reply that there is nothing pending
    sendBack(title, {
      Tips: 'Congratulations, there are no pending Bugs under your name. Keep it up!',
    }, ChatId);
    return;
  }

  // Build the message body for the Bug list
  let content = `### ${title} \n`;
  const attachments = [{
    callback_id: 'bug_status_change',
    actions: [],
  }] as unknown as Attachments;
  result.forEach((one) => {
    content += `> **[Full-link log: ${one.trace}](xxxx)**: <font color="comment">${one.msgId}</font> \n\n\n`;
    // Important: generate a status button for each Bug ticket
    attachments[0].actions.push({
      name: String(one.id),
      text: one.status === 1 ? `${one.id}: Start processing` : `${one.id}: Close ticket`,
      type: 'button',
      // The value is the status the ticket moves to when the button is clicked
      value: one.status === 1 ? '2' : '3',
      replace_text: one.status === 1 ? 'Processing' : 'Done',
      border_color: '2EAB49',
      text_color: '2EAB49',
    });
  });
  sendBack(content, attachments, ChatId);
}

After the robot is @-mentioned, it replies with the user's open Bug tickets, each with a status button.

Bug ticket flow

As with the alarm buttons, clicking a Bug ticket button updates the ticket's status. Once the ticket is done, the redis lock is removed so that the alarm is pushed again.

export async function bugStatusChange(payload: AttachmentMsg) {
  const {
    Attachment: {
      Actions: {
        Name,
        Value,
      },
    } } = payload;
  try {
    // Name carries the ticket id, Value the status to move to
    const theBug = await prisma.bug_list.update({
      data: {
        status: parseInt(Value, 10),
        updatedAt: Date.now(),
      },
      where: {
        id: parseInt(Name, 10),
      },
    });
    // Remove the lock only after the fix is done, as the push strategy requires
    if (theBug.status === BugStatus.Done) {
      await defaultRedisClient.del(getLockKey(theBug.msgId));
    }
  } catch (e) {
    console.error('Failed to update Bug status', e);
  }
}

After a click, the button's text is replaced with the ticket's new status.

Summary

The root cause of invalid alarm flooding is the lack of an alarm feedback mechanism. Using the Enterprise WeChat robot, we closed the loop across alarming, alarm feedback, and Bug tracking and flow.

Technical points:

  1. If an alarm is rejected, or gets no feedback three times, stop pushing the same alarm for a while.
  2. Use the error's message to decide whether two alarms are the same.
  3. Store the "alarm blacklist" in redis so that it works across multiple instances.
  4. A robot can be thought of as a command line, so build it into a friendlier command line for developers.
  5. Metric alarms are usually triggered by thresholds and are often already rate-limited (to cope with fluctuations near the threshold), so they do not need a feedback mechanism.

The runnable code is still being tidied up and will be published on GitHub later.

Further notes

In fact, everything above assumes that a full-link log system exists. It is not enough to raise alarms. An alarm should also lead quickly to the relevant logs so the problem can be located.

Later articles will cover how to build a full-link log system, and the development of Enterprise WeChat robots.

Original site

Copyright notice
This article was written by [fjywan]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/01/202201182352437379.html