One way to tame alert bombing
2022-06-23 03:27:00 【fjywan】
Background
Monitoring and alerting are the eyes of an application: they are the window through which we observe the health of a service and notice anomalies in time. Anomalies reach us as alerts, whether by WeChat, email, or SMS. Whatever the channel, the purpose is to warn that the service 「may」 have a problem.
By content, alerts fall into two categories:
- Metric-based alerts
- Log-based alerts
Metrics: usually aggregated from logs, for example the average latency or the number of 500 responses. An alert triggered when a metric crosses a threshold is a metric-based alert.
Logs: the most detailed record of what a service does. An alert triggered when an error-level log entry is written is a log-based alert.
From these definitions it is easy to see that log-based alerts are the ones most likely to turn into alert bombing. For example:
- When one node on a call chain fails, every downstream node tends to fail too, and the resulting series of error logs causes alert bombing.
- Log levels are assigned unreasonably, for example recording invalid user input with console.error. This abuses alerts as reminders.
- 「Anomalies that are expected」: for example, an alert seen in production needs no action because the validation in a dependent service has changed, yet we cannot change and release code just to silence one alert.
The more invalid alerts are mixed in, the harder it is to spot real problems. Left unchecked, alerting eventually loses its ability to surface anomalies in time.
Problem analysis
A closer look at the alerts that cause interference shows two groups:
- Alerts that do indicate a service anomaly:
  - but fire too frequently;
  - or concern a problem that is already confirmed and is being fixed and released, so they drown out other alerts.
- Alerts that do not indicate a service anomaly. These should be muted and no longer pushed.
Whatever the kind of interfering alert, the root cause is the same: there is no feedback mechanism for alerts.
An alerting system should not only push alerts; it should also know whether a developer has handled them.
Only when the system knows how developers responded (rejected, accepted, or ignored) can it adjust what it pushes.
The analysis makes the fix clear: add a feedback mechanism to the alerting system.
Design
The core of the scheme is how the push strategy responds to developer feedback.
Push strategy
For each alert, a developer has three options:
- Ignore
- Reject
- Accept
The push policy for each option:
- Ignore: if the same alert goes unhandled (neither rejected nor accepted) three times in a row, stop pushing it for one day.
- Reject: stop pushing the same alert for three days.
- Accept: stop pushing the same alert and create a bug ticket; resume the alert once the ticket's status changes to fixed.
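The policy above can be written down as a small lookup. This is only a sketch: the names `Feedback`, `Suppression`, and `suppressionFor` are hypothetical and do not appear in the author's code; the durations are the ones listed in the policy.

```typescript
// Hypothetical names; durations follow the push policy described above.
type Feedback = 'ignore' | 'reject' | 'accept';

interface Suppression {
  // How long to stop pushing the same alert, in seconds.
  // null means "until the bug ticket is marked as fixed".
  ttlSeconds: number | null;
  // Whether a bug ticket should be created.
  createBug: boolean;
}

const DAY_SECONDS = 24 * 60 * 60;

// Returns the suppression to apply, or null when pushing should continue.
// unhandledCount is the number of consecutive pushes with no feedback.
function suppressionFor(feedback: Feedback, unhandledCount = 0): Suppression | null {
  switch (feedback) {
    case 'reject':
      return { ttlSeconds: 3 * DAY_SECONDS, createBug: false };
    case 'accept':
      return { ttlSeconds: null, createBug: true };
    case 'ignore':
      // Only the third consecutive unhandled push triggers suppression.
      return unhandledCount >= 3 ? { ttlSeconds: DAY_SECONDS, createBug: false } : null;
  }
}
```

Keeping the policy in one pure function makes the durations easy to review and change without touching the redis or robot code.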
Two points in this strategy need further refinement:
- How to decide that two alerts are the same, that is, how to compute an identifier for an alert message.
- How to link alerts to bug tickets, and how ticket status flows.
Alert identification
An alert message is usually backed by structured data containing a traceid, a message, an error stack, and so on.
If two alerts carry the same message, they mean the same thing and can be treated as the same alert. The alert identifier can therefore be the first 100 bytes of message.
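The article's code calls a `getMsgId` helper without showing it. A minimal sketch of one possible version follows. Taking the first 100 bytes comes from the text; hashing that prefix with SHA-1 is an added assumption, made here only to keep the identifier short and free of characters that are awkward in a redis key or a button value.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical implementation: the article uses getMsgId() but does not define it.
function getMsgId(msg: string, maxBytes = 100): string {
  // First maxBytes bytes of the UTF-8 encoding, as the text describes.
  const prefix = Buffer.from(msg, 'utf8').subarray(0, maxBytes);
  // Assumption: hash the prefix so the id has a fixed length (40 hex characters).
  return createHash('sha1').update(prefix).digest('hex');
}
```

Two messages that differ only after the first 100 bytes (for example in a trailing request id) map to the same identifier, which is the behavior the text asks for.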
Bug tickets and status flow
A bug ticket should record at least the following attributes:
- msgid: identifier of the alert message
- trace: id of the call chain that raised the alert, used to look it up in the logging system
- assign: the person handling the ticket
- status: status of the ticket
Ticket status moves in one direction: Created, then Processing, then Done.
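As a rough sketch of the status flow, assuming it is strictly linear (Created, Processing, Done, matching the numeric values 1, 2, 3 of the `BugStatus` enum in the handler code later in this article). The helper names `canTransition` and `shouldRestoreAlert` are hypothetical.

```typescript
// Same numeric values as the BugStatus enum used later in the article.
const BugStatus = { Created: 1, Processing: 2, Done: 3 } as const;
type BugStatus = (typeof BugStatus)[keyof typeof BugStatus];

// Assumed to be strictly linear: Created -> Processing -> Done.
const NEXT: Record<BugStatus, BugStatus | null> = {
  [BugStatus.Created]: BugStatus.Processing,
  [BugStatus.Processing]: BugStatus.Done,
  [BugStatus.Done]: null,
};

// True only for the single allowed next step.
function canTransition(from: BugStatus, to: BugStatus): boolean {
  return NEXT[from] === to;
}

// Per the push policy, the alert resumes once the ticket is fixed.
function shouldRestoreAlert(status: BugStatus): boolean {
  return status === BugStatus.Done;
}
```

A check like `canTransition` could guard the status update so that a stale button click cannot move a ticket backwards.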
Implementation
An Enterprise WeChat robot is used as the alerting tool (see the developer documentation for how to use the robot).
Pushing alerts
1. Get the callback address of the Enterprise WeChat robot
This is the Webhook address, which is shown when you create a new robot.
2. Output logs to the robot
log4js is used as the logging library.
import log4js, { Level, levels } from 'log4js';
Write a custom appender that sends logs to the robot:
// Note: projectName, ctxStr, ctxStrLimit and requestDataStr come from the
// author's module and are not shown in this article.
function robotAppender(layout, timezoneOffset) {
  return (loggingEvent) => {
    const logCtx = loggingEvent.context;
    // Only logs at ERROR level or above trigger an alert
    if ((loggingEvent.level as Level).isGreaterThanOrEqualTo(levels.ERROR)) {
      // Call the robot alert; catch failures so a failed push is not an unhandled rejection
      sendAlert(`[${(loggingEvent.level as Level).levelStr}]${projectName}`, {
        path: logCtx.path || '', // request path
        ctx: ctxStr.length > ctxStrLimit ? requestDataStr : ctxStr, // request context
        msg: (layout(loggingEvent, timezoneOffset) as string)?.slice(0, ctxStrLimit) || '', // log content
        trace: logCtx.trace || '', // trace_id
      }).catch((e) => console.error('send alert failed', e));
      return true;
    }
  };
}
export function wxConfigure(config: any, layouts: any) {
  let layout = layouts.colouredLayout;
  if (config.layout) {
    layout = layouts.layout(config.layout.type, config.layout);
  }
  return robotAppender(layout, config.timezoneOffset);
}
// Register the appender with log4js
log4js.configure({
  appenders: {
    console: {
      type: 'console',
    },
    // Enterprise WeChat robot notification
    wx: {
      type: { configure: wxConfigure },
      layout: { type: 'basic' },
    },
  },
  categories: {
    default: { appenders: ['console', 'wx'], level: 'debug' },
  },
});
3. Encapsulate the alert function sendAlert
Apply the sending strategy inside the alert function:
- For an alert judged invalid, set a lock in redis to prevent it from being sent again.
- For every alert sent, increment a counter in redis. When the same alert has gone unhandled three times, lock it.
Pay special attention: the counter key in redis must have an expiration time, such as 1h or 1d. Log volume is often very large, and without expiry the keys will exhaust redis memory.
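The sendAlert function that follows reads the counter with GET and writes it back with SETEX, so two service instances can interleave and lose a count. The counting rule on its own can be sketched with INCR plus EXPIRE instead. This is an alternative under stated assumptions, not the author's code: `CounterStore`, `MemoryStore`, `countPush`, and the `_count` key suffix are hypothetical, and the in-memory store exists only so the example runs without redis (it does not simulate expiry).

```typescript
// Minimal store interface; a redis client's INCR and EXPIRE commands fit this shape.
interface CounterStore {
  incr(key: string): Promise<number>;
  expire(key: string, seconds: number): Promise<number>;
}

const DAY_SECONDS = 24 * 60 * 60;
const MAX_UNHANDLED = 3;

// Counts one push of the alert and reports whether it may still be sent.
async function countPush(store: CounterStore, msgId: string): Promise<boolean> {
  const key = `${msgId}_count`;
  const count = await store.incr(key);
  if (count === 1) {
    // First push in this window: start the expiry clock so the key cannot live forever.
    await store.expire(key, DAY_SECONDS);
  }
  return count <= MAX_UNHANDLED;
}

// In-memory stand-in for redis, for illustration only. Expiry is not simulated.
class MemoryStore implements CounterStore {
  private data = new Map<string, number>();

  async incr(key: string): Promise<number> {
    const next = (this.data.get(key) ?? 0) + 1;
    this.data.set(key, next);
    return next;
  }

  async expire(key: string, _seconds: number): Promise<number> {
    return this.data.has(key) ? 1 : 0;
  }
}
```

With this rule the first three pushes of an alert are allowed and the fourth is refused, which is the point at which the function below sets its one-day lock.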
async function sendAlert(title: string, data: Record<string, any>, chatid?: string) {
  // Compute the alert identifier: the first 100 bytes of msg
  const msgId = getMsgId(data.msg);
  // Check for a lock first
  const lockKey = `${msgId}_lock`;
  // ioredis is used here; the redisClient wrapper is omitted
  const lock = await defaultRedisClient.get(lockKey);
  if (lock) {
    console.log('lock exists, skip alert', title, data);
    return;
  }
  // Counter
  let rawCounter = await defaultRedisClient.get(msgId);
  // Never sent before: initialize
  if (!rawCounter) {
    rawCounter = '0';
  }
  const counter = parseInt(rawCounter, 10);
  // Already sent 3 times or more: lock, and skip this push
  if (counter > 2) {
    // Remove the counter first. If removal fails, the next alert simply goes through counting again
    await defaultRedisClient.del(msgId);
    // Add the lock. SETEX takes its TTL in seconds, so this is one day
    await defaultRedisClient.setex(lockKey, 24 * 60 * 60, data?.trace);
    // A notice can be pushed here:
    // (`Three unhandled alerts: ${msgId} \n\n\n
    // Pushing of this alert has stopped and will resume after 24h!
    // `, undefined, chatid);
    return;
  }
  // Otherwise just increment the counter; note the expiration time (in seconds)
  await defaultRedisClient.setex(msgId, 24 * 60 * 60, String(counter + 1));
  const copyedData = {
    env,
    ...data,
  };
  let content = `### ${title} \n`;
  Object.keys(copyedData).forEach((key) => {
    content += `> **${key}**: <font color="comment">${copyedData[key]}</font> \n\n\n`;
  });
  const msgObj = {
    chatid,
    msgtype: 'markdown',
    markdown: {
      content,
      // Note: these buttons collect the feedback
      attachments: [{
        callback_id: 'alert_feedback',
        actions: [{
          name: `reject_${data?.trace}`,
          text: 'Reject',
          type: 'button',
          // The message identifier: the first 100 bytes of msg
          value: msgId,
          replace_text: 'Rejected',
          border_color: '2EAB49',
          text_color: '2EAB49',
        },
        {
          name: `accept_${data?.trace}`,
          text: 'Accept',
          type: 'button',
          value: msgId,
          replace_text: 'Accepted',
          border_color: '2EAB49',
          text_color: '2EAB49',
        },
        ],
      },
      ],
    },
  };
  // url is the robot's callback address
  return axios.post(url, msgObj, {
    headers: {
      'Content-Type': 'application/json',
    },
  });
}
Pay special attention to the attachments passed to the robot API: they add feedback buttons to every alert.
An easily overlooked point is how to set each button's name and value.
From the code above:
{
  name: `accept_${data?.trace}`,
  value: msgId,
},
When the user clicks the button, these two fields are sent back to us unchanged, so use them to carry data:
- msgid: needed for locking, and a required field of the bug ticket.
- trace: the id of the full call chain; needed only when creating the bug ticket, for tracing back into the logging system.
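Because the button name packs two values into one string, it helps to build and parse it in one place. The handler shown later splits the name on every underscore, which would truncate a trace id that itself contains an underscore. The sketch below splits on the first underscore only; `buildButtonName` and `parseButtonName` are hypothetical helpers, not part of the author's code.

```typescript
type Action = 'accept' | 'reject';

// Packs the action and the trace id into the button name.
function buildButtonName(action: Action, trace: string): string {
  return `${action}_${trace}`;
}

// Splits on the FIRST underscore only, so underscores inside the trace id survive.
// Returns null when the name does not start with a known action.
function parseButtonName(name: string): { action: Action; trace: string } | null {
  const sep = name.indexOf('_');
  if (sep < 0) return null;
  const action = name.slice(0, sep);
  if (action !== 'accept' && action !== 'reject') return null;
  return { action: action as Action, trace: name.slice(sep + 1) };
}
```

Returning null for unknown names lets the callback handler ignore clicks from buttons it does not own instead of treating them as rejections.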
Receiving the button click
When a developer clicks a button on an alert, the push strategy should be adjusted; concretely, the specific message is locked so that it is not pushed again.
This requires an HTTP server that also handles Enterprise WeChat's verification requests correctly. (That part is covered in a separate article.)
Here we focus on what happens after the click: Enterprise WeChat sends an HTTP request to our server, and decrypting the request body yields data similar to the following:
{
  From: {
    UserId: 'xxxxxxx',
    Name: 'fjywan',
    Alias: 'fjywan'
  },
  WebhookUrl: 'http://in.qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxx',
  ChatId: 'xxxx',
  GetChatInfoUrl: 'http://in.qyapi.weixin.qq.com/cgi-bin/webhook/get_chat_info?code=xxxxx',
  MsgId: 'xxxxx',
  ChatType: 'group',
  MsgType: 'attachment',
  Attachment: {
    CallbackId: 'alert_feedback',
    Actions: {
      Name: 'accept_traceidxxx',
      Value: 'msgidxxxx',
      Type: 'button'
    }
  },
  TriggerId: 'xxxx',
}
Let's handle this message:
function getLockKey(msgId: string) {
  return `${msgId}_lock`;
}
enum BugStatus {
  Created = 1,
  Processing = 2,
  Done = 3
}
export async function alertFeedBack(payload: AttachmentMsg) {
  const {
    From: {
      Alias,
    },
    Attachment: {
      Actions: {
        Name,
        Value,
      },
    } } = payload;
  const lockKey = getLockKey(Value);
  const [actualName, trace] = Name.split('_');
  // Remove the counter first, if there is one
  await defaultRedisClient.del(Value);
  try {
    // The alert is accepted for handling
    if (actualName === 'accept') {
      // Add a lock with no expiry. SET (rather than SETNX) also replaces a
      // time-limited lock left by an earlier ignore or reject.
      await defaultRedisClient.set(lockKey, Name);
      const now = Date.now();
      // Use the prisma ORM to insert a bug record into MySQL
      await prisma.bug_list.create({
        data: {
          assign: Alias,
          trace,
          msgId: Value,
          status: BugStatus.Created,
          updatedAt: now,
          createdAt: now,
        },
      });
    } else {
      // The alert is rejected:
      // lock it in redis for 3 days (SETEX takes seconds); no reminders until then.
      // (Three consecutive unhandled pushes are locked for one day inside sendAlert.)
      await defaultRedisClient.setex(lockKey, 3 * 24 * 60 * 60, Name);
    }
  } catch (e) {
    console.error('Error while applying the lock', e);
  }
}
Bug ticket records
Create a table with the following structure to record bugs and track their status flow:
CREATE TABLE `bug_list` (
  `id` INTEGER NOT NULL AUTO_INCREMENT,
  `msgId` VARCHAR(191) NOT NULL,
  `trace` VARCHAR(60) NOT NULL,
  `assign` VARCHAR(30) NOT NULL,
  `status` TINYINT(2) NOT NULL,
  `remark` LONGTEXT,
  `updatedAt` BIGINT(20) NOT NULL,
  `createdAt` BIGINT(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY (`msgId`),
  UNIQUE KEY (`trace`)
);
Querying bug tickets
When a user @-mentions the robot, it should reply with the bug tickets currently assigned to that user, along with buttons for changing their status.
The callback handler for @-mentions:
export async function buglist(payload: WxMsg) {
const { From: { Alias }, Text: { Content: raw }, ChatId } = payload;
const title = `To: ${Alias}`;
const result = await prisma.bug_list.findMany({
where: {
assign: Alias,
status: {
in: [1, 2],
},
},
});
if (!result.length) {
// Reply message
sendBack(title, {
Tips : ' Congratulations, there is no pending matter under your name Bug, To maintain the !',
}, ChatId);
return;
}
// Generate Bug The message body of the list
let content = `### ${title} \n`;
const attachments = [{
callback_id: 'bug_status_change',
actions: [],
}] as unknown as Attachments;
result.forEach((one) => {
content += `> **[ Full link log :${one.trace}](xxxx)**: <font color="comment">${one.msgId}</font> \n\n\n`;
// important: Here's for each Bug Generate corresponding processing buttons for single document
attachments[0].actions.push({
name: String(one.id),
text: one.status === 1 ? `${one.id}: Transfer to processing ` : `${one.id}: Customs clearance `,
type: 'button',
// Use here Identification of the message :msg Of front 100 byte
value: one.status === 1 ? '2' : '3',
replace_text: one.status === 1 ? ' In processing ' : ' Processing is complete ',
border_color: '2EAB49',
text_color: '2EAB49',
});
});
sendBack(content, attachments, ChatId);
}When @ Robot time , The effect is as follows :
Bug ticket flow
As with the buttons on an alert, clicking a ticket button updates the ticket's status and, at the same time, removes the redis lock.
export async function bugStatusChange(payload: AttachmentMsg) {
  const {
    From: {
      Alias,
    },
    Attachment: {
      Actions: {
        Name,
        Value,
      },
    } } = payload;
  try {
    // Name is the ticket id and Value is the target status (see buglist)
    const theBug = await prisma.bug_list.update({
      data: {
        status: parseInt(Value, 10),
      },
      where: {
        id: parseInt(Name, 10),
      },
    });
    // Remove the lock; await it so that a failure is caught below
    const { msgId } = theBug;
    const lockKey = getLockKey(msgId);
    await defaultRedisClient.del(lockKey);
  } catch (e) {
    console.error('Error updating bug status', e);
  }
}
Summary
The root cause of floods of invalid alerts is the lack of a feedback mechanism. Using an Enterprise WeChat robot, we closed the loop across alerting, alert feedback, and bug tracking and status flow.
Technical points:
- After a rejection, or after three pushes with no feedback, pushing of the same alert is paused for a while.
- The same alert is identified by the error's message.
- The 「alert blacklist」 is stored in redis, so the scheme works with multiple service instances.
- The robot can be thought of as a command line, which makes it possible to build a friendlier command-line-style tool on top of it.
- Metric-based alerts are usually triggered by thresholds and are often already rate-limited (to cope with fluctuation around the threshold), so they do not need a feedback mechanism.
The runnable code is still being tidied up and will be put on GitHub later.
Further work
Everything above assumes that a full-link logging system exists. Alerting alone is not enough; an alert should also lead quickly to the relevant logs so that the problem can be located.
Later articles will cover how to build a full-link logging system and how to develop Enterprise WeChat robots.