If you don't know about this inspection platform, you're missing out!
2022-06-24 17:37:00 【Tencent proprietary cloud】
Introduction
The inspection platform is an out-of-the-box inspection product for operations and maintenance (O&M) personnel that provides automated O&M capabilities for diagnosing problems automatically. It gives O&M engineers automatic patrol inspection and patrol reports, and for each problem in a report it also offers optimization suggestions drawn from the experience of O&M experts, for reference during repair. O&M personnel can also customize the platform to their needs: using the platform's atomic patrol capabilities, which include script patrol, HTTP(S) interface patrol, and IP patrol, they can flexibly define their own patrol items and add them to regular patrol tasks. The platform also classifies inspections across multiple vertical products and multiple dimensions, so that products can be assigned to different O&M personnel and different users can subscribe to different patrol reports, greatly reducing the workload of regular manual inspection for O&M engineers.
01 Inspection status
As more cloud products are connected to the VPC and it is delivered to more customers, the daily operation of the cloud platform inevitably runs into difficult, hidden problems that give O&M personnel a headache. To ensure stable operation of the cloud platform and business continuity, monitoring, logging, and patrol inspection have become standard components of the cloud platform. Patrol inspection is an important link in the O&M assurance system: it helps O&M personnel find hidden dangers in the system and deal with them early, before they cause trouble.
In the old patrol inspection scheme, inspection was carried out with an execution machine + timer + Excel. In large clusters, this scheme gradually exposed some problems:
- Patrol task execution depends on the execution machine, which is a single point of failure
- Patrol results are scattered across Excel files, which makes them hard to collect, analyze, and summarize
- Patrol scripts are developed in silos and are not integrated with the CMDB, monitoring and alarming, the message platform, or other systems
- Patrol scripts are scattered across cloud products, with no unified management platform
- ......
There are many more problems of this kind. As the requirements on the cloud platform's availability, reliability, performance, water level (resource usage), and security keep rising, more and more products need to be inspected, so a flexible, stable, and scalable inspection system is extremely important.
02 Platform features
The patrol inspection platform has the following advantages:
Out of the box: The platform comes preconfigured with a large number of patrol items and patrol plans, and it initiates patrol tasks and sends patrol reports regularly and automatically;
Expert optimization suggestions: After a patrol task completes successfully, the platform automatically sends a patrol report. The report includes optimization suggestions drawn from the experience of O&M experts, so even novice O&M personnel can diagnose O&M problems relatively easily;
Flexible customization of patrol items: Senior O&M personnel can define custom patrol items using the platform's script upload, HTTP(S), and IP patrol methods, and new patrol items that follow the access specification can be added quickly;
Report and alarm subscription: Users can subscribe to the patrol reports of different patrol items by email, WeChat, SMS, and so on, and can also subscribe by alarm level so that they receive the latest patrol results promptly.
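As an illustration of what an IP patrol over TCP might boil down to, here is a minimal sketch. The function name `tcp_check` and its default timeout are assumptions made for this example; the platform's actual probe implementation is not described in the article.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not the platform's real probe.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A custom patrol item could call such a check against each target address and report the targets that fail.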
03 Product architecture
Application layer
The application layer is the unified entry point of the patrol platform. It provides patrol item management, plan management, task management, detailed report viewing and download, and other functions.
Overview: Displays the TOP5 ranking and trend chart of alarm counts (serious / warning / remind) in historical reports, a trend chart of the number of historical plans and tasks, and the distribution (proportion and count) of patrol inspections in historical reports by product, patrol group, and alarm category.
Patrol plan: Customize the patrol items, execution frequency, and alarm receivers of a patrol plan, and start or stop the plan.
Patrol item: Supports script patrol, HTTP(S) patrol, and IP patrol (TCP, UDP, ICMP, and other protocols). A timeout can be set for each patrol item, and alarm rules can be configured for patrol results.
Patrol task: View task progress, stop a task, and view the task log.
Patrol report: View the overview and details of a patrol report, and download the report in HTML/CSV format.
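To make the relationship between a patrol item, its timeout, and its alarm rules more concrete, here is a minimal sketch. The class names, the threshold-based rule shape, and the default timeout are all assumptions for illustration; only the three alarm levels (serious / warning / remind) and the three patrol types come from the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Alarm levels named in the overview, ordered most severe first.
LEVELS = ("serious", "warning", "remind")


@dataclass
class AlarmRule:
    level: str        # one of LEVELS
    threshold: float  # hypothetical rule: fire when value >= threshold


@dataclass
class PatrolItem:
    name: str
    kind: str                  # "script", "http", or "ip"
    timeout_seconds: int = 60  # hypothetical default
    rules: List[AlarmRule] = field(default_factory=list)

    def evaluate(self, value: float) -> Optional[str]:
        """Return the most severe alarm level whose rule matches, or None."""
        matched = {r.level for r in self.rules if value >= r.threshold}
        for level in LEVELS:
            if level in matched:
                return level
        return None


item = PatrolItem(
    name="disk_usage_percent",
    kind="script",
    rules=[AlarmRule("warning", 80.0), AlarmRule("serious", 95.0)],
)
```

With these rules, a patrol result of 97 evaluates to the serious level, 85 to warning, and 10 to no alarm.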
Storage layer
The storage layer saves the data of the patrol platform. It is divided into 3 modules: Etcd, MySQL, and package management:
- Etcd: Stores dynamic data, such as patrol plans, patrol items, and patrol tasks in execution. Etcd is introduced here mainly for its Watch mechanism, which is used to implement the timed scheduling of plans and the triggering of patrol tasks;
- MySQL: Stores static data, such as patrol reports, completed patrol tasks, and operation records;
- Package management: An independent service outside the patrol platform that provides software package upload, download, and version management. It is introduced here to manage patrol scripts.
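The idea of using a watch to trigger patrol tasks can be sketched with an in-memory stand-in. This is not the Etcd client API: `TinyWatchStore` and the key prefixes are hypothetical, and the sketch only shows the pattern of a consumer being notified when a task key is written.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, str], None]


class TinyWatchStore:
    """In-memory stand-in for a key-value store with prefix watches."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[Callback]] = defaultdict(list)

    def watch_prefix(self, prefix: str, callback: Callback) -> None:
        """Register a callback for every write under the given prefix."""
        self._watchers[prefix].append(callback)

    def put(self, key: str, value: str) -> None:
        """Store the value, then notify watchers whose prefix matches."""
        self._data[key] = value
        for prefix, callbacks in self._watchers.items():
            if key.startswith(prefix):
                for cb in callbacks:
                    cb(key, value)


triggered: List[str] = []
store = TinyWatchStore()
# Hypothetical key layout: a consumer watches the task prefix only.
store.watch_prefix("/inspection/jobs/", lambda k, v: triggered.append(k))
store.put("/inspection/jobs/job-1", "pending")   # notifies the watcher
store.put("/inspection/plans/plan-1", "daily")   # different prefix, ignored
```

After these two writes, only the task key has reached the watcher, which is the behavior a task consumer would rely on.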
Logic layer & Scheduling layer
The logic layer & scheduling layer is responsible for the core logic: scheduling and executing patrol tasks, collecting results, evaluating rules, and sending reports. The patrol platform schedules and executes patrol tasks with the help of the process choreography engine, a basic component of VPC automated O&M that provides process choreography and dispatches processes to designated machines for execution. For choreography it provides timeout control, sub-processes, branch splitting and merging, and so on; for execution it provides command execution, collection of execution results, context sharing of execution results, and so on. These capabilities are sufficient to cover the needs of the patrol platform:
- InspectionItem Controller: Translates the patrol items of the patrol platform into process templates recognized by the process choreography engine, and maintains the mapping between patrol items and process templates. Each patrol item corresponds to one process template in the process choreography engine;
- Job Controller: Consumes patrol tasks from Etcd. A patrol task is associated with one or more patrol items. First, a parent process template is generated from the patrol items associated with the task; each patrol item corresponds to a Node of type SubWorkflow in the parent process template. Next, a process instance is created from the parent process template, so each patrol task corresponds to one process instance in the process choreography engine. After the instance is created, its status is queried continuously until it reaches a final state (Succeeded, Failed). Then the patrol results are collected from the execution output of the process instance. Finally, rules are evaluated against the patrol results, and the patrol report is generated and sent through the message platform;
- CronJob Controller: Periodically creates patrol tasks, which in practice means producing a patrol task into Etcd.
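The Job Controller steps described above can be sketched roughly as follows. The process choreography engine's real API is not given in the article, so `FakeEngine`, `Node`, and both helper functions are hypothetical stand-ins; only the SubWorkflow node type and the final states Succeeded and Failed come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

FINAL_STATES = {"Succeeded", "Failed"}  # final states named in the text


@dataclass
class Node:
    name: str
    type: str      # "SubWorkflow" for a patrol item
    template: str  # process template mapped to the patrol item


def build_parent_template(item_to_template: Dict[str, str],
                          items: List[str]) -> List[Node]:
    """Create one SubWorkflow node per patrol item associated with the task."""
    return [Node(name=i, type="SubWorkflow", template=item_to_template[i])
            for i in items]


class FakeEngine:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, states: List[str]) -> None:
        self._states = iter(states)

    def get_status(self) -> str:
        return next(self._states)


def wait_for_final_state(engine: FakeEngine, max_polls: int = 100) -> str:
    """Poll the instance status until it reaches a final state."""
    for _ in range(max_polls):
        status = engine.get_status()
        if status in FINAL_STATES:
            return status
    raise TimeoutError("process instance did not reach a final state")
```

A real controller would presumably wait between polls and then go on to collect results, evaluate rules, and send the report; those steps are omitted here.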
In large-scale patrol inspection, to keep the patrol platform stable and to prevent a large number of concurrent tasks from overwhelming the patrol targets, the platform provides a concurrency control strategy and a timeout control strategy. For concurrency control there are 3 policies, and users can choose the one that fits their business scenario:
- AllowConcurrent: Allow concurrency. If the previous patrol task has not finished when the next scheduled time point arrives, the scheduler creates the new patrol task as usual, and the two tasks run at the same time;
- ForbidConcurrent: Forbid concurrency. If the previous patrol task has not finished when the next scheduled time point arrives, the scheduler skips creating a task for this time point, and no new task is created until the previous patrol task has completed;
- ReplaceConcurrent: Replace. If the previous patrol task has not finished when the next scheduled time point arrives, the scheduler stops the previous task and then creates a new patrol task.
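The three policies can be summarized as a small decision function. This is a sketch of the behavior described above rather than the platform's scheduler; the function name and the action strings are made up for illustration.

```python
from enum import Enum
from typing import List


class ConcurrencyPolicy(Enum):
    ALLOW = "AllowConcurrent"
    FORBID = "ForbidConcurrent"
    REPLACE = "ReplaceConcurrent"


def on_schedule_tick(policy: ConcurrencyPolicy,
                     previous_running: bool) -> List[str]:
    """Return the scheduler actions for one scheduled time point."""
    if not previous_running or policy is ConcurrencyPolicy.ALLOW:
        return ["create"]  # nothing in the way, or overlap is allowed
    if policy is ConcurrencyPolicy.FORBID:
        return []          # skip this time point entirely
    return ["stop_previous", "create"]  # REPLACE: stop, then start anew
```

When the previous task has already finished, all three policies behave the same and simply create the new task.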
For timeout control, a timeout can be set for each patrol item. A patrol item that times out is automatically killed, and execution continues with the next patrol item.
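A minimal sketch of per-item timeout control, assuming each patrol item is run as a child process. The function `run_items` and the result labels are hypothetical; the platform itself delegates this to the process choreography engine.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def run_items(items: List[Tuple[str, List[str], float]]) -> Dict[str, str]:
    """Run each (name, command, timeout) in order; a timeout never blocks the rest."""
    results: Dict[str, str] = {}
    for name, cmd, timeout in items:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout)
            results[name] = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            results[name] = "timeout"
    return results


demo_items = [
    ("slow_item", [sys.executable, "-c", "import time; time.sleep(5)"], 1.0),
    ("fast_item", [sys.executable, "-c", "print('ok')"], 10.0),
]
```

Calling `run_items(demo_items)` marks the slow item as timed out after about one second and still runs the fast item afterwards.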
Execution layer
The execution layer consists of one or more execution machines. Each execution machine is deployed with the agent that the process engine's command channel depends on, together with a Python environment. The agent executes the patrol tasks issued by the logic layer & scheduling layer, and the execution result of a task is collected by the patrol platform from stdout or from data exported with the common function set_output. Patrol inspection depends on the stability of the target service, and network jitter can cause a patrol to fail. To reduce the impact of such environmental factors, a graceful retry mechanism has been added to the execution layer: when a patrol fails it is retried, by default 3 times, and each retry waits a random time within a certain range.
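The retry behavior can be sketched as below. The default of 3 retries comes from the article; the wait range of 0.5 to 2.0 seconds, the function name `run_with_retry`, and the injectable `sleep` parameter are assumptions made for this example.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(check: Callable[[], T],
                   retries: int = 3,
                   min_wait: float = 0.5,
                   max_wait: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run check(); on failure, retry up to `retries` times with a random wait."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):  # first attempt plus the retries
        try:
            return check()
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                # Random wait within the range spreads out repeated attempts.
                sleep(random.uniform(min_wait, max_wait))
    assert last_error is not None
    raise last_error
```

Randomizing the wait keeps many failed patrols from retrying against the same target at the same moment, which is one plausible reason for the design described above.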
04 Effect of implementation
Built-in patrol items
The patrol platform has 400+ built-in patrol items, covering core products such as compute, network, storage, and platform. Patrol types cover availability, reliability, performance, water level, and security. These built-in patrol items work out of the box: patrol inspection can be started with one click, with no technical threshold.
Inspection results
The patrol inspection platform has been running stably for more than half a year. It has onboarded 400+ patrol items, which cover most VPC products, has run 30,000+ patrol tasks, and has found 200,000+ system hidden dangers in total.
[Figure: trend chart of system hidden dangers found by patrol inspection in the last 7 days]
[Figure: distribution of patrol items in the patrol reports of the last 7 days]
At present, the patrol platform standardizes patrol inspection through development specifications for patrol items, and it makes the patrol process platform-based and automated by scheduling and executing patrols through the platform. However, hidden dangers found by patrol inspection still have to be resolved manually, so patrol inspection + governance cannot yet be fully automated. The next step is to link with intelligent diagnosis, explore big data analysis, and work with the O&M platform to handle problems automatically.
-END-