If you don't know about this inspection platform yet, you're missing out!
2022-06-24 17:37:00 【Tencent proprietary cloud】
Introduction
The inspection platform is an out-of-the-box inspection product for operations and maintenance (O&M) personnel that provides automated O&M capabilities for diagnosing problems automatically. It not only gives O&M engineers automated inspections and inspection reports, but also attaches optimization suggestions, drawn from the experience of O&M experts, to the problems in each report for reference during repair. O&M personnel can also customize the platform to their own needs, using its diverse atomic inspection capabilities to define personalized inspection items and add them to regular inspection tasks. The atomic inspection capabilities include script inspection, HTTP(S) interface inspection, and IP probing. The platform also classifies inspections across multiple vertical products and multiple dimensions, so inspections can be assigned to different personnel by product and different users can subscribe to different inspection reports, greatly reducing the workload of regular manual inspection for O&M engineers.
01 Current state of inspection
As more and more cloud products are onboarded to the proprietary cloud and more and more customers are delivered, the daily operation of the cloud platform inevitably runs into difficult, hidden problems that give O&M personnel headaches. To keep the cloud platform stable and the business continuous, monitoring, logging, and inspection have become standard components of the cloud platform. Inspection is one of the key links in the O&M assurance system: it helps O&M personnel find hidden risks in the system and deal with them early, before they turn into failures.
In the old inspection scheme, inspection was carried out with an execution machine + a timer + Excel. In large clusters, this scheme gradually exposed several problems:
- Inspection tasks depend on a single execution machine, which is a single point of failure
- Inspection results are scattered across Excel files, which makes them hard to collect, analyze, and summarize
- Inspection scripts are developed in silos and are not integrated with the CMDB, monitoring and alarming, the message platform, or other systems
- Inspection scripts are scattered among the cloud products, with no unified management platform
- ......
There are many more problems like these. As requirements for the cloud platform's availability, reliability, performance, capacity (water level), and security keep rising, more and more products need to be inspected, so a flexible, stable, and scalable inspection system is extremely important.
02 Platform features
The inspection platform has the following advantages:
Out of the box: the platform ships with a large number of inspection items and inspection plans configured by default, and it initiates inspection tasks and sends inspection reports automatically on a schedule;
Expert optimization suggestions: after an inspection task completes successfully, the platform automatically sends an inspection report. The report includes optimization suggestions based on the experience of O&M experts, so even novice O&M staff can diagnose O&M problems relatively easily;
Flexible customization of inspection items: senior O&M personnel can customize inspection items using the platform's script upload, HTTP(S), and IP probing methods, and new inspection items can be added quickly by following the access specification;
Report and alarm subscription: users can subscribe to the inspection reports of different inspection items by email, WeChat, SMS, and so on, and can also subscribe to the latest inspection results by alarm level.
03 Product architecture
Application layer
The application layer is the unified entry point of the inspection platform. It provides inspection item management, plan management, task management, and viewing and downloading of report details.
Overview: shows the TOP 5 ranking and trend chart of alarm counts (serious / warning / reminder) in historical reports; a trend chart of the number of historical plans and tasks; and the distribution (proportion and count) of inspections in historical reports by product, inspection group, and alarm category.
Inspection plan: customize the inspection items, execution frequency, and alarm recipients of an inspection plan, and start or stop the plan.
Inspection item: supports script inspection, HTTP(S) probing, and IP probing (TCP, UDP, ICMP, and other protocols). A timeout can be set for each inspection item, and alarm rules can be configured for inspection results.
Inspection task: view task progress, stop a task, and view task logs.
Inspection report: view the overview and details of an inspection report, and download the report in HTML/CSV format.
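As an illustration of what an IP probing inspection (patrol) item with a timeout and an alarm rule might look like, here is a minimal Python sketch. It is not the platform's implementation: the names `tcp_probe`, `ProbeResult`, and `alarm_level`, as well as the latency threshold, are assumptions made for this example. Only TCP is shown.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_ms: float
    error: str = ""


def tcp_probe(host: str, port: int, timeout_s: float = 3.0) -> ProbeResult:
    """Try to open a TCP connection and report whether it succeeded in time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            pass
        return ProbeResult(True, (time.monotonic() - start) * 1000)
    except OSError as exc:  # covers timeouts and refused connections
        return ProbeResult(False, (time.monotonic() - start) * 1000, str(exc))


def alarm_level(result: ProbeResult, warn_latency_ms: float = 200.0) -> str:
    """Hypothetical alarm rule: map a probe result to an alarm level."""
    if not result.ok:
        return "serious"
    if result.latency_ms > warn_latency_ms:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Assumes something may be listening locally on port 22; adjust as needed.
    result = tcp_probe("127.0.0.1", 22, timeout_s=1.0)
    print(result, alarm_level(result))
```

A real rule set would likely be configurable per item; the fixed threshold here only shows the shape of the check.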
Storage layer
The storage layer stores the inspection platform's data and is divided into three modules: Etcd, MySQL, and package management:
- Etcd: stores dynamic data, such as inspection plans, inspection items, and inspection tasks in execution. Etcd is used here mainly for its Watch mechanism, which implements the scheduled dispatch of plans and the triggering of inspection tasks;
- MySQL: stores static data, such as inspection reports, completed inspection tasks, and operation records;
- Package management: an independent service outside the inspection platform that provides software package upload, download, and version management. It is used here to manage inspection scripts.
Logic layer & Scheduling layer
The logic layer & scheduling layer is responsible for the core logic: scheduling and executing inspection tasks, collecting results, evaluating rules, and sending reports. The inspection platform schedules and executes inspection tasks with the help of the process orchestration engine, a basic component of the proprietary cloud's automated O&M. The engine provides process orchestration and dispatches processes to designated machines for execution. For orchestration it offers timeout control, sub-processes, branch splitting and merging, and more; for execution it offers command execution, result collection, sharing of execution results through context, and more. These capabilities are sufficient to cover the needs of the inspection platform:
- InspectionItem Controller: translates the platform's inspection items into process templates that the process orchestration engine recognizes, and maintains the mapping between inspection items and process templates. Each inspection item corresponds to one process template in the engine;
- Job Controller: consumes the inspection tasks in Etcd. An inspection task is associated with one or more inspection items. First, a parent process template is generated from the inspection items associated with the task, where each inspection item corresponds to one node of type SubWorkflow in the parent template. Next, a process instance is created from the parent template, so each inspection task corresponds to one process instance in the engine. After the instance is created, its status is polled until it reaches a final state (Succeeded or Failed). The inspection results are then collected from the execution output of the process instance. Finally, rules are evaluated against the results, and the inspection report is generated and sent through the message platform;
- CronJob Controller: creates inspection tasks periodically, which in practice means producing an inspection task into Etcd.
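The Job Controller flow described above (build a parent template with one SubWorkflow node per inspection (patrol) item, then poll the process instance until it reaches a final state) can be sketched as follows. This is a simplified illustration, not the engine's actual API: the template layout, `build_parent_template`, `wait_for_final_state`, and the `get_status` callback are all hypothetical.

```python
import time
from typing import Callable, Dict, List

# Final states named in the article.
FINAL_STATES = {"Succeeded", "Failed"}


def build_parent_template(item_to_template: Dict[str, str], items: List[str]) -> dict:
    """Create one SubWorkflow node per inspection item, pointing at its template."""
    return {
        "nodes": [
            {"type": "SubWorkflow", "item": item, "template": item_to_template[item]}
            for item in items
        ]
    }


def wait_for_final_state(
    get_status: Callable[[], str],
    poll_interval_s: float = 1.0,
    max_polls: int = 600,
) -> str:
    """Poll the process instance until it is Succeeded or Failed."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINAL_STATES:
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError("process instance did not reach a final state")


if __name__ == "__main__":
    template = build_parent_template({"disk": "tpl-1", "cpu": "tpl-2"}, ["disk", "cpu"])
    print(template)
    # Stand-in for querying the engine: the instance finishes on the third poll.
    statuses = iter(["Running", "Running", "Succeeded"])
    print(wait_for_final_state(lambda: next(statuses), poll_interval_s=0))
```

The `max_polls` bound is an addition for the sketch so that the loop cannot spin forever; the article itself only says the status is queried until a final state is reached.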
Under large-scale inspection, to keep the inspection platform stable and to prevent a large number of concurrent tasks from overwhelming the inspection targets, the platform provides a concurrency control strategy and a timeout control strategy. For concurrency control it offers three policies, and users can choose the one that fits their business scenario:
- AllowConcurrent: allow concurrency. If the previous inspection task has not finished when the next scheduled time point arrives, the scheduler creates the new inspection task as usual, and the two tasks run at the same time;
- ForbidConcurrent: forbid concurrency. If the previous inspection task has not finished when the next scheduled time point arrives, the scheduler skips creating a task for that time point, and keeps doing so until the previous task has finished;
- ReplaceConcurrent: replace. If the previous inspection task has not finished when the next scheduled time point arrives, the scheduler stops the previous task and then creates a new inspection task.
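The three policies amount to a small decision made each time a scheduled time point arrives. A minimal sketch of that decision, assuming a hypothetical `on_schedule_tick` function that returns the actions the scheduler would take (the action names are made up for illustration):

```python
from enum import Enum
from typing import List


class ConcurrencyPolicy(Enum):
    ALLOW = "AllowConcurrent"
    FORBID = "ForbidConcurrent"
    REPLACE = "ReplaceConcurrent"


def on_schedule_tick(policy: ConcurrencyPolicy, previous_running: bool) -> List[str]:
    """Return the scheduler's actions when a scheduled time point arrives."""
    if not previous_running or policy is ConcurrencyPolicy.ALLOW:
        return ["create"]  # nothing in the way, or concurrency is allowed
    if policy is ConcurrencyPolicy.FORBID:
        return []  # skip this time point while the previous task runs
    return ["stop_previous", "create"]  # REPLACE: stop the old task first


if __name__ == "__main__":
    for policy in ConcurrencyPolicy:
        print(policy.value, on_schedule_tick(policy, previous_running=True))
```

When no previous task is running, all three policies behave the same and simply create the task.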
For timeout control, a timeout can be set for each inspection item. An inspection item that times out is automatically killed, and execution continues with the next inspection item.
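The per-item timeout behavior can be illustrated with Python's standard `subprocess` module, whose `run` function kills the child process when its `timeout` expires. This is only a sketch of the idea, not how the platform's orchestration engine enforces timeouts; `run_items` and the result labels are hypothetical.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def run_items(items: List[Tuple[str, List[str], float]]) -> Dict[str, str]:
    """Run each inspection command in order; a timed-out item does not block the rest."""
    results: Dict[str, str] = {}
    for name, cmd, timeout_s in items:
        try:
            # subprocess.run kills the child process when the timeout expires.
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
            results[name] = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
    return results


if __name__ == "__main__":
    print(run_items([
        ("slow_item", [sys.executable, "-c", "import time; time.sleep(5)"], 0.5),
        ("fast_item", [sys.executable, "-c", "print('healthy')"], 10.0),
    ]))
```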
Execution layer
The execution layer consists of one or more execution machines. Each execution machine is deployed with the agent that the process engine's command channel depends on, plus a Python environment. The agent executes the inspection tasks issued by the logic layer & scheduling layer, and the inspection platform collects each task's result from stdout or from data exported through the common function set_output. Inspection depends on the stability of the target service, and network jitter can cause an inspection to fail. To reduce the impact of such environmental factors, a graceful retry mechanism has been added to the execution layer: when an inspection fails it is retried, 3 times by default, waiting a random time within a certain range before each retry.
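The retry behavior can be sketched as below. The default of 3 retries comes from the article; the wait range of 0.5 to 2.0 seconds, the function name `run_with_retry`, and the reading of "3 retries" as three attempts after the initial one are assumptions made for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_retries: int = 3,            # default from the article
    min_wait_s: float = 0.5,         # assumed lower bound of the random wait
    max_wait_s: float = 2.0,         # assumed upper bound of the random wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task; on failure wait a random time, then retry up to max_retries times."""
    retries = 0
    while True:
        try:
            return task()
        except Exception:
            if retries >= max_retries:
                raise  # retries exhausted: surface the last failure
            retries += 1
            sleep(random.uniform(min_wait_s, max_wait_s))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_probe() -> str:
        # Simulates network jitter: fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("network jitter")
        return "healthy"

    print(run_with_retry(flaky_probe, min_wait_s=0.01, max_wait_s=0.05))
```

Randomizing the wait spreads retries out so that many failed inspections do not all hit the target again at the same moment.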
04 Results in practice
Built-in inspection items
The inspection platform has 400+ built-in inspection items covering core products such as compute, network, storage, and platform. Inspection types cover availability, reliability, performance, capacity (water level), and security. These built-in inspection items work out of the box with one-click inspection and no technical barrier.
Inspection results
The inspection platform has been running stably for more than half a year. It has onboarded 400+ inspection items covering most products of the proprietary cloud, has run 30,000+ inspection tasks, and has found 200,000+ hidden system risks in total.
[Figure: trend of hidden system risks found by inspection over the last 7 days]
[Figure: distribution of inspection items in the inspection reports of the last 7 days]
At present, the inspection platform standardizes inspection by defining development specifications for inspection items, and automates the inspection process by scheduling and executing inspections through the platform. However, the hidden risks found by inspection still have to be resolved manually, so inspection + governance is not yet fully automated. The next step is to integrate with intelligent diagnosis, explore big data analysis, and work with the O&M platform to achieve automated handling.
-END-