
If you don't know this inspection platform yet, you're missing out!

2022-06-24 17:37:00 Tencent proprietary cloud

Introduction

The inspection platform is an out-of-the-box inspection product for operations and maintenance (O&M) personnel that provides automated O&M capabilities for diagnosing problems automatically. It gives O&M engineers automated inspections and inspection reports, and for each problem in a report it offers optimization suggestions drawn from the experience of O&M experts, for reference during repair. O&M personnel can also build custom inspection items from the platform's atomic inspection capabilities (script inspection, HTTP(S) interface inspection, and IP inspection) and add them to scheduled inspection tasks. The platform also supports classification across multiple vertical products and multiple inspection dimensions: inspection items can be assigned to different O&M personnel by product, and different users can subscribe to different inspection reports, which greatly reduces the routine manual inspection workload of O&M engineers.

01 Current state of inspection

As more and more cloud products are connected to the VPC and it is delivered to more and more customers, the day-to-day operation of the cloud platform always turns up intractable, hidden problems that give O&M personnel a headache. To keep the cloud platform running stably and the business continuous, monitoring, logging, and inspection have become standard components of the cloud platform. Inspection is an important link in the O&M assurance system: it helps O&M personnel find hidden risks in the system and deal with them early, before they cause trouble.

In the old inspection scheme, inspections were carried out with an executor machine + a timer + Excel. In large clusters, this scheme gradually exposed several problems:

  • Inspection task execution depends on a single executor machine, which is a single point of failure
  • Inspection results are scattered across Excel files, which makes collecting, analyzing, and aggregating them difficult
  • Inspection scripts are developed in silos and are not integrated with the CMDB, monitoring and alarming, the message platform, or other systems
  • Inspection scripts are scattered across the individual cloud products, with no unified management platform
  • ......

There are many more problems of this kind. As requirements on the cloud platform's availability, reliability, performance, capacity watermark, and security keep rising, more and more products will need to be inspected, so a flexible, stable, and scalable inspection system is extremely important.

02 Platform features

The inspection platform has the following advantages:

Out of the box: The platform ships with a large number of inspection items and inspection plans configured by default, and it automatically starts inspection tasks on schedule and sends inspection reports;

Expert optimization suggestions: After an inspection task completes successfully, the platform automatically sends an inspection report. The report includes optimization suggestions drawn from the experience of O&M experts, so even novice O&M personnel can diagnose O&M problems relatively easily;

Flexible customization of inspection items: Senior O&M personnel can define custom inspection items using the platform's script upload, HTTP(S), and IP inspection methods, and new inspection items can be added quickly by following the access specification;

Report and alarm subscriptions: Users can subscribe to the inspection reports of different inspection items by email, WeChat, SMS, and so on, and can also subscribe to the latest inspection results by alarm level so that they are notified promptly.

03 Product architecture

Application layer

The application layer is the unified entry point of the inspection platform. It provides inspection item management, plan management, task management, and viewing and downloading of report details.

Overview: Shows the TOP5 ranking and trend chart of alarm counts (serious / warning / remind) in historical reports; a trend chart of the number of historical plans and tasks; and the proportion and count of inspections in historical reports, broken down by product, inspection group, and alarm category.

Inspection plan: Customize a plan's inspection items, execution frequency, and alarm recipients, and start or stop the plan.

Inspection item: Supports script inspection, HTTP(S) inspection, and IP inspection (TCP, UDP, ICMP, and other protocols). A timeout can be set for each inspection item, and alarm rules can be configured for inspection results.
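
As a rough illustration of an IP inspection item with a timeout and alarm rules, the sketch below performs a TCP connect check and maps the result to an alarm level. The function names, the rule format, and the latency thresholds are assumptions made for this example, not the platform's actual interface; only the three alarm levels (serious / warning / remind) come from the article.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


def tcp_check(host: str, port: int, timeout_s: float = 3.0) -> Optional[float]:
    """Return the TCP connect latency in milliseconds, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None


@dataclass
class AlarmRule:
    level: str  # "serious", "warning" or "remind"
    matches: Callable[[Optional[float]], bool]


# Hypothetical rules, ordered from most to least severe; the first match wins.
DEFAULT_RULES: List[AlarmRule] = [
    AlarmRule("serious", lambda ms: ms is None),
    AlarmRule("warning", lambda ms: ms is not None and ms > 500),
    AlarmRule("remind", lambda ms: ms is not None and ms > 100),
]


def evaluate(latency_ms: Optional[float],
             rules: List[AlarmRule] = DEFAULT_RULES) -> Optional[str]:
    """Return the alarm level for a check result, or None if no rule fires."""
    for rule in rules:
        if rule.matches(latency_ms):
            return rule.level
    return None
```

Calling `evaluate(tcp_check(host, port))` for a target of your choice would then give the alarm level for a single inspection result.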

Inspection task: View task progress, stop a task, and view task logs.

Inspection report: View the overview and details of an inspection report, and download the report in HTML or CSV format.

Storage layer

The storage layer saves the inspection platform's data. It is divided into three modules: Etcd, MySQL, and package management:

  • Etcd: Stores dynamic data, such as inspection plans, inspection items, and inspection tasks that are still executing. Etcd is used here mainly for its Watch mechanism, which drives the scheduled dispatch of plans and the triggering of inspection tasks;
  • MySQL: Stores static data, such as inspection reports, completed inspection tasks, and operation records;
  • Package management: An independent service outside the inspection platform that provides software package upload, download, and version management. It is used here to manage inspection scripts.
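
To make the role of the Watch mechanism concrete, here is a minimal in-memory sketch of the pattern: a consumer registers interest in a key prefix and is called back whenever a matching key is written. `TinyWatchStore` and the key layout are hypothetical stand-ins invented for this example; this is not the Etcd client API, and the platform's real key scheme is not described in the article.

```python
from collections import defaultdict
from typing import Callable, Dict, List

WatchCallback = Callable[[str, str], None]


class TinyWatchStore:
    """In-memory stand-in for a watchable key-value store (not the Etcd API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watchers: Dict[str, List[WatchCallback]] = defaultdict(list)

    def watch_prefix(self, prefix: str, callback: WatchCallback) -> None:
        """Register a callback for every future write under the given prefix."""
        self._watchers[prefix].append(callback)

    def put(self, key: str, value: str) -> None:
        """Store the value, then notify every watcher whose prefix matches the key."""
        self._data[key] = value
        for prefix, callbacks in list(self._watchers.items()):
            if key.startswith(prefix):
                for callback in callbacks:
                    callback(key, value)


store = TinyWatchStore()
triggered: List[str] = []

# A consumer (for example a job controller) reacts only to newly written tasks.
store.watch_prefix("/inspection/jobs/", lambda key, value: triggered.append(key))

store.put("/inspection/plans/daily", "every day at 02:00")  # no watcher fires
store.put("/inspection/jobs/job-001", "pending")            # watcher fires
print(triggered)
```

The point of the pattern is that the producer (the component writing tasks) and the consumer (the component executing them) never call each other directly; the store is the only coupling between them.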

Logic layer & Scheduling layer

The logic layer & scheduling layer handles the core logic: scheduling and executing inspection tasks, collecting results, evaluating rules, and sending reports. The inspection platform relies on the process choreography engine to schedule and execute inspection tasks. The process choreography engine is a basic component of VPC automated O&M; it provides process choreography and the ability to dispatch processes to designated machines for execution. On the choreography side it offers timeout control, sub-processes, branch splitting and merging, and so on; on the execution side it offers command execution, collection of execution results, sharing of execution-result context, and so on. These capabilities are sufficient for the needs of the inspection platform:

  • InspectionItem Controller: Translates the inspection platform's inspection items into process templates that the process choreography engine recognizes, and maintains the mapping between inspection items and process templates. Each inspection item corresponds to one process template in the process choreography engine;
  • Job Controller: Consumes inspection tasks from Etcd. An inspection task is associated with one or more inspection items. First, it generates a parent process template from the inspection items associated with the task; each inspection item corresponds to one node of type SubWorkflow in the parent process template. It then creates a process instance from the parent process template, so each inspection task corresponds to one process instance in the process choreography engine. After the instance is created, it polls the instance's status until the instance reaches a final state (Succeeded or Failed). It then collects the inspection results from the instance's execution output. Finally, it evaluates the rules against the inspection results, generates the inspection report, and sends it through the message platform;
  • CronJob Controller: Periodically creates inspection tasks; in practice, it writes an inspection task into Etcd.
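
The Job Controller steps described above can be sketched roughly as follows. `FakeEngine` is a hypothetical stand-in for the process choreography engine, whose real API the article does not show; the template layout, method names, and the "finishes on the third poll" behavior are all assumptions for illustration. Only the SubWorkflow node type and the final states Succeeded and Failed come from the text.

```python
import time
from typing import Dict, Tuple

FINAL_STATES = {"Succeeded", "Failed"}


class FakeEngine:
    """Hypothetical stand-in for the process choreography engine."""

    def __init__(self) -> None:
        self._templates: Dict[str, dict] = {}
        self._polls: Dict[str, int] = {}

    def create_instance(self, template: dict) -> str:
        instance_id = f"inst-{len(self._templates) + 1}"
        self._templates[instance_id] = template
        self._polls[instance_id] = 0
        return instance_id

    def get_status(self, instance_id: str) -> str:
        # Pretend the instance finishes on the third status query.
        self._polls[instance_id] += 1
        return "Succeeded" if self._polls[instance_id] >= 3 else "Running"

    def get_outputs(self, instance_id: str) -> Dict[str, str]:
        nodes = self._templates[instance_id]["nodes"]
        return {node["name"]: "ok" for node in nodes}


def build_parent_template(job_name: str, item_templates: Dict[str, str]) -> dict:
    """Build one SubWorkflow node per inspection item, pointing at that item's template."""
    return {
        "name": f"{job_name}-parent",
        "nodes": [
            {"name": item, "type": "SubWorkflow", "template": template}
            for item, template in item_templates.items()
        ],
    }


def run_job(engine: FakeEngine, job_name: str, item_templates: Dict[str, str],
            poll_interval_s: float = 0.0) -> Tuple[str, Dict[str, str]]:
    """Create a process instance, poll it to a final state, then collect its outputs."""
    instance_id = engine.create_instance(build_parent_template(job_name, item_templates))
    status = engine.get_status(instance_id)
    while status not in FINAL_STATES:
        time.sleep(poll_interval_s)
        status = engine.get_status(instance_id)
    return status, engine.get_outputs(instance_id)


status, results = run_job(
    FakeEngine(), "nightly", {"disk-usage": "tpl-disk", "api-health": "tpl-api"}
)
print(status, results)
```

Rule evaluation and report sending would follow `run_job` in a fuller version; they are left out here to keep the sketch focused on the template-to-instance flow.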

In large-scale inspections, to keep the inspection platform stable and to prevent a large number of concurrent tasks from overwhelming the inspection targets, the platform provides a concurrency control strategy and a timeout control strategy. For concurrency control, there are three policies, and users can choose the one that fits their business scenario:

  • AllowConcurrent: Allow concurrency. If the previous inspection task has not finished when the next scheduled time arrives, the scheduler creates the new inspection task as usual, and the two tasks run at the same time;
  • ForbidConcurrent: Forbid concurrency. If the previous inspection task has not finished when the next scheduled time arrives, the scheduler skips creating a task for that time point and creates no new task until the previous one has finished;
  • ReplaceConcurrent: Replace. If the previous inspection task has not finished when the next scheduled time arrives, the scheduler stops the previous task and then creates a new inspection task.
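
The three policies can be expressed as a single decision made at each scheduled time point. The sketch below models only that decision; the function name and the representation of running tasks as a list of names are assumptions for illustration, while the policy names are taken from the list above.

```python
from enum import Enum
from typing import List


class ConcurrencyPolicy(Enum):
    ALLOW = "AllowConcurrent"
    FORBID = "ForbidConcurrent"
    REPLACE = "ReplaceConcurrent"


def on_schedule_tick(policy: ConcurrencyPolicy, running: List[str], new_task: str) -> List[str]:
    """Return the tasks that are running after a scheduled time point is reached."""
    if not running or policy is ConcurrencyPolicy.ALLOW:
        # Nothing is in the way, or overlap is allowed: start the new task.
        return running + [new_task]
    if policy is ConcurrencyPolicy.FORBID:
        # Skip this time point; the previous task keeps running.
        return list(running)
    if policy is ConcurrencyPolicy.REPLACE:
        # Stop the previous task(s) and start the new one.
        return [new_task]
    raise ValueError(f"unknown policy: {policy!r}")


for policy in ConcurrencyPolicy:
    print(policy.value, on_schedule_tick(policy, ["task-1"], "task-2"))
```

When no task is running, all three policies behave the same way and simply start the new task; they differ only when the previous task overlaps the next scheduled time.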

For timeout control, a timeout can be set for each inspection item. An inspection item that times out is automatically killed, and execution continues with the next inspection item.
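
A minimal sketch of this kill-and-continue behavior, using Python's standard `subprocess` module: each item runs as a child process with its own timeout, and a timed-out child is killed before the loop moves on. The item format (name, inline Python code, timeout) is an assumption for the example; the platform itself delegates timeout control to the process choreography engine.

```python
import subprocess
import sys
from typing import Dict, List, Tuple

# (item name, Python code to run as the "inspection script", timeout in seconds)
Item = Tuple[str, str, float]


def run_items(items: List[Item]) -> Dict[str, str]:
    """Run inspection items one after another, killing any item that times out."""
    results: Dict[str, str] = {}
    for name, code, timeout_s in items:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            results[name] = "timeout"
            continue  # move on to the next inspection item
        results[name] = proc.stdout.strip() if proc.returncode == 0 else "failed"
    return results


demo_results = run_items([
    ("slow-item", "import time; time.sleep(5)", 0.5),
    ("fast-item", "print('ok')", 10.0),
])
print(demo_results)
```

The slow item is cut off after its 0.5 second budget, and the fast item still runs, which is the property the timeout strategy is meant to guarantee.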

Execution layer

The execution layer consists of one or more executor machines. Each executor machine is deployed with the agent that the process engine's command channel depends on, together with a Python environment. The agent executes the inspection tasks issued by the logic layer & scheduling layer, and the data that a task exports through stdout or the common function set_output is collected by the inspection platform as its execution result. Inspection depends on the stability of the target service, and network jitter can cause an inspection to fail. To reduce the impact of such environmental factors, the execution layer includes a graceful retry mechanism: when an inspection fails, it is retried, 3 times by default, waiting a random time within a certain range before each retry.
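
The retry behavior can be sketched as below. This assumes that "3 retries" means up to three additional attempts after the first failure, and the 0.5 to 2.0 second wait range is an invented placeholder, since the article does not state the actual range. The `sleep` parameter is injectable only so the example can run without really waiting.

```python
import random
import time
from typing import Any, Callable, List, Tuple


def run_with_retry(check: Callable[[], Any],
                   retries: int = 3,
                   min_wait_s: float = 0.5,
                   max_wait_s: float = 2.0,
                   sleep: Callable[[float], Any] = time.sleep) -> Tuple[Any, int]:
    """Run check(); on failure wait a random time and retry, up to `retries` times.

    Returns (result, number of attempts). Re-raises the last error when all
    attempts fail.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return check(), attempts
        except Exception:
            if attempts > retries:
                raise
            # Random wait so that many failing checks do not retry in lockstep.
            sleep(random.uniform(min_wait_s, max_wait_s))


# Demo: a check that fails twice (simulated network jitter) and then succeeds.
calls = {"n": 0}


def flaky_check() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network jitter")
    return "ok"


waits: List[float] = []
result, attempts = run_with_retry(flaky_check, sleep=waits.append)
print(result, attempts, len(waits))
```

Randomizing the wait is a common way to avoid hammering a target that is already struggling; a fixed delay would make all failed inspections retry at the same moment.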

04 Results in practice

Built-in inspection items

The inspection platform has 400+ built-in inspection items covering core products such as compute, network, storage, and platform. Inspection types cover availability, reliability, performance, capacity watermark, and security. These built-in inspection items work out of the box: inspections can be started with one click, with no technical barrier.

Inspection results

The inspection platform has been running stably for more than half a year. It has onboarded 400+ inspection items, which cover most products of the VPC, has executed 30,000+ inspection tasks, and has found 200,000+ hidden system risks in total.

The trend of hidden system risks found by inspections over the last 7 days is as follows:

The distribution of inspection items in the inspection reports of the last 7 days is as follows:

At present, the inspection platform standardizes inspections through its inspection item development specification, and it makes the inspection process platform-based and automated by scheduling and executing inspections through the platform. However, the hidden risks found by inspections still have to be resolved manually, so inspection + remediation is not yet fully automated. The next step is to link up with intelligent diagnosis, explore big data analysis, and work with the O&M platform to achieve automated handling.

-END-
