[Log Service CLS] Troubleshooting abnormal scenarios with the TKE event log
2022-06-24 01:12:00 【Log service CLS assistant】
Author: v god
Introduction:
Cloud Log Service (CLS) is Tencent Cloud's one-stop platform for log data. It covers log collection, log storage, log retrieval, chart analysis, monitoring and alerting, and log delivery, and helps users handle business operations, service monitoring, log auditing, and similar scenarios through logs.
Tencent Kubernetes Engine (TKE) is a container-centric, highly scalable, high-performance container management service built on native Kubernetes. It lets you easily run applications on a cluster of managed cloud server instances. Tencent Cloud also offers Elastic Kubernetes Service (EKS) and Tencent Kubernetes Engine for Edge (TKE Edge) for you to choose from.
Conditions in a cluster change constantly: a node becomes abnormal, a Pod restarts, and so on. If you do not notice such changes right away, you miss the best time to deal with them, and by the time the problem has grown enough to affect the business, it is often too late.
The event log (Event) records comprehensive information about cluster state changes. It helps users spot problems as soon as they occur and is also a valuable aid for troubleshooting.
What is an event log
An event (Event) is one of the many resource objects in Kubernetes. It is typically used to record state changes in the cluster, from major ones such as a cluster node becoming abnormal to minor ones such as a Pod starting or being scheduled successfully. The commonly used kubectl describe command displays the event information of the related resource.
Event log field description
- Level (Type): Currently only "Normal" and "Warning" are used, but custom types can be used if needed.
- Resource type / object (Involved Object): The object the event concerns, for example a Pod, Deployment, or Node.
- Event source (Source): The component that reported the event, such as the Scheduler or Kubelet.
- Content (Reason): A brief description of the event, generally an enumerated value, used mainly inside programs.
- Detailed description (Message): A detailed description of the event.
- Number of occurrences (Count): The number of times the event has occurred.
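The fields above can be read directly from a raw event object. The following is a minimal Python sketch, assuming the event is available as a dict with the Kubernetes core/v1 JSON field names (type, involvedObject, source, reason, message, count), for example as printed by kubectl get events -o json. The EventSummary class, the summarize_event helper, and the sample values are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class EventSummary:
    """The six fields described above, taken from one Kubernetes Event."""
    level: str        # Type: "Normal" or "Warning"
    object_kind: str  # Involved Object kind, e.g. "Pod" or "Node"
    object_name: str  # Involved Object name
    source: str       # Source: reporting component, e.g. "kubelet"
    reason: str       # Reason: short, enumerated description
    message: str      # Message: detailed description
    count: int        # Count: number of occurrences


def summarize_event(raw: Dict[str, Any]) -> EventSummary:
    """Map a raw Event dict (core/v1 JSON field names) to an EventSummary."""
    involved = raw.get("involvedObject") or {}
    source = raw.get("source") or {}
    return EventSummary(
        level=raw.get("type", ""),
        object_kind=involved.get("kind", ""),
        object_name=involved.get("name", ""),
        source=source.get("component", ""),
        reason=raw.get("reason", ""),
        message=raw.get("message", ""),
        # Treat a missing or null count as a single occurrence (assumption).
        count=int(raw.get("count") or 1),
    )


# Hypothetical sample event; all values are made up for illustration.
sample_event = {
    "type": "Warning",
    "involvedObject": {"kind": "Pod", "name": "demo-pod"},
    "source": {"component": "kubelet", "host": "demo-node"},
    "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "count": 4,
}

summary = summarize_event(sample_event)
print(f"[{summary.level}] {summary.object_kind}/{summary.object_name} "
      f"{summary.reason} x{summary.count} (from {summary.source})")
```

Flattening the nested involvedObject and source objects this way makes it easier to filter or group events by level, object, or reporting component.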
How to use the event log to troubleshoot problems
CLS provides a one-stop service for Kubernetes event logs, covering collection, storage, retrieval, and analysis. Users only need to enable the cluster event log feature with one click to get an out-of-the-box dashboard for visual analysis of event logs. With these charts, users can resolve most common operations problems from the console. Let's look at how to use the event log to troubleshoot problems.
Prerequisite: you have purchased the TKE container service and enabled the cluster event log. For details, refer to the Operation guide.
Scenario 1: A node is abnormal and the cause needs to be located
Go to the TKE container service console and click 【Cluster operation and maintenance】>【Event Retrieval】 in the menu on the left. On the 【Event Retrieval】 page, click 【Event overview】 and enter the name of the abnormal node in the filter.
The query results contain an event record stating that the node is out of disk space, as shown in the figure below:
Take a closer look at the trend of abnormal events :
You can see that, starting on 2020-11-25, node 172.16.18.13 became abnormal because of insufficient disk space, after which kubelet began trying to evict pods on the node to reclaim disk space.
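The same kind of investigation can be sketched offline in Python. The example below assumes the events have been exported as a list of dicts with the Kubernetes core/v1 field names, and that kubelet events carry the node name in source.host. The node_warning_summary helper, the sample reasons, and the timestamps are hypothetical illustrations, not the data behind the console charts.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def node_warning_summary(
    events: List[Dict[str, Any]], node_name: str
) -> Tuple[Counter, str]:
    """Count Warning events tied to a node and return the earliest timestamp."""
    related = [
        e for e in events
        if e.get("type") == "Warning"
        and (
            # The event is about the node itself ...
            (e.get("involvedObject") or {}).get("name") == node_name
            # ... or it was reported from that node (e.g. Pod evictions).
            or (e.get("source") or {}).get("host") == node_name
        )
    ]
    reasons = Counter(e.get("reason", "") for e in related)
    # ISO 8601 timestamps in the same format and zone sort lexicographically.
    first_seen = min(
        (e["firstTimestamp"] for e in related if e.get("firstTimestamp")),
        default="",
    )
    return reasons, first_seen


# Hypothetical sample data; reasons and times are made up for illustration.
sample_events = [
    {"type": "Warning", "reason": "EvictionThresholdMet",
     "involvedObject": {"kind": "Node", "name": "172.16.18.13"},
     "source": {"component": "kubelet", "host": "172.16.18.13"},
     "firstTimestamp": "2020-11-25T08:00:00Z"},
    {"type": "Warning", "reason": "Evicted",
     "involvedObject": {"kind": "Pod", "name": "demo-pod-a"},
     "source": {"component": "kubelet", "host": "172.16.18.13"},
     "firstTimestamp": "2020-11-25T08:01:00Z"},
    {"type": "Normal", "reason": "Started",
     "involvedObject": {"kind": "Pod", "name": "demo-pod-b"},
     "source": {"component": "kubelet", "host": "172.16.18.13"},
     "firstTimestamp": "2020-11-25T07:00:00Z"},
    {"type": "Warning", "reason": "Evicted",
     "involvedObject": {"kind": "Pod", "name": "demo-pod-c"},
     "source": {"component": "kubelet", "host": "172.16.18.99"},
     "firstTimestamp": "2020-11-24T09:00:00Z"},
]

reasons, first_seen = node_warning_summary(sample_events, "172.16.18.13")
print(dict(reasons), first_seen)
```

Matching on both the involved object and the reporting host keeps node-level warnings and the Pod evictions that follow them in one summary, which mirrors the sequence described above.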
Scenario 2: A node scale-out was triggered and the user needs to trace the scale-out process to determine the specific cause
In a cluster with node pool auto scaling enabled, the CA (cluster-autoscaler) component automatically increases or decreases the number of nodes in the cluster according to the load. If nodes in the cluster have been scaled out (or in) automatically, users can trace the whole scale-out (or scale-in) process through event retrieval.
On the 【Event Retrieval】 page, click 【Global search】 and enter the following search statement:
event.source.component : "cluster-autoscaler"
From the hidden fields on the left, select event.reason, event.message, and event.involvedObject.name for display, and sort the query results by log time in descending order. The results are shown in the figure below:
From the event stream above, you can see that the node scale-out took place at about 2020-11-25 20:35:45 and was triggered by three nginx Pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, nginx-5dbf784b68-v9jv5). Three nodes were added in the end. Subsequent scale-outs were not triggered because the node pool had reached its maximum number of nodes.
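The search and display steps above can be approximated in Python on exported events. This is a minimal sketch, assuming events are dicts with the Kubernetes core/v1 field names. The autoscaler_timeline helper is hypothetical, and the reasons, message texts, and all timestamps other than the 20:35:45 trigger time are made up for illustration.

```python
from typing import Any, Dict, List


def autoscaler_timeline(
    events: List[Dict[str, Any]], component: str = "cluster-autoscaler"
) -> List[Dict[str, str]]:
    """Keep events from one source component, newest first, with key fields."""
    matched = [
        e for e in events
        if (e.get("source") or {}).get("component") == component
    ]
    # Descending by time, like sorting by log time in reverse order.
    matched.sort(key=lambda e: e.get("lastTimestamp", ""), reverse=True)
    return [
        {
            "time": e.get("lastTimestamp", ""),
            "reason": e.get("reason", ""),
            "message": e.get("message", ""),
            "object": (e.get("involvedObject") or {}).get("name", ""),
        }
        for e in matched
    ]


def _event(component: str, reason: str, pod: str, time: str) -> Dict[str, Any]:
    """Build a hypothetical sample event (illustration only)."""
    return {
        "source": {"component": component},
        "reason": reason,
        "message": f"{reason} (illustrative message)",
        "involvedObject": {"kind": "Pod", "name": pod},
        "lastTimestamp": time,
    }


# Pod names come from the article; everything else is sample data.
sample_events = [
    _event("cluster-autoscaler", "TriggeredScaleUp",
           "nginx-5dbf784b68-tq8rd", "2020-11-25T20:35:45Z"),
    _event("cluster-autoscaler", "TriggeredScaleUp",
           "nginx-5dbf784b68-fpvbx", "2020-11-25T20:35:45Z"),
    _event("cluster-autoscaler", "TriggeredScaleUp",
           "nginx-5dbf784b68-v9jv5", "2020-11-25T20:35:45Z"),
    _event("cluster-autoscaler", "NotTriggerScaleUp",
           "nginx-demo-pending", "2020-11-25T20:40:00Z"),
    _event("kubelet", "Started", "nginx-demo-running", "2020-11-25T20:41:00Z"),
]

timeline = autoscaler_timeline(sample_events)
for row in timeline:
    print(row["time"], row["reason"], row["object"])
```

Reading the timeline from newest to oldest shows the most recent autoscaler decision first, followed by the Pods that triggered the earlier scale-out.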
That concludes this installment's walkthrough of TKE event log usage. If you have more interesting logging practices, you are welcome to contribute!
Previous articles:
【Log Service CLS】CentOS access notes
【Log Service CLS】Practice sharing: connecting Application Workflow ASW to CLS
【Log Service CLS】Best practices for Tencent Cloud Log4j/Logback log collection
【Log Service CLS】Connecting Nginx access logs to Tencent Cloud Log Service
【Log Service CLS】First look at Tencent CLS high-speed retrieval and the Nginx early-warning service