
[Log Service CLS] Troubleshooting abnormal scenarios with TKE event logs

2022-06-24 01:12:00 【Log service CLS assistant】

Author: v god

Introduction:
Cloud Log Service (CLS) is a one-stop log data solution platform provided by Tencent Cloud. It covers log collection, log storage, log search, chart analysis, monitoring and alerting, log shipping, and other services, helping users handle business operations, service monitoring, log auditing, and similar scenarios through logs.
Tencent Kubernetes Engine (TKE) is a container-centric, highly scalable, high-performance container management service built on native Kubernetes. It lets you easily run applications on a cluster of managed cloud server instances. Tencent Cloud also provides Elastic Kubernetes Service (EKS) and Tencent Kubernetes Engine for Edge (TKE Edge) for you to choose from.

The situations that arise in a cluster are endless and constantly changing, for example a node entering an abnormal state or a Pod restarting. If you cannot notice them right away, you miss the best window for handling the problem, and by the time the problem has grown enough to affect the business it is often too late.

Event logs (Event) record comprehensive information about cluster state changes. They not only help users discover problems promptly but are also a great help in troubleshooting.

What is an event log

An event (Event) is one of the many resource objects in Kubernetes. It is usually used to record state changes in the cluster, ranging from large ones such as a cluster node becoming abnormal to small ones such as a Pod starting or being scheduled successfully. The commonly used kubectl describe command shows the events associated with a resource.

Event log field description

Event log fields

Level (Type): Currently only "Normal" and "Warning" are used, but custom types can be used if needed.
Resource type / object (Involved Object): The object the event concerns, for example Pod, Deployment, or Node.
Event source (Source): The component that reported the event, such as Scheduler or Kubelet.
Content (Reason): A brief description of the event, generally an enumerated value, mainly used inside programs.
Detailed description (Message): A detailed description of the event.
Number of occurrences (Count): The number of times the event has occurred.
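
To make the field list concrete, the following minimal Python sketch pulls these six fields out of one event in the JSON shape of a Kubernetes Event object. The sample event and the summarize_event helper are hypothetical and only for illustration; they are not part of CLS or TKE.

```python
import json

# Hypothetical sample shaped like a Kubernetes Event object; all values are made up.
SAMPLE_EVENT = json.loads("""
{
  "type": "Warning",
  "involvedObject": {"kind": "Node", "name": "node-a"},
  "source": {"component": "kubelet"},
  "reason": "EvictionThresholdMet",
  "message": "Attempting to reclaim ephemeral-storage",
  "count": 3
}
""")


def summarize_event(event):
    """Return the six fields described above as a flat dict (hypothetical helper)."""
    involved = event.get("involvedObject", {})
    return {
        "level": event.get("type"),
        "object": f'{involved.get("kind")}/{involved.get("name")}',
        "source": event.get("source", {}).get("component"),
        "reason": event.get("reason"),
        "message": event.get("message"),
        # Treat a missing count as a single occurrence (an assumption of this sketch).
        "count": event.get("count", 1),
    }


if __name__ == "__main__":
    for key, value in summarize_event(SAMPLE_EVENT).items():
        print(f"{key}: {value}")
```

The same helper could be applied to each item returned by kubectl get events -o json, whose output wraps the events in an items list.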

How to use the event log to troubleshoot problems

CLS provides a one-stop service for Kubernetes event logs, covering collection, storage, search, and analysis. Users only need to enable the cluster event log feature with one click to get an out-of-the-box visual analysis dashboard for event logs. With the visual charts, users can resolve most common O&M problems from the console. Let's look at how to use event logs to troubleshoot problems.

Prerequisite: You have purchased the TKE container service and enabled cluster event logging. For details, see the Operation guide.

Scenario 1: A node becomes abnormal and the cause needs to be located

Open the TKE container service console and, in the menu on the left, click 【Cluster operation and maintenance】>【Event Retrieval】. On the 【Event Retrieval】 page, click 【Event overview】 and enter the name of the abnormal node in the filter.

Locate the exception node

The query results show an event record stating that the node is out of disk space, as shown in the figure below:

Event result query record

Take a closer look at the trend of abnormal events ：

Abnormal event trend

You can see that, starting on 2020-11-25, node 172.16.18.13 became abnormal because of insufficient disk space, after which kubelet began trying to evict Pods on the node to reclaim disk space.
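
The trend view above amounts to counting abnormal events per day for a single node. As a rough sketch of that aggregation, assuming the events have already been exported as simple records: the record layout, node names, and sample values below are hypothetical and are not the data behind the figures.

```python
from collections import Counter

# Hypothetical exported event records; node names, times, and reasons are illustrative.
EVENTS = [
    {"time": "2020-11-25T08:10:00Z", "node": "node-a", "type": "Warning", "reason": "FreeDiskSpaceFailed"},
    {"time": "2020-11-25T09:30:00Z", "node": "node-a", "type": "Warning", "reason": "FreeDiskSpaceFailed"},
    {"time": "2020-11-26T02:00:00Z", "node": "node-a", "type": "Warning", "reason": "EvictionThresholdMet"},
    {"time": "2020-11-26T02:05:00Z", "node": "node-a", "type": "Normal", "reason": "NodeReady"},
    {"time": "2020-11-25T10:00:00Z", "node": "node-b", "type": "Warning", "reason": "FreeDiskSpaceFailed"},
]


def warning_trend(events, node):
    """Count Warning events per (day, reason) for one node, sorted by day then reason."""
    trend = Counter()
    for event in events:
        if event["node"] == node and event["type"] == "Warning":
            day = event["time"][:10]  # keep only the YYYY-MM-DD part of the timestamp
            trend[(day, event["reason"])] += 1
    return sorted(trend.items())


if __name__ == "__main__":
    for (day, reason), count in warning_trend(EVENTS, "node-a"):
        print(day, reason, count)
```

Reading the counts day by day shows when a given kind of abnormal event first appears and what follows it, which is the same reasoning applied to the chart.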

Scenario 2: A node scale-out was triggered and the user needs to trace the scaling process to determine the specific cause

In a cluster with node pool "elastic scaling" enabled, the CA (cluster-autoscaler) component automatically increases or decreases the number of nodes in the cluster according to load. If nodes in the cluster are automatically scaled out (or in), users can trace the entire scale-out (or scale-in) process through event search.

On the 【Event Retrieval】 page, click 【Global search】 and enter the following search statement:

event.source.component : "cluster-autoscaler"

In the hidden fields list on the left, select event.reason, event.message, and event.involvedObject.name for display, and sort the query results by log time in descending order. The results are shown in the following figure:

Log query results
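
For readers who want to reproduce the same view outside the console, the search above can be approximated in a few lines of Python: keep only events whose source component is cluster-autoscaler, sort them newest first, and project the displayed fields. This is only a sketch over hypothetical sample events; it does not call any CLS API, and the pod names and timestamps are made up.

```python
# Hypothetical events shaped like Kubernetes Event objects; all values are illustrative.
EVENTS = [
    {
        "lastTimestamp": "2024-01-01T10:00:00Z",
        "source": {"component": "cluster-autoscaler"},
        "reason": "TriggeredScaleUp",
        "message": "pod triggered scale-up",
        "involvedObject": {"kind": "Pod", "name": "web-1"},
    },
    {
        "lastTimestamp": "2024-01-01T10:01:00Z",
        "source": {"component": "kubelet"},
        "reason": "Started",
        "message": "Started container web",
        "involvedObject": {"kind": "Pod", "name": "web-1"},
    },
    {
        "lastTimestamp": "2024-01-01T10:05:00Z",
        "source": {"component": "cluster-autoscaler"},
        "reason": "NotTriggerScaleUp",
        "message": "pod didn't trigger scale-up",
        "involvedObject": {"kind": "Pod", "name": "web-2"},
    },
]


def autoscaler_events(events):
    """Keep cluster-autoscaler events, newest first, reduced to the displayed fields."""
    selected = [
        e for e in events
        if e.get("source", {}).get("component") == "cluster-autoscaler"
    ]
    # ISO 8601 UTC timestamps in the same format sort correctly as plain strings.
    selected.sort(key=lambda e: e["lastTimestamp"], reverse=True)
    return [
        {
            "time": e["lastTimestamp"],
            "reason": e["reason"],
            "message": e["message"],
            "object": e["involvedObject"]["name"],
        }
        for e in selected
    ]


if __name__ == "__main__":
    for row in autoscaler_events(EVENTS):
        print(row["time"], row["reason"], row["object"], row["message"])
```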

From the event stream above, you can see that the node scale-out took place at around 2020-11-25 20:35:45 and was triggered by three nginx Pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, nginx-5dbf784b68-v9jv5). It eventually added 3 nodes. No further scale-out was triggered afterwards because the node pool had reached its maximum number of nodes.

That concludes this issue's walkthrough of using TKE event logs. If you have more interesting logging practices, contributions are welcome!


Previous articles:

【Log Service CLS】CentOS access notes

【Log Service CLS】Practice sharing: connecting Application Workflow ASW to CLS

【Log Service CLS】Tencent Cloud Log4j/Logback log collection best practices

【Log Service CLS】Sending Nginx access logs to Tencent Cloud Log Service

【Log Service CLS】A first look at Tencent CLS high-speed search and Nginx early-warning alerts
