[Log Service CLS] Troubleshooting abnormal scenarios with TKE event logs

2022-06-24 01:12:00 Log service CLS assistant

Author: v god

Introduction:

Cloud Log Service (CLS) is a one-stop log data solution platform provided by Tencent Cloud. It covers log collection, log storage, log retrieval, chart analysis, monitoring and alarms, and log delivery, helping users handle business operations and maintenance, service monitoring, log auditing, and similar scenarios through logs.

Tencent Kubernetes Engine (TKE) is a container-centered, highly scalable, high-performance container management service based on native Kubernetes. It lets you easily run applications on a cluster of managed cloud server instances. Tencent Cloud also offers Elastic Kubernetes Service (EKS) and Tencent Kubernetes Engine for Edge (TKE Edge) for you to choose from.

Conditions in a cluster change constantly: a node's status becomes abnormal, a Pod restarts, and so on. If you cannot detect such situations right away, you miss the best time to handle them, and by the time the problem has grown enough to affect the business it is often too late.

The event log (Event) records comprehensive information about cluster state changes. It not only helps users spot problems right away but is also the best helper for troubleshooting.

What is an event log

An event (Event) is one of the many resource objects in Kubernetes. It is usually used to record state changes in the cluster, from large ones such as a cluster node exception to small ones such as a Pod starting or being scheduled successfully. The commonly used kubectl describe command shows the events of the related resource.

Event log field description

Event log fields
  • Level (Type): Currently only "Normal" and "Warning", although custom types can be used if needed.
  • Resource type / object (Involved Object): The object the event concerns, for example Pod, Deployment, or Node.
  • Event source (Source): The component that reported the event, such as Scheduler or Kubelet.
  • Content (Reason): A brief description of the event. It is generally an enumeration value and is mainly used inside programs.
  • Detailed description (Message): A detailed description of the event.
  • Number of occurrences (Count): The number of times the event has occurred.
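
To make the field list concrete, here is a minimal Python sketch that models one event record and picks out the Warning events. The class name EventRecord, the helper warnings_only, and all sample values are illustrative assumptions and are not part of CLS or TKE. Only the field meanings come from the list above.

```python
from dataclasses import dataclass


@dataclass
class EventRecord:
    """Illustrative model of the fields described above (hypothetical class)."""
    type: str             # Level: "Normal" or "Warning"
    involved_object: str  # Resource type / object, e.g. "Pod/web-1"
    source: str           # Component that reported the event, e.g. "kubelet"
    reason: str           # Short, enumeration-like description
    message: str          # Detailed description
    count: int            # Number of occurrences


def warnings_only(events):
    """Return only the events whose level is "Warning"."""
    return [e for e in events if e.type == "Warning"]


# Made-up sample data, for illustration only.
sample_events = [
    EventRecord("Normal", "Pod/web-1", "default-scheduler", "Scheduled",
                "Successfully assigned default/web-1 to node-a", 1),
    EventRecord("Warning", "Node/node-a", "kubelet", "EvictionThresholdMet",
                "Attempting to reclaim ephemeral-storage", 12),
]

for event in warnings_only(sample_events):
    print(event.involved_object, event.reason, event.count)
```

Filtering on the level first is usually a reasonable starting point, since Warning events are the ones most likely to need attention.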

How to use the event log to troubleshoot problems

CLS provides a one-stop service for Kubernetes event logs, covering collection, storage, retrieval, and analysis. Users only need to enable the cluster event log feature with one click to get an out-of-the-box visual analysis dashboard for event logs. With the visual charts, users can easily solve most common O&M problems from the console. Let's look at how to use the event log to troubleshoot problems.

Prerequisite: you have purchased the TKE container service and enabled the cluster event log. For details, see the Operation guide.

Scenario 1: A node is abnormal and the cause needs to be located

Go to the TKE container service console and click 【Cluster operation and maintenance】>【Event Retrieval】 in the left menu. On the 【Event Retrieval】 page, click 【Event overview】 and enter the name of the abnormal node in the filter.

Locating the abnormal node

The query results include an event record indicating that the node is out of disk space, as shown in the figure below:

Event result query record

Take a closer look at the trend of abnormal events :

Abnormal event trend

As you can see, starting on 2020-11-25, node 172.16.18.13 became abnormal because of insufficient disk space, after which kubelet began trying to evict Pods on the node to reclaim disk space.
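
The trend view used in this scenario amounts to counting a node's events per day and per reason. The Python sketch below shows that idea on exported event data. The tuple layout, the function name daily_trend, the reason strings, and every timestamp are assumptions made up for illustration and are not taken from the console output.

```python
from collections import Counter

# Hypothetical exported events: (timestamp, node, reason). All values are made up.
events = [
    ("2020-11-25 09:10:00", "172.16.18.13", "NodeHasDiskPressure"),
    ("2020-11-25 09:12:30", "172.16.18.13", "EvictionThresholdMet"),
    ("2020-11-26 03:00:05", "172.16.18.13", "EvictionThresholdMet"),
    ("2020-11-26 03:01:00", "172.16.18.20", "NodeReady"),
]


def daily_trend(events, node):
    """Count events per (day, reason) for a single node."""
    trend = Counter()
    for timestamp, event_node, reason in events:
        if event_node == node:
            day = timestamp.split(" ")[0]  # keep only the date part
            trend[(day, reason)] += 1
    return trend


trend = daily_trend(events, "172.16.18.13")
for (day, reason), n in sorted(trend.items()):
    print(day, reason, n)
```

Grouping by day and reason makes it easier to see when a given kind of event first appeared and whether it keeps recurring.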

Scenario 2: A node scale-out was triggered and the user needs to trace the scale-out process to determine the specific cause

In clusters where the node pool's 「auto scaling」 is enabled, the CA (cluster-autoscaler) component automatically increases or decreases the number of nodes in the cluster according to load. If nodes in the cluster are automatically scaled out (or in), users can trace the whole scale-out (or scale-in) process through event retrieval.

On the 【Event Retrieval】 page, click 【Global search】 and enter the following search statement:

event.source.component : "cluster-autoscaler"

In the hidden fields on the left, select event.reason, event.message, and event.involvedObject.name for display, and sort the query results by log time in descending order. The results are shown in the following figure:

Log query results

The event flow above shows that the node scale-out took place at around 2020-11-25 20:35:45 and was triggered by three nginx Pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, nginx-5dbf784b68-v9jv5). Three nodes were added in the end. No further scale-out was triggered afterwards because the node pool had reached its maximum number of nodes.

That concludes this issue's walkthrough of TKE event logs. If you have more interesting logging practices, contributions are welcome!
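
The same retrieval can be reproduced offline on exported records: keep the events whose source component is cluster-autoscaler, sort them by time in descending order, and show only the selected fields. The Python sketch below assumes a nested dictionary layout that mirrors the field names in the search statement. The function name autoscaler_events, the "time" key, the reason and message values, and the timestamps are hypothetical.

```python
# Hypothetical records shaped like the fields used in the search above.
records = [
    {"time": "2020-11-25 20:35:45",
     "event": {"source": {"component": "cluster-autoscaler"},
               "reason": "TriggeredScaleUp",
               "message": "placeholder message",
               "involvedObject": {"name": "nginx-5dbf784b68-tq8rd"}}},
    {"time": "2020-11-25 20:36:10",
     "event": {"source": {"component": "cluster-autoscaler"},
               "reason": "TriggeredScaleUp",
               "message": "placeholder message",
               "involvedObject": {"name": "nginx-5dbf784b68-fpvbx"}}},
    {"time": "2020-11-25 20:36:00",
     "event": {"source": {"component": "kubelet"},
               "reason": "Pulled",
               "message": "placeholder message",
               "involvedObject": {"name": "nginx-5dbf784b68-v9jv5"}}},
]


def autoscaler_events(records):
    """Keep cluster-autoscaler events and sort them newest first."""
    matched = [r for r in records
               if r["event"]["source"]["component"] == "cluster-autoscaler"]
    # The timestamps share one fixed format, so string order equals time order.
    return sorted(matched, key=lambda r: r["time"], reverse=True)


for r in autoscaler_events(records):
    e = r["event"]
    print(r["time"], e["reason"], e["involvedObject"]["name"], e["message"])
```

Reading the filtered events from newest to oldest matches the reverse time order used in the console view.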

One stop log data solution platform

Previous articles:

[Log Service CLS] CentOS access notes

[Log Service CLS] Sharing the practice of connecting Application Workflow ASW to CLS

[Log Service CLS] Tencent Cloud Log4j/Logback log collection best practices

[Log Service CLS] Connecting Nginx access logs to Tencent Cloud Log Service

[Log Service CLS] First look at Tencent CLS high-speed retrieval and the Nginx pre-alarm service
