[Log Service CLS] Troubleshooting abnormal scenarios with TKE event logs
2022-06-24 01:12:00 【Log service CLS assistant】
Author: v god
Introduction:
Cloud Log Service (CLS) is Tencent Cloud's one-stop platform for log data. It covers log collection, log storage, log search, chart analysis, monitoring and alerting, and log shipping, and helps users handle business operations, service monitoring, log auditing, and similar scenarios through logs.
Tencent Kubernetes Engine (TKE) is a container-centric, highly scalable, high-performance container management service built on native Kubernetes. It lets you run applications on a cluster of managed cloud server instances. Tencent Cloud also offers Elastic Kubernetes Service (EKS) and Tencent Kubernetes Engine for Edge (TKE Edge), so you can choose whichever fits.
Conditions in a cluster change constantly: a node's status becomes abnormal, a Pod restarts, and so on. If you do not notice such changes right away, you miss the best window for handling them, and by the time the problem has grown enough to affect the business it is often too late.
Event logs (Event) record cluster state changes comprehensively. They help users spot problems early and are also a valuable aid in troubleshooting.
What is an event log
An event (Event) is one of the many resource objects in Kubernetes. It is usually used to record state changes in the cluster, from large ones such as a cluster node becoming abnormal to small ones such as a Pod starting or being scheduled successfully. The kubectl describe command that we use so often shows the events of the related resource.
Event log field description
- Level (Type): currently only "Normal" and "Warning", although custom types can be used if needed.
- Resource type / object (Involved Object): the object the event concerns, for example a Pod, Deployment, or Node.
- Event source (Source): the component that reported the event, such as Scheduler or Kubelet.
- Content (Reason): a short description of the event, usually an enumerated value intended mainly for use inside programs.
- Detailed description (Message): a detailed description of the event.
- Number of occurrences (Count): how many times the event has occurred.
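To make the field list concrete, here is a minimal Python sketch that reads one event record and pulls out the six fields described above. It assumes the record has the JSON shape of a Kubernetes Event object (`type`, `involvedObject`, `source.component`, `reason`, `message`, `count`). The sample values and the `EventSummary` and `summarize` names are made up for illustration and are not part of CLS or TKE.

```python
import json
from dataclasses import dataclass

# Illustrative sample only: the values are invented, the keys follow the
# JSON layout of a Kubernetes Event object.
SAMPLE_EVENT = """
{
  "type": "Warning",
  "involvedObject": {"kind": "Node", "name": "172.16.18.13"},
  "source": {"component": "kubelet"},
  "reason": "EvictionThresholdMet",
  "message": "Attempting to reclaim ephemeral-storage",
  "count": 3
}
"""


@dataclass
class EventSummary:
    """Hypothetical container for the fields discussed in this section."""
    level: str          # Type: "Normal" or "Warning"
    object_kind: str    # Involved Object kind, e.g. Pod, Deployment, Node
    object_name: str    # Involved Object name
    source: str         # Component that reported the event
    reason: str         # Short, enum-like description
    message: str        # Detailed description
    count: int          # Number of occurrences


def summarize(raw: str) -> EventSummary:
    """Parse one JSON event record into an EventSummary."""
    event = json.loads(raw)
    involved = event.get("involvedObject", {})
    return EventSummary(
        level=event.get("type", ""),
        object_kind=involved.get("kind", ""),
        object_name=involved.get("name", ""),
        source=event.get("source", {}).get("component", ""),
        reason=event.get("reason", ""),
        message=event.get("message", ""),
        count=event.get("count", 1),
    )


summary = summarize(SAMPLE_EVENT)
print(summary)
```

The same function could be applied to each item of a JSON event list, for example the kind that `kubectl get events -o json` returns under `items`.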
How to use the event log to troubleshoot problems
Log service CLS provides a one-stop service for Kubernetes event logs, covering collection, storage, search, and analysis. Users only need to enable the cluster event log feature with one click to get an out-of-the-box dashboard for visual analysis of event logs. With these charts, users can solve most common O&M problems directly from the console. Let's look at how to use event logs to troubleshoot problems.
Prerequisite: you have purchased the TKE container service and enabled the cluster event log. For details, see the operation guide.
Scenario 1: a node is abnormal, and the cause needs to be located
Open the TKE container service console and choose 【Cluster Operation and Maintenance】>【Event Retrieval】 in the left menu. On the 【Event Retrieval】 page, click 【Event Overview】 and enter the name of the abnormal node in the filter.
The query results contain an event record indicating that the node is out of disk space, as shown in the figure below:
Take a closer look at the trend of abnormal events:
Starting on 2020-11-25, node 172.16.18.13 became abnormal because of insufficient disk space, after which kubelet began trying to evict Pods on the node to reclaim disk space.
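The trend view in this scenario amounts to filtering events by node and counting them over time. The sketch below reproduces that idea offline in Python. It is not how the CLS dashboard is implemented. The flattened record layout (`type`, `reason`, `node`, `time`), the sample events, and the `warning_trend` function are all assumptions made for illustration.

```python
from collections import Counter

# Invented sample events in a hypothetical flattened layout.
EVENTS = [
    {"type": "Normal", "reason": "NodeReady",
     "node": "172.16.18.13", "time": "2020-11-24T08:00:00Z"},
    {"type": "Warning", "reason": "EvictionThresholdMet",
     "node": "172.16.18.13", "time": "2020-11-25T09:11:00Z"},
    {"type": "Warning", "reason": "EvictionThresholdMet",
     "node": "172.16.18.13", "time": "2020-11-25T09:21:00Z"},
    {"type": "Warning", "reason": "Evicted",
     "node": "172.16.18.13", "time": "2020-11-25T09:22:00Z"},
    {"type": "Warning", "reason": "EvictionThresholdMet",
     "node": "172.16.18.14", "time": "2020-11-25T09:30:00Z"},
]


def warning_trend(events, node_name):
    """Count Warning events per (day, reason) for a single node."""
    trend = Counter()
    for event in events:
        if event["type"] != "Warning" or event["node"] != node_name:
            continue
        day = event["time"][:10]  # "YYYY-MM-DD" prefix of the ISO timestamp
        trend[(day, event["reason"])] += 1
    return trend


trend = warning_trend(EVENTS, "172.16.18.13")
for (day, reason), n in sorted(trend.items()):
    print(day, reason, n)
```

Grouping by day and reason makes the onset date stand out: the first day on which Warning counts appear for the node is where the investigation should start.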
Scenario 2: a node scale-out was triggered, and the user needs to trace the scale-out process to determine the specific cause
In a cluster with node pool 「auto scaling」 enabled, the CA (cluster-autoscaler) component automatically increases or decreases the number of nodes in the cluster according to load. If the nodes in the cluster were scaled out (or in) automatically, users can trace the whole scaling process through event retrieval.
On the 【Event Retrieval】 page, click 【Global Search】 and enter the following search statement:
event.source.component : "cluster-autoscaler"
In the hidden fields on the left, select event.reason, event.message, and event.involvedObject.name for display, and sort the query results by log time in descending order. The results are shown in the figure below:
The event stream above shows that the scale-out was triggered at about 2020-11-25 20:35:45 by three nginx Pods (nginx-5dbf784b68-tq8rd, nginx-5dbf784b68-fpvbx, nginx-5dbf784b68-v9jv5) and ultimately added 3 nodes. Later scale-outs were not triggered because the node pool had reached its maximum number of nodes.
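The retrieval steps in this scenario (filter on the event source, show a few fields, sort by time in descending order) can be sketched in Python against a list of exported events. This is only an illustration of the logic, not the CLS query engine. The sample records, their messages and timestamps, and the `autoscaler_timeline` function are assumptions.

```python
# Invented sample events; keys follow the Kubernetes Event JSON layout,
# plus a hypothetical "time" field holding an ISO 8601 timestamp.
EVENTS = [
    {"time": "2020-11-25T20:35:45Z",
     "source": {"component": "cluster-autoscaler"},
     "reason": "TriggeredScaleUp",
     "message": "illustrative: pod triggered scale-up",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-tq8rd"}},
    {"time": "2020-11-25T20:35:46Z",
     "source": {"component": "cluster-autoscaler"},
     "reason": "TriggeredScaleUp",
     "message": "illustrative: pod triggered scale-up",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-fpvbx"}},
    {"time": "2020-11-25T20:40:00Z",
     "source": {"component": "kubelet"},
     "reason": "Started",
     "message": "illustrative: container started",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-v9jv5"}},
    {"time": "2020-11-25T20:50:10Z",
     "source": {"component": "cluster-autoscaler"},
     "reason": "NotTriggerScaleUp",
     "message": "illustrative: node pool already at maximum size",
     "involvedObject": {"kind": "Pod", "name": "nginx-5dbf784b68-abcde"}},
]


def autoscaler_timeline(events):
    """Keep cluster-autoscaler events, newest first, with selected fields."""
    selected = [
        e for e in events
        if e.get("source", {}).get("component") == "cluster-autoscaler"
    ]
    # Timestamps share one ISO 8601 format, so string order is time order.
    selected.sort(key=lambda e: e["time"], reverse=True)
    return [
        (e["time"], e["reason"], e["involvedObject"]["name"], e["message"])
        for e in selected
    ]


rows = autoscaler_timeline(EVENTS)
for row in rows:
    print(*row, sep=" | ")
```

Reading the rows from bottom to top gives the scaling story in chronological order: which Pods triggered the scale-out first, and when later requests stopped being honored.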
That concludes this issue's hands-on analysis of TKE event logs. If you have more interesting logging practices, contributions are welcome!
Previous articles:
【Log service CLS】 CentOS access notes
【Log service CLS】 Application workflow ASW access to CLS: practice sharing
【Log service CLS】 Tencent Cloud Log4j/Logback log collection best practices
【Log service CLS】 Connecting Nginx access logs to Tencent Cloud log service
【Log service CLS】 First look at Tencent CLS high-speed retrieval and the Nginx pre-alarm service