A case study of apiserver avalanche caused by serviceaccount
2022-06-24 07:09:00 【shawwang】
Background
A business team ran a Kubernetes 1.12 cluster with thousands of nodes. One day, after a burst of requests hit the master, the cluster suddenly became unavailable: a large number of components timed out when accessing the master, and restarting the master components did not restore service.
Troubleshooting
We first checked the kube-apiserver logs and found a large number of entries for TokenReview creation requests, all of which timed out after 30s. We suspected that these requests were exhausting kube-apiserver's in-flight request limit and thereby affecting other, normal requests.
The kube-apiserver monitoring confirmed this: the metric apiserver_current_inflight_requests{requestKind="mutating"} had reached the max-mutating-requests-inflight value configured by the business, so the limit was being triggered.
Our first guess was that hitting the mutating limit had caused a large number of client retries, which in turn triggered the avalanche. We had seen similar problems before, and raising the limit had been enough to recover. So we raised max-mutating-requests-inflight first and observed. After a while the business was still seeing timeouts, and the logs still showed many Create TokenReview timeouts. It seemed that the only way to find the root cause was to find out where the TokenReview requests were coming from.
TokenReview is a virtual resource that is created only when a token-related authentication request is made, so the urgent task was to find out which component was requesting authentication so frequently. The existing logs offered no further useful information, so to find the source we turned to kube-apiserver's ultimate tool for inspecting requests: audit logging. After configuring the relevant audit rules, we could easily count the sources and details of the requests.
After a period of observation, we found that the TokenReview requests came almost entirely from kubelets, and that they were fairly evenly distributed with no obvious concentration. They looked like normal requests, which ruled out a burst of requests from a single node.
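The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the script used in the original investigation. It assumes the audit log is written as one JSON event per line and relies on the stage, verb, objectRef.resource, user.username and sourceIPs fields of a Kubernetes audit event. The helper names and the sample node names and IPs are hypothetical.

```python
import json
from collections import Counter


def count_tokenreview_sources(lines):
    """Count completed create-TokenReview audit events per (username, source IP)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        # One request can emit several events (one per stage); count only the last.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if event.get("verb") != "create" or ref.get("resource") != "tokenreviews":
            continue
        user = (event.get("user") or {}).get("username", "<unknown>")
        ips = event.get("sourceIPs") or ["<unknown>"]
        counts[(user, ips[0])] += 1
    return counts


def make_event(stage, verb, resource, username, ip):
    """Build a hypothetical audit log line for the demo below."""
    return json.dumps({
        "stage": stage,
        "verb": verb,
        "objectRef": {"resource": resource},
        "user": {"username": username},
        "sourceIPs": [ip],
    })


# Hypothetical sample data: two kubelets, one unrelated request, one broken line.
sample_lines = [
    make_event("RequestReceived", "create", "tokenreviews", "system:node:node-1", "10.0.0.1"),
    make_event("ResponseComplete", "create", "tokenreviews", "system:node:node-1", "10.0.0.1"),
    make_event("ResponseComplete", "create", "tokenreviews", "system:node:node-1", "10.0.0.1"),
    make_event("ResponseComplete", "create", "tokenreviews", "system:node:node-2", "10.0.0.2"),
    make_event("ResponseComplete", "get", "pods", "system:node:node-2", "10.0.0.2"),
    "{not valid json",
]

sample_counts = count_tokenreview_sources(sample_lines)
for (user, ip), n in sample_counts.most_common():
    print(user, ip, n)
```

An even spread across many node identities, as seen in this case, points to cluster-wide behaviour rather than a single misbehaving node.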
Under what circumstances does kubelet ask kube-apiserver to create a TokenReview? Reading the Kubernetes source code shows that when a client calls kubelet using ServiceAccount authentication, kubelet authenticates the request by creating a TokenReview on apiserver (the webhook mode). When the TokenReview is created, kube-apiserver invokes its built-in Authenticators. For token authentication it checks basic auth, bearer token, ServiceAccount token, bootstrap token and so on, in sequence (for the order, see how the BuildAuthenticator function constructs the chain).
The ServiceAccount token Authenticator uses ServiceAccountTokenGetter to fetch the secret through the loopback client. In K8s 1.12 this lookup is not cached, and the loopback client is rate limited to qps 50, burst 100 (SecureServingInfo.NewClientConfig, before K8s 1.17). During an avalanche this limit is easily hit, so TokenReview requests pile up until max-mutating-requests-inflight is filled, which in turn affects other write operations.
(Note: kube-apiserver's token authentication adds a local cache by default, with entries cached for 10s. kubelet's webhook token authentication also has a local cache, 2 minutes by default. A failed request is retried with backoff. But when the cluster has thousands of nodes and the caches have expired, it is hard to recover automatically once an avalanche has been triggered.)
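To get a feel for why the qps 50, burst 100 limit matters at this scale, here is a back-of-the-envelope sketch. It assumes a simple token bucket that starts full, that every pending TokenReview needs exactly one loopback request, and that all requests arrive at the same moment. The figure of 5000 pending requests is a hypothetical stand-in for "thousands of nodes", and the function names are illustrative only.

```python
def drain_seconds(pending, qps=50.0, burst=100):
    """Seconds until the last of `pending` simultaneous requests obtains a token.

    The first `burst` requests pass immediately; the rest are released
    at `qps` requests per second.
    """
    if pending <= burst:
        return 0.0
    return (pending - burst) / qps


def served_before_timeout(pending, timeout_s=30.0, qps=50.0, burst=100):
    """How many of `pending` requests obtain a token within `timeout_s` seconds."""
    return min(pending, burst + int(qps * timeout_s))


# Hypothetical scenario: caches have expired and 5000 TokenReviews arrive at once.
pending = 5000
print("seconds to drain:", drain_seconds(pending))
print("served within 30s:", served_before_timeout(pending))
print("timed out:", pending - served_before_timeout(pending))
```

Under these assumptions the backlog takes far longer to drain than the 30s request timeout, and each timed-out request is retried, which is consistent with the cluster failing to recover on its own.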
Having found the cause, how could we recover quickly? The Kubernetes code showed that in this version the loopback client's rate limit is hardcoded and cannot be changed through configuration. Changing it would mean switching to another Kubernetes version, which is a relatively large change.
Looking at it from another angle: if we could find out who was using this ServiceAccount to call kubelet, could that solve the problem? The audit logs showed that the requesting user was almost always the same ServiceAccount, system:serviceaccount:metrics-server. That explained everything. metrics-server needs to fetch monitoring data from kubelet, so it calls the kubelet on every node, and in a large cluster this easily triggers the rate limit of kube-apiserver's loopback client.
Solution
With the cause and the source identified, the problem was easy to solve: switching metrics-server to certificate authentication or static token authentication works around the problem for now.
In addition, K8s 1.17 removed the loopback client's rate limit, and from K8s 1.14 onward ServiceAccountTokenGetter reads from an informer first and only falls back to requesting apiserver through the loopback client if that fails. The fundamental fix is therefore to upgrade the cluster's master version.
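The cache-first behaviour described for K8s 1.14 and later can be illustrated with a small sketch. This is a Python illustration of the general pattern only, not the Kubernetes Go implementation. The class name, the dictionary standing in for the informer cache, and the fetch_remote callback standing in for the loopback client are all hypothetical.

```python
class CacheFirstSecretGetter:
    """Look up a secret in a local cache first; call the remote API only on a miss."""

    def __init__(self, cache, fetch_remote):
        self.cache = cache                # stands in for the informer's local store
        self.fetch_remote = fetch_remote  # stands in for the rate-limited loopback client
        self.remote_calls = 0

    def get_secret(self, namespace, name):
        key = (namespace, name)
        if key in self.cache:
            return self.cache[key]        # served locally, no API request
        self.remote_calls += 1
        return self.fetch_remote(namespace, name)


def fake_remote(namespace, name):
    """Hypothetical remote lookup used only for this demo."""
    return {"namespace": namespace, "name": name, "source": "remote"}


local_cache = {
    ("kube-system", "metrics-server-token"): {"source": "cache"},
}
getter = CacheFirstSecretGetter(local_cache, fake_remote)

# Repeated lookups of a cached secret never reach the remote API.
for _ in range(1000):
    hit = getter.get_secret("kube-system", "metrics-server-token")

# A secret missing from the cache falls back to the remote call.
miss = getter.get_secret("default", "not-cached")
print(hit["source"], miss["source"], getter.remote_calls)
```

With lookups normally served from the local cache, the number of loopback requests no longer grows with the number of nodes being scraped.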
Appendix
Related discussions in the Kubernetes community:
https://github.com/kubernetes/kubernetes/issues/71811
https://github.com/kubernetes/kubernetes/pull/71816