A memory leak caused by a timeout call implemented with context and goroutines
2022-06-24 16:15:00 【Johns】
Background
A project of ours was about to launch, and before going live we needed to load-test a single node to work out a deployment plan for each service. While running the load test with Tencent Cloud's load-testing tool, I came across a very interesting situation. First, the monitoring charts:
I ran two load tests at around 10:00, each lasting no more than 10 minutes. The CPU usage chart shows the CPU utilization of the machine under test rising sharply, and usage_bytes and RSS memory rose at the same time. The problem is that CPU usage dropped once the tests were over, but the memory was not released over the next few hours. Clearly something in the program was holding on to a large chunk of memory.
So I used pprof to look at the following metrics on the machine after the test 【Note: this was 10 minutes after the load test had finished】:
You can see that 15318 goroutines are still alive and that there are 2979 objects on the heap. From the NIC traffic trend chart we know that network traffic was basically back to normal after the load test. 【It is not 0 because in my test environment a script sends test traffic periodically; a real load test should not have this kind of interfering traffic.】
Let's go into the goroutine details and see where so many goroutines are stuck.
You can see that /data/ggr/workspace/internal/xxx_recommend/service/xxx_recommend_algo/xxx_recommend_algo.go:153
is holding 14235 goroutines.
So let's see what line 153 is doing wrong.
For convenience, I simplified the code into a test case, as follows:
package xxx_recommend_algo

import (
	"context"
	"errors"
	"testing"
	"time"
)

// Note: "xxx" is an anonymized placeholder. go test only runs Test functions
// whose name does not continue with a lowercase letter after "Test".
func TestxxxRecommendAlgo(t *testing.T) {
	// goroutine A
	go func() {
		// Set the Context timeout (200ms in this simplified test)
		backGroundCtx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		// 5.2 Get the scores of the recommended items from the algorithm side's
		// model service via gRPC. If the Context deadline passes first, the
		// model call is considered to have timed out.
		xxxRecommendChannel := make(chan *AlgoServingResponse)
		// goroutine B
		go getXXXRecommend(backGroundCtx, xxxRecommendChannel, t)
		select {
		case xxxRecommendResult := <-xxxRecommendChannel:
			if xxxRecommendResult.err != nil {
				return
			}
			for _, v := range xxxRecommendResult.scores {
				t.Log(v)
			}
		case <-backGroundCtx.Done():
			return
		}
		t.Log(backGroundCtx.Deadline())
	}()
	time.Sleep(time.Second * 10)
	t.Log("ok")
}

func getXXXRecommend(ctx context.Context, xxxRecommendResult chan *AlgoServingResponse, t *testing.T) {
	time.Sleep(time.Second) // simulate a remote RPC request
	t.Log("ok1")
	xxxRecommendResult <- &AlgoServingResponse{err: errors.New("error")}
	t.Log("ok2")
}

// AlgoServingResponse is the result returned by the algorithm recommendation service.
type AlgoServingResponse struct {
	err    error
	scores map[string]int
}
Analysis
This code uses Context to implement a call with a timeout. If the algorithm service does not return within the Context timeout (200ms in the simplified test), goroutine A times out on its own instead of waiting for the algorithm service. Goroutine B is responsible for the RPC call to the algorithm service. When the algorithm service does not time out, goroutine B is not left hanging. But once the algorithm service times out, goroutine A has already returned by the time goroutine B writes to the channel xxxRecommendResult. The channel is unbuffered and nobody is receiving from it anymore, so goroutine B blocks on the send forever. As the number of timeouts grows, more and more goroutines are blocked, and memory eventually blows up.
If we run the current code, we find that ok2 is never printed.
=== RUN   TestxxxRecommendAlgo
    xxx_recommend_algo_test.go:38: ok1
    xxx_recommend_algo_test.go:39: context deadline exceeded
    xxx_recommend_algo_test.go:33: ok
--- PASS: TestxxxRecommendAlgo (10.00s)
PASS
As long as main does not exit, goroutine B stays blocked forever!
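The growth in blocked goroutines can be reproduced outside the service. The following is a sketch, not the production code: leakyCall is a hypothetical stand-in for goroutine A plus goroutine B, leakedAfter is a made-up measuring helper, and the durations are arbitrary. It compares runtime.NumGoroutine before and after a batch of calls that all time out.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// leakyCall mimics the pattern from the article: the caller waits on an
// unbuffered channel with a timeout, while the worker sends unconditionally.
func leakyCall(timeout, work time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	result := make(chan int) // unbuffered: a send needs a ready receiver
	go func() {
		time.Sleep(work) // simulated slow RPC
		result <- 1      // blocks forever once the caller has returned
	}()

	select {
	case <-result:
	case <-ctx.Done(): // the caller gives up; nobody receives after this
	}
}

// leakedAfter runs n calls that all time out and reports how many goroutines
// are still alive afterwards compared with before.
func leakedAfter(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		leakyCall(time.Millisecond, 20*time.Millisecond)
	}
	time.Sleep(100 * time.Millisecond) // give every worker time to reach its send
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfter(100))
}
```

Each timed-out call leaves one worker parked on its send, so the reported number should track the number of calls, which matches the steadily growing goroutine count seen in pprof.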
Solution 1
Before writing to the channel, check whether the Context has already timed out or been canceled. If it has, just return without sending. Nothing else needs to change.
func getXXXRecommend(ctx context.Context, xxxRecommendResult chan *AlgoServingResponse, t *testing.T) {
	time.Sleep(time.Second) // simulate a remote RPC request
	t.Log("ok1")
	// ctx.Err() is nil only while the Context is still live. After a timeout it
	// is context.DeadlineExceeded, and the receiver is already gone.
	if ctx.Err() == nil {
		xxxRecommendResult <- &AlgoServingResponse{err: errors.New("error")}
	}
	t.Log("ok2")
}
Solution 2
A better solution is to move the timeout control into the remote call method itself and turn the asynchronous call into a synchronous one. Since I only have one remote call to make, there is no need to start a new goroutine to run it.
package xxx_recommend_algo

import (
	"context"
	"errors"
	"testing"
	"time"

	"google.golang.org/grpc"
)

func TestxxxRecommendAlgo(t *testing.T) {
	// goroutine A
	go func() {
		// Overall Context timeout for the request: 200ms
		backGroundCtx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		// 5.2 Get the scores of the recommended items from the algorithm side's
		// model service via gRPC. The call is synchronous and enforces its own timeout.
		xxxRecommendResult := getXXXRecommend(backGroundCtx, t)
		if xxxRecommendResult.err != nil {
			return
		}
		for _, v := range xxxRecommendResult.scores {
			t.Log(v)
		}
	}()
	time.Sleep(time.Second * 10)
	t.Log("ok")
}

func getXXXRecommend(ctx context.Context, t *testing.T) *AlgoServingResponse {
	// Keep the timeout as tight as possible around the real call (50ms),
	// derived from the caller's Context so the outer deadline still applies.
	callCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	clientConn, err := grpc.DialContext(callCtx, "") // target address elided in the original
	if err != nil {
		return &AlgoServingResponse{err: err}
	}
	defer clientConn.Close()
	// ...
	t.Log("ok1")
	return &AlgoServingResponse{err: errors.New("error")}
}

// AlgoServingResponse is the result returned by the algorithm recommendation service.
type AlgoServingResponse struct {
	err    error
	scores map[string]int
}
Summary
【1】 A memory leak does not necessarily crash the program right away, but every leak should be dealt with.
【2】 An unbuffered channel in Go is a channel that cannot hold any value before it is received. This type of channel requires the sending goroutine and the receiving goroutine to be ready at the same time for the send and receive to complete.
If the two goroutines are not ready at the same time, the channel makes whichever goroutine performs its send or receive first block and wait. Sending and receiving on such a channel is inherently synchronous: neither operation can complete without the other.
【3】 A buffered channel in Go does not block the sender or the receiver under normal circumstances, but a send blocks when the buffer is full and a receive blocks when the buffer is empty. Keep this in mind.