A memory leak caused by implementing timeout control with context and goroutines
2022-06-24 16:15:00 【Johns】
Background
A project was launched recently, and before going live we needed to run a single-node load test to estimate the deployment plan for each service. While load testing with Tencent Cloud's load-testing tool, I ran into a very interesting situation. First, the monitoring chart:
I ran 2 load tests at around 10:00, each lasting no more than 10 minutes. The CPU usage chart shows the CPU utilization of the machine under test rising sharply, and usage_bytes and RSS memory rose at the same time. The problem is that CPU usage dropped once the load test ended, but the memory was not released over the next few hours. Clearly something in the program was holding on to a large chunk of memory.
So I used pprof to look at the machine's performance indicators after the test (note: this was 10 minutes after the load test had finished):
There were still 15318 goroutines in use, and the heap held 2979 objects. From the network card traffic trend chart we know that traffic was basically back to normal after the load test. (It is not 0 because in my test environment a script regularly sends test traffic; there should be no such interfering traffic during a real load test.)
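For readers who want to collect the same kind of numbers, here is a minimal sketch of one way to do it with the standard library. It is not the author's setup: the helper name `goroutineReport` and the address `localhost:6060` are assumptions chosen for illustration. Importing `net/http/pprof` registers the `/debug/pprof/` handlers, and `runtime/pprof` can produce the same goroutine profile in-process.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"runtime/pprof"
	"time"
)

// goroutineReport (hypothetical helper) returns the number of live
// goroutines and a text dump that groups them by identical stack trace.
func goroutineReport() (int, string) {
	p := pprof.Lookup("goroutine")
	var buf bytes.Buffer
	_ = p.WriteTo(&buf, 1) // debug=1: one entry per distinct stack, with a count
	return p.Count(), buf.String()
}

func main() {
	// Assumed address; use whatever port is free in your environment.
	go func() { _ = http.ListenAndServe("localhost:6060", nil) }()

	// Park a few goroutines so the report has something to show.
	block := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { <-block }()
	}
	time.Sleep(100 * time.Millisecond)

	n, dump := goroutineReport()
	fmt.Println("goroutines:", n)
	fmt.Println(dump)
}
```

In a long-running service, the same data can be fetched from outside with `go tool pprof http://localhost:6060/debug/pprof/goroutine`, or viewed as text at `/debug/pprof/goroutine?debug=1`. Stacks with a large count are the first place to look for blocked goroutines.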
Drilling into the goroutine details shows where so many goroutines are hanging.
You can see that /data/ggr/workspace/internal/xxx\_recommend/service/xxx\_recommend\_algo/xxx\_recommend\_algo.go:153 is holding 14235 goroutines.
Now let's see what line 153 was doing.
For convenience, I simplified the code into a test case, as follows:
package xxx_recommend_algo

import (
	"context"
	"errors"
	"testing"
	"time"
)

func TestxxxRecommendAlgo(t *testing.T) {
	// goroutine A
	go func() {
		// Set the Context timeout (200ms in this test case).
		backGroundCtx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		// 5.2 Fetch the scores of the recommended items from the algorithm-side model service over gRPC.
		// If the call exceeds the timeout, the model is considered to have timed out.
		xxxRecommendChannel := make(chan *AlgoServingResponse)
		// goroutine B
		go getXXXRecommend(backGroundCtx, xxxRecommendChannel, t)
		select {
		case xxxRecommendResult := <-xxxRecommendChannel:
			if xxxRecommendResult.err != nil {
				return
			}
			for _, v := range xxxRecommendResult.scores {
				t.Log(v)
			}
		case <-backGroundCtx.Done():
			// Timed out: goroutine A returns and nobody reads the channel any more.
			return
		}
		t.Log(backGroundCtx.Deadline())
	}()
	time.Sleep(time.Second * 10)
	t.Log("ok")
}

func getXXXRecommend(ctx context.Context, xxxRecommendResult chan *AlgoServingResponse, t *testing.T) {
	time.Sleep(time.Second) // Simulate the remote RPC request
	t.Log("ok1")
	// Unbuffered send: blocks until someone receives.
	xxxRecommendResult <- &AlgoServingResponse{err: errors.New("error")}
	t.Log("ok2")
}

// AlgoServingResponse is the result returned by the algorithm recommendation service.
type AlgoServingResponse struct {
	err    error
	scores map[string]int
}
Analysis
This code uses Context to implement a call with a timeout: if the algorithm does not return within the timeout (200ms in the test case above), goroutine A gives up on its own instead of waiting for the algorithm service to finish. goroutine B is responsible for the RPC call to the algorithm service. When the algorithm does not time out, goroutine B is not left hanging. But once the algorithm service times out, goroutine A has already returned. When goroutine B then writes to the channel xxxRecommendResult, no receiver is left, so goroutine B blocks on the channel forever. As timeouts accumulate, more and more goroutines are blocked, and memory usage eventually blows up.
If we run the code, we find that ok2 is never printed.
=== RUN TestxxxRecommendAlgo
xxx_recommend_algo_test.go:38: ok1
xxx_recommend_algo_test.go:39: context deadline exceeded
xxx_recommend_algo_test.go:33: ok
--- PASS: TestxxxRecommendAlgo (10.00s)
PASS
As long as main does not exit, goroutine B stays blocked forever!
Solution 1
Before writing to the channel, check whether the Context has already timed out. If it has, just return. Nothing else needs to change.
func getXXXRecommend(ctx context.Context, xxxRecommendResult chan *AlgoServingResponse, t *testing.T) {
	time.Sleep(time.Second) // Simulate the remote RPC request
	t.Log("ok1")
	if ctx.Err() != nil {
		// The caller has already timed out or cancelled; nobody will receive.
		return
	}
	// Note: the context can still expire between the check above and this send.
	xxxRecommendResult <- &AlgoServingResponse{err: errors.New("error")}
	t.Log("ok2")
}
Solution 2
A better solution is to narrow the scope of the timeout to the remote call itself and change the asynchronous call into a synchronous one. Since I only have one remote call to make, there is no need to start a new goroutine for it.
package xxx_recommend_algo

import (
	"context"
	"errors"
	"testing"
	"time"

	"google.golang.org/grpc"
)

func TestxxxRecommendAlgo(t *testing.T) {
	// goroutine A
	go func() {
		// Set the Context timeout (200ms in this test case).
		backGroundCtx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()
		// 5.2 Fetch the scores of the recommended items from the algorithm-side model service over gRPC.
		// The call is now synchronous: no channel and no extra goroutine.
		xxxRecommendResult := getXXXRecommend(backGroundCtx, t)
		if xxxRecommendResult.err != nil {
			return
		}
		for _, v := range xxxRecommendResult.scores {
			t.Log(v)
		}
	}()
	time.Sleep(time.Second * 10)
	t.Log("ok")
}

func getXXXRecommend(ctx context.Context, t *testing.T) *AlgoServingResponse {
	// Limit the timeout to the actual remote call.
	callCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()
	// The target address and dial options are left empty, as in the original sketch.
	clientConn, err := grpc.DialContext(callCtx, "")
	if err != nil {
		return &AlgoServingResponse{err: err}
	}
	defer clientConn.Close()
	// ... (the RPC call itself is elided)
	t.Log("ok1")
	resp := &AlgoServingResponse{err: errors.New("error")}
	t.Log("ok2")
	return resp
}

// AlgoServingResponse is the result returned by the algorithm recommendation service.
type AlgoServingResponse struct {
	err    error
	scores map[string]int
}
Summary
【1】 A memory leak does not necessarily crash the program immediately, but any leak should be dealt with.
【2】 An unbuffered channel in Go is a channel that cannot hold any value before it is received. It requires the sending goroutine and the receiving goroutine to be ready at the same time for a send and receive to complete.
If the two goroutines are not ready at the same time, the channel makes whichever goroutine performs its send or receive first block and wait. Sending and receiving on such a channel is inherently synchronous: neither operation can complete without the other.
【3】 A buffered channel in Go normally blocks neither the sender nor the receiver, but it blocks the sender when the buffer is full and blocks the receiver when the buffer is empty. This must be kept in mind.