Background

One of our company's services works a bit like a web crawler: given a URL, it fetches the page and extracts the title, cover image, summary, icon, and so on from the returned HTML. The service is purely in-memory with no DB access, and its downstream (the URLs to be parsed) consists of external sites rather than internal services, so there was no need to worry about putting concurrent load on them. When building the service we therefore chose WebFlux as the web-layer framework and Spring's WebClient as the HTTP client for downstream requests.

The service is deployed in containers on k8s. The JDK is OpenJDK 11, each Pod is configured with 4C4G (4 CPU cores, 4 GB of memory), and the Java service's maximum heap size is set to 2 GB.

Problem description

After the service went live, request volume was low. But after running for a long time, heap usage reached 99%, a large number of OOM errors appeared in log monitoring, and then the container Pod restarted. The service worked normally for a while after each restart, then heap usage climbed to 99% again and the OOM errors returned.

Troubleshooting process

Preliminary analysis

Container monitoring showed the Pod's memory usage for the period before each restart. The chart rose steadily, and the Pod restarted once usage reached the maximum heap allocation. Our initial guess was a memory leak.

Running jmap -histo:live 1 to view the distribution of live objects showed that byte arrays were taking up a disproportionate share of memory and that there were also many PoolSubpage objects, which pointed to a memory leak in Netty.

Filtering the ERROR logs in ELK turned up, besides the OOM errors, a small number of Netty error messages with the following stack trace:

LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
    io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:332)
    io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:168)
    io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:159)
    io.netty.handler.codec.compression.JdkZlibDecoder.decode(JdkZlibDecoder.java:180)
    io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:493)
    io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:432)
    ...

The message shows that a heap-memory ByteBuf was garbage-collected without first being released. Netty manages heap buffers through a memory pool, so when a ByteBuf is collected by the GC without release() having been called, the reference count of the pooled memory block behind it never drops to zero and the block is never returned to the pool. Once the ByteBuf itself has been collected, the application has no way left to call release(), so the block is leaked for good. Over time a large number of pooled blocks end up in this state.
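
To make the mechanism concrete, here is a minimal toy model in plain Java. It is not Netty's implementation: ToyPool, ToyBuf, and their methods are hypothetical names invented for this illustration. The pool owns a fixed arena and hands out slots with a reference count of 1. A slot only becomes reusable when release() brings the count to zero, so a buffer that is dropped without release() keeps its slot occupied forever.

```java
import java.util.BitSet;

/** Toy model of a pooled, reference-counted buffer. Hypothetical; not Netty code. */
final class ToyPool {
    private final byte[] arena;               // strongly held by the pool, so the GC never reclaims it
    private final int slots;
    private final BitSet used = new BitSet(); // one bit per slot: set means "handed out"

    ToyPool(int slots, int slotSize) {
        this.slots = slots;
        this.arena = new byte[slots * slotSize];
    }

    /** Hands out a free slot with refCnt = 1, or fails if every slot is taken. */
    ToyBuf allocate() {
        int slot = used.nextClearBit(0);
        if (slot >= slots) {
            throw new IllegalStateException(
                    "pool exhausted: " + used.cardinality() + " slots never released");
        }
        used.set(slot);
        return new ToyBuf(this, slot);
    }

    int freeSlots() {
        return slots - used.cardinality();
    }

    static final class ToyBuf {
        private final ToyPool pool;
        private final int slot;
        private int refCnt = 1;

        private ToyBuf(ToyPool pool, int slot) {
            this.pool = pool;
            this.slot = slot;
        }

        /** Decrements the count; the slot returns to the pool only when it reaches zero. */
        boolean release() {
            if (refCnt == 0) {
                throw new IllegalStateException("already released");
            }
            refCnt--;
            if (refCnt == 0) {
                pool.used.clear(slot);
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        ToyPool pool = new ToyPool(2, 1024);

        ToyBuf handled = pool.allocate();
        handled.release();   // correct usage: the slot goes back to the pool

        pool.allocate();     // reference dropped without release(): the slot is lost

        System.out.println("free slots: " + pool.freeSlots());
    }
}
```

In this model, as in the leak described above, garbage-collecting the ToyBuf wrapper changes nothing: the pool still considers the slot in use, and repeated leaks eventually make allocate() fail.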

Locating the problem

The components in the project that use Netty are Redisson, WebFlux, and WebClient. These third-party libraries are mature and widely used in commercial projects, so the bug was unlikely to be in library code; more likely we were using one of them incorrectly. The only one our own code uses directly is WebClient, which requests the HTML of third-party pages.

The business scenario requires reading both the response headers and the response body. The Content-Type header is parsed for the declared encoding; the body is read as raw binary data in order to determine the page's real encoding.
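
As an illustration of the header half, the declared charset can be pulled out of a Content-Type value with the JDK alone. This is a sketch, not the project's code: the class and method names are hypothetical, and the regex only handles the common charset=... parameter form.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the declared charset from a Content-Type header value. */
final class ContentTypeCharset {
    // Matches charset=utf-8 as well as charset="utf-8", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat it as "not declared".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=GBK"));
        System.out.println(parse("text/html"));
    }
}
```

In a Spring application the same information is available from MediaType#getCharset, so a helper like this is only needed when working with the raw header string.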

We need to determine the real encoding because some third-party pages declare UTF-8 in the Content-Type response header while actually being encoded in GBK or GB2312, which garbles Chinese summaries when they are parsed. So the binary stream has to be read and the real encoding inferred from its content. Anyone who has written a crawler will know the problem.
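
One simple way to catch the "declared UTF-8, actually GBK" case is to decode the bytes strictly and fall back when decoding fails. The sketch below is an assumption about how such a check could look, not the detection logic the project actually uses. CharsetSniffer and its methods are hypothetical names, and the code assumes the JDK in use ships the GBK charset (standard OpenJDK builds do).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses the real encoding of a page from its raw bytes. */
final class CharsetSniffer {
    private static final Charset GBK = Charset.forName("GBK"); // superset of GB2312

    /** Returns true only if every byte sequence is valid in the given charset. */
    static boolean decodesCleanly(byte[] body, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(body));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Trusts the declared charset only if the bytes decode cleanly with it. */
    static Charset detect(byte[] body, Charset declared) {
        if (declared != null && decodesCleanly(body, declared)) {
            return declared;
        }
        if (decodesCleanly(body, StandardCharsets.UTF_8)) {
            return StandardCharsets.UTF_8;
        }
        return GBK; // last-resort fallback for Chinese pages
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese string, encoded here as GBK.
        byte[] gbkBody = "\u4e2d\u6587".getBytes(GBK);
        // The header claims UTF-8, but strict decoding fails, so the fallback is used.
        System.out.println(detect(gbkBody, StandardCharsets.UTF_8));
    }
}
```

The check works in this direction because GBK-encoded Chinese text is almost never valid UTF-8. The reverse is weaker: many byte sequences decode without error in GBK, so a wrongly declared GBK would not necessarily be caught by this sketch.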

WebClient provides several ways to obtain the response:

WebClient.RequestHeadersSpec#retrieve
Converts the body directly into an object of the specified type, but does not give direct access to the response;
WebClient.RequestHeadersSpec#exchange
Gives direct access to the response, but reading the body is left to the caller;

To meet these requirements the project uses WebClient.RequestHeadersSpec#exchange, which is also the only place in the project that touches ByteBuf data directly. Our code only read data through this method and never released the body. The method's Javadoc happens to contain this note:
[Screenshot: Javadoc NOTE on WebClient.RequestHeadersSpec#exchange]

The NOTE reads roughly as follows:

Unlike retrieve(), when using exchange() the application is responsible for consuming the response content in every scenario (success, error, unexpected data, and so on). Failing to do so can cause a memory leak. See ClientResponse for the available ways to consume the body. In general, prefer retrieve() unless you have a good reason to use exchange(), which lets you check the response status and headers before deciding whether and how to consume the body.

And that is exactly what happened: when certain business checks failed, for example when the Content-Type indicated that the returned data was not HTML, the application code returned immediately without consuming the body, which caused the memory leak.

// Sample request code
WebClient.builder().build()
    .get()
    .uri(ctx.getUri())
    .headers(headers -> {
        headers.set(HttpHeaders.USER_AGENT, CHROME_AGENT);
        headers.set(HttpHeaders.HOST, ctx.getUri().getHost());
    })
    .cookies(cookies -> ctx.getCookies().forEach(cookies::add))
    .exchange()
    .flatMap(response -> {
        // Check again whether the request has timed out.
        // Note: this returns Mono.error directly, without releasing the response.
        if (ctx.isParseTimeout(PARSE_TIMEOUT)) {
            return Mono.error(ReadTimeoutException.INSTANCE);
        }

        // Handle a redirect first; if there is no redirect, parse the body.
        return judgeRedirect(response, ctx)
                .flatMap(redirectTo -> followRedirect(ctx, redirectTo))
                .switchIfEmpty(Mono.defer(() -> Mono.just(parser.parse(ctx))))
                .map(LinkParseResult::detectParseFail);
    })

Solution

With the cause identified, the official documentation already gives the solution: see ClientResponse for the available ways to consume the body. The Javadoc of the ClientResponse interface lists all the methods for consuming the response, such as bodyToMono, bodyToFlux, toEntity, toBodilessEntity, and releaseBody.
[Screenshot: ClientResponse Javadoc listing the methods for consuming the response body]

We will not go through each method here. For our business scenario, releaseBody() should be called to release the body whenever it does not need to be consumed. The modified code is as follows:

// Sample request code
WebClient.builder().build()
    .get()
    .uri(ctx.getUri())
    .headers(headers -> {
        headers.set(HttpHeaders.USER_AGENT, CHROME_AGENT);
        headers.set(HttpHeaders.HOST, ctx.getUri().getHost());
    })
    .cookies(cookies -> ctx.getCookies().forEach(cookies::add))
    .exchange()
    .flatMap(response -> {
        // Check again whether the request has timed out, and release the response body if so.
        if (ctx.isParseTimeout(PARSE_TIMEOUT)) {
            return response.releaseBody()
                    .then(Mono.error(ReadTimeoutException.INSTANCE));
        }

        // Handle a redirect first; if there is no redirect, parse the body.
        return judgeRedirect(response, ctx)
                .flatMap(redirectTo -> followRedirect(ctx, redirectTo))
                .switchIfEmpty(Mono.defer(() -> Mono.just(parser.parse(ctx))))
                .map(LinkParseResult::detectParseFail);
    })

Summary

When using the reactive HTTP client WebClient, we received response data through exchange() but did not call ClientResponse#releaseBody() in some code branches. As a result a large amount of data was never released, Netty's memory pool filled up, and subsequent requests failed with OOM errors when allocating memory.

Lesson learned: when using an unfamiliar third-party library, be sure to read the method and class documentation.
