
Through the Fog: Notes on Locating a Flink Crash with Multi-Component Timeouts

2022-06-24 04:06:00 KyleMeow

Symptoms

Last Thursday afternoon, the alerting system suddenly reported that a key customer's job was frequently crashing and restarting. The symptom was that the job would run for about 2 minutes, then the JobManager would detect that a TaskManager had lost its heartbeat, and the job would crash and restart, seriously affecting the customer's online business.

The TaskManager suddenly lost contact (and exited)

Looking at the logs of the disconnected TaskManager, I found a large number of ZooKeeper connection timeout errors, and subsequent retries were also unsuccessful. Flink therefore concluded that a serious exception had occurred and actively ordered the TaskManager to exit.

Error log of the TaskManager

Initial diagnosis

Since the job had crashed more than once, I checked the logs of earlier runs and saw the same flood of ZooKeeper connection timeouts and errors, so I started with the ZooKeeper server.

After investigation, everything was normal on the ZooKeeper server side: there were no error logs, and all metrics were healthy. Moreover, if the ZooKeeper server had really failed, other jobs in the same cluster would most likely have been affected, yet no errors were observed in other jobs, so the probability of a ZooKeeper server failure was very small.
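As a side note not in the original article, one quick way to spot-check ZooKeeper server health is the four-letter-word commands; the host name below is a placeholder, and in ZooKeeper 3.5+ these commands must be whitelisted via 4lw.commands.whitelist:

    # "ruok" should answer "imok" if the server is up and healthy.
    echo ruok | nc <zk-host> 2181
    # "mntr" dumps latency, outstanding requests, and connection counts.
    echo mntr | nc <zk-host> 2181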

Next, I analyzed the customer's JAR package and found that it did not introduce any incompatible ZooKeeper or Curator library versions, so problems such as client version conflicts were basically ruled out.
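For illustration (the exact commands are not from the original article), this kind of dependency check can be as simple as listing the JAR's contents or inspecting the build's dependency tree; the JAR name is a placeholder:

    # Look for bundled ZooKeeper/Curator classes inside the job JAR.
    jar tf customer-job.jar | grep -iE 'zookeeper|curator'
    # For a Maven build, inspect which ZooKeeper/Curator versions are pulled in.
    mvn dependency:tree -Dincludes=org.apache.zookeeper,org.apache.curator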

So the question became: what could cause the ZooKeeper connections to time out, and the retries to fail to recover for so long?

Collecting more error information

Since the job was still crashing and restarting, the subsequent runs yielded some new discoveries: Kafka and the Prometheus metrics-reporting path were also timing out:

Kafka also timed out
Prometheus metrics reporting also timed out

These errors suggest a potential network problem: for example, a failed NIC on the host running the container, or widespread packet loss and congestion, could all produce the errors above.

But the strange thing was that the faulty Pods were spread across different host nodes, other Pods on those hosts were running normally, and the traffic metrics of each host were within the normal range, so it did not look like a problem caused by a few faulty nodes.

So the network-failure hypothesis was rejected as well.

The immediate cause emerges

Since an environmental problem was now very unlikely, it was time to analyze the job itself. Looking at the Full GC frequency, the abnormality was obvious:

Old-generation GC count of the TaskManager

Under normal circumstances, the old-generation GC count of a TaskManager should be in the single or double digits, but here it was in the thousands. This indicates very high memory pressure, with almost no garbage reclaimable in each collection, causing the JVM to keep doing GC.
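As a quick check not shown in the original article, the JDK's jstat tool reports these counters directly on a running TaskManager; the process ID is a placeholder:

    # Print GC utilization every second: the O column is old-generation occupancy (%),
    # FGC/FGCT are the Full GC count and accumulated Full GC time.
    jstat -gcutil <taskmanager-pid> 1000

A Full GC count that keeps climbing while the O column stays close to 100% matches exactly the pattern observed here.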

We also know that when a GC happens, the JVM has stop-the-world pauses during which all threads are suspended. If the JVM keeps doing GC, normal threads are severely affected; eventually heartbeat packets fail to be sent out, or connections cannot be kept alive and time out.

The question then became: what in the job was causing so much memory pressure?

In-depth analysis

Since the direct cause had been identified as excessive heap memory pressure that GC could not relieve, a memory leak was likely. The classic memory-leak scenario is that the user keeps too many objects in containers such as List or Map; these objects are strongly referenced, cannot be collected, and keep occupying memory.

The job crashed frequently and the problem kept recurring, so when it occurred again I entered the Pod, took a heap dump (for example with the jmap command that ships with the JDK), and then analyzed the dump file:
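For reference, a typical jmap invocation for this looks like the following; the process ID and output path are placeholders:

    # Dump only live objects (this triggers a Full GC) in binary HPROF format;
    # the file can then be copied out of the Pod and opened in JProfiler or MAT.
    jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <taskmanager-pid>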

Using JProfiler to analyze object retention in the dump file

The analysis showed that a single HashMap occupied 96% of the space, essentially filling the heap. After communicating with the customer and obtaining authorization to analyze the program source code, we found that this was a cache pool implemented by the user: as data keeps flowing in and the Watermark advances, the contents of the cache are replaced dynamically. When data flows in too fast and stale cache entries are not cleaned up in time, this puts enormous pressure on GC.
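As an illustration only (the customer's actual code is not reproduced here), one way to keep such a hand-rolled cache from growing without bound is to cap its size and evict the least-recently-used entries; the class and the 100,000-entry limit below are hypothetical:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of a size-bounded LRU cache: once MAX_ENTRIES is exceeded,
    // the least-recently-used entry is evicted, so the map cannot fill the heap.
    public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private static final int MAX_ENTRIES = 100_000; // hypothetical limit, tune per job

        public BoundedCache() {
            super(16, 0.75f, true); // access-order mode enables LRU eviction
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > MAX_ENTRIES;
        }
    }

Eviction could equally be driven by the Watermark instead of a size limit; the point is that some cleanup rule must exist.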

It is not only ordinary container objects that have this problem. Flink's built-in state (MapState, ListState, etc.) can also grow extraordinarily large because the Watermark advances too slowly, windows are too large, or a multi-stream JOIN has to wait for alignment. If no State TTL or similar cleanup is configured, this can also destabilize the JVM, especially with the Heap state backend. So when writing Flink jobs, be very careful with operations that may accumulate a large backlog of state.
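A minimal sketch of enabling a State TTL with Flink's StateTtlConfig API; the descriptor name, types, and one-hour retention are arbitrary examples, and signatures may vary slightly across Flink versions:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.time.Time;

    public class TtlExample {
        // Builds a MapState descriptor whose entries expire one hour after the last write,
        // so stale keys are eventually cleaned up instead of piling up in the backend.
        public static MapStateDescriptor<String, Long> buildDescriptor() {
            StateTtlConfig ttlConfig = StateTtlConfig
                    .newBuilder(Time.hours(1))
                    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                    .build();

            MapStateDescriptor<String, Long> descriptor =
                    new MapStateDescriptor<>("userCache", String.class, Long.class);
            descriptor.enableTimeToLive(ttlConfig);
            return descriptor;
        }
    }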

If business logic makes it hard to reduce the total amount of state, we recommend the RocksDB state backend (which is also the current default on the Tencent Cloud Oceanus platform). Of course, compared with the Heap state backend, the RocksDB state backend has higher processing latency and lower throughput, so the choice should be made according to the actual scenario.
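For reference, a sketch of switching to the RocksDB state backend in code, assuming Flink 1.13+ with the flink-statebackend-rocksdb dependency on the classpath; the checkpoint path is a placeholder:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbBackendExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Keep state in RocksDB (spilled to local disk) instead of the JVM heap,
            // trading some latency and throughput for far lower GC pressure.
            env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints

            // Checkpoints still need a durable location; the path is a placeholder.
            env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
        }
    }

The same switch can also be made cluster-wide in flink-conf.yaml via state.backend: rocksdb, which is how managed platforms typically set the default.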

Summary and reflections

In fact, locating this problem took a detour. The initial alarm arrived in the form of logs, and because several consecutive failed instances all contained a large number of ZooKeeper errors beforehand, it was taken for granted that the investigation should focus on ZooKeeper-related components. Later, after finding that other components also reported timeouts, the focus shifted to a network failure; only after looking at the monitoring did we find that the pauses were caused by GC. If we had looked at Flink's monitoring data before starting to locate the problem, the cause would have been much easier to find.

Therefore, when locating a problem, we must comprehensively analyze metrics, logs, environment data, and so on, and first distinguish which errors and exceptions are the direct cause (usually the one that happened first) and which are indirect, secondary failures. If we are drawn to the latter from the start, it is easy to miss the forest for the trees and spend a long time analyzing something that is not the real problem.

For example, the IllegalStateException: Buffer pool is destroyed and InterruptedException errors that are common in Flink logs usually occur only after some other problem has already happened. We can ignore them and keep looking at earlier log entries until the root cause is found.


Copyright notice
This article was created by [KyleMeow]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2021/09/20210912200940078N.html