Through the Fog: Troubleshooting a Flink Crash Accompanied by Timeouts in Multiple Components
2022-06-24 04:06:00 【KyleMeow】
Symptoms
Last Thursday afternoon, the alerting system reported that a key customer's job was crashing and restarting repeatedly. Each time, the job ran for about 2 minutes, the JobManager detected that a TaskManager had lost its heartbeat, and the job crashed and restarted, seriously affecting the customer's online business.
The logs of the disconnected TaskManager showed a large number of ZooKeeper connection timeout errors, and the subsequent retries also failed. Flink therefore concluded that a serious fault had occurred and shut the TaskManager down on its own initiative.
Initial diagnosis
The job had crashed more than once, and the logs of earlier runs also contained many ZooKeeper connection timeouts, so the investigation started with the ZooKeeper server.
The ZooKeeper server turned out to be fine: there were no error logs and all metrics were healthy. Moreover, if the ZooKeeper server had failed, other jobs in the same cluster would probably have been affected too, yet none of them reported errors. A server-side ZooKeeper failure was therefore very unlikely.
Next, we analyzed the customer's JAR package and found that it did not bundle an incompatible version of the ZooKeeper or Curator libraries, which largely ruled out a client version conflict.
So the question was: what was causing the ZooKeeper connections to time out, and why could retries not recover for so long?
Collecting more error information
Because the job kept crashing and restarting, later runs yielded new findings: connections to Kafka and the reporting of monitoring data to Prometheus were timing out as well.
These errors pointed to a possible network problem. For example, a failing network card on the host running the container, or widespread packet loss or congestion, would produce errors like these.
Strangely, though, the failing Pods were spread across different host nodes, the other Pods on those hosts were running normally, and traffic monitoring on each host was within the normal range. This did not look like a problem caused by a few faulty nodes.
So the network failure hypothesis was ruled out as well.
The immediate cause emerges
Since an environmental problem was very unlikely, the next step was to analyze the job itself. The Full GC frequency turned out to be clearly abnormal.
Under normal conditions, a TaskManager's old-generation GC count should be in the single digits or the tens, but here it was in the thousands. This indicates extremely high memory pressure, with each collection reclaiming almost nothing, so the JVM kept collecting over and over.
During a GC, the JVM goes through stop-the-world pauses in which all application threads are suspended. If the JVM collects almost continuously, normal threads are severely affected, and eventually heartbeat packets cannot be sent in time or connections cannot be kept alive and time out.
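As a rough illustration of how this kind of GC pressure can be spotted from inside a JVM, the sketch below sums the collection time reported by the standard `GarbageCollectorMXBean` interface and compares it with JVM uptime. The class name `GcPressureCheck` and the 30% threshold are assumptions made for this example, not values from the incident. Note that for concurrent collectors the reported time is not purely pause time, so the result is only an approximate signal.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPressureCheck {

    // Hypothetical alert threshold: warn when more than 30% of uptime went to GC.
    static final double ALERT_FRACTION = 0.30;

    /** Fraction of JVM uptime spent in garbage collection, clamped to [0, 1]. */
    public static double gcTimeFraction(long totalGcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) totalGcMillis / uptimeMillis);
    }

    /** Sum of accumulated collection time over all collectors of the current JVM. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the collector does not report it.
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double fraction = gcTimeFraction(totalGcMillis(), uptime);
        System.out.printf("GC time fraction: %.1f%%%n", fraction * 100);
        if (fraction > ALERT_FRACTION) {
            System.out.println("WARNING: the JVM spends a large share of its time in GC");
        }
    }
}
```

This only inspects the JVM it runs in. For a running TaskManager, the same per-collector counts and times are normally read from the monitoring system rather than from ad hoc code.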
That raised the next question: what was the job doing to create so much memory pressure?
In-depth analysis
Since the direct cause was heap memory pressure that GC could not relieve, a memory leak was the likely culprit. The classic scenario is user code accumulating too many objects in a container such as a List or Map: the objects are strongly referenced, so they cannot be collected and keep occupying memory.
Because the job crashed frequently and the problem recurred reliably, we entered the Pod while the problem was occurring, took a heap dump (for example, with the jmap command that ships with Java), and analyzed the dump file.
The analysis showed a single HashMap occupying 96% of the space, essentially filling the heap. After communicating with the customer and obtaining authorization, we analyzed the program source code and found that the map was a cache pool implemented by the user, whose contents are replaced dynamically as data arrives and the Watermark advances. When too much data flows in and stale cache entries are not cleaned up in time, the cache puts heavy pressure on GC.
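One way to keep such a cache from growing without bound is to combine a hard size cap with eviction driven by the Watermark. The following is a minimal sketch under assumed requirements, not the customer's implementation: the class name `BoundedEventTimeCache`, the `onWatermark` hook, and the retention parameter are all hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedEventTimeCache<K, V> {

    /** A cached value together with the event time it was stored with. */
    private static final class Timestamped<V> {
        final V value;
        final long eventTime;

        Timestamped(V value, long eventTime) {
            this.value = value;
            this.eventTime = eventTime;
        }
    }

    private final long retentionMillis;
    private final LinkedHashMap<K, Timestamped<V>> entries;

    public BoundedEventTimeCache(final int maxEntries, long retentionMillis) {
        this.retentionMillis = retentionMillis;
        // Insertion-ordered map that drops its oldest entry once the cap is exceeded.
        this.entries = new LinkedHashMap<K, Timestamped<V>>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timestamped<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public void put(K key, V value, long eventTime) {
        // Remove first so that a re-inserted key moves to the newest position.
        entries.remove(key);
        entries.put(key, new Timestamped<>(value, eventTime));
    }

    public V get(K key) {
        Timestamped<V> entry = entries.get(key);
        return entry == null ? null : entry.value;
    }

    /** Call when the watermark advances; returns the number of evicted entries. */
    public int onWatermark(long watermark) {
        long cutoff = watermark - retentionMillis;
        int evicted = 0;
        Iterator<Timestamped<V>> it = entries.values().iterator();
        while (it.hasNext()) {
            if (it.next().eventTime < cutoff) {
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    public int size() {
        return entries.size();
    }
}
```

The size cap bounds memory even if the Watermark stalls, while `onWatermark` removes entries that are older than the retention period. The eviction pass scans every entry, which is acceptable for a sketch. A production cache would more likely index entries by time.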
The problem is not limited to ordinary container objects. Flink's built-in state (MapState, ListState, and so on) can also grow extremely large when the Watermark advances too slowly, a window is too large, or a multi-stream JOIN has to align its inputs. Without measures such as State TTL, this can also destabilize the JVM, especially with the Heap state backend. So when writing Flink jobs, be very careful with operations that may accumulate a large amount of state.
If business logic makes it hard to reduce the total amount of state, we recommend the RocksDB state backend (currently the default choice on the Tencent Cloud Oceanus platform). Of course, compared with the Heap state backend, the RocksDB state backend brings higher processing latency and lower throughput, so the choice has to be made according to the actual scenario.
Summary and reflections
This investigation actually took a detour. The initial alarm arrived in the form of logs, and because several consecutive instances had logged many ZooKeeper errors before failing, we took it for granted that the ZooKeeper-related components should be the focus. Later, when other components also reported timeouts, we shifted the focus to a network failure. Only after looking at the monitoring data did we find that GC pauses were the cause. Had we looked at the Flink monitoring data before starting, the cause would have been much easier to find.
Therefore, when locating a problem, we should collect and analyze metrics, logs, and environment data together, and first distinguish which errors and exceptions are direct causes (usually the ones that happened first) and which are indirect, secondary failures. If we are drawn to the latter from the start, we are like someone whose view of Mount Tai is blocked by a single leaf: after a long analysis, it turns out not to be the real problem.
For example, errors such as IllegalStateException: Buffer pool is destroyed and InterruptedException are common in Flink logs, but they usually appear after something else has already gone wrong. We can ignore them and keep looking through earlier logs until the root cause is found.
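The log triage described above can be partly mechanized. The sketch below returns the first error line in a log file that does not match a list of known secondary symptoms. It assumes that error lines contain the token `ERROR` and that the file is in chronological order. The class name `FirstErrorFinder` and the marker list are illustrative only and should be adapted to the actual log format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FirstErrorFinder {

    // Messages the article describes as usually secondary; extend as needed.
    private static final List<String> SECONDARY_MARKERS = Arrays.asList(
            "Buffer pool is destroyed",
            "InterruptedException");

    /** True if the line matches one of the known secondary symptoms. */
    public static boolean isSecondary(String line) {
        return SECONDARY_MARKERS.stream().anyMatch(line::contains);
    }

    /** First ERROR line, in file order, that is not a known secondary symptom. */
    public static Optional<String> firstPrimaryError(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("ERROR"))
                .filter(line -> !isSecondary(line))
                .findFirst();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FirstErrorFinder <log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(firstPrimaryError(lines).orElse("no primary error found"));
    }
}
```

A filter like this only narrows the search within the logs. As this incident shows, the earliest logged error can itself be a symptom, so the result should still be checked against the monitoring data.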