Through the fog: notes on locating a Flink crash with multi-component timeouts
2022-06-24 04:06:00 【KyleMeow】
Symptoms
Last Thursday afternoon, the alerting system suddenly reported that a key customer's job was crashing and restarting frequently. The pattern was that the job would run for about 2 minutes, then the JobManager would detect that a TaskManager's heartbeat had been lost, and the job would crash and restart, severely affecting the customer's online business.
Looking at the logs of the disconnected TaskManager, we found a large number of ZooKeeper connection timeout errors, and the subsequent retries also failed, so Flink concluded that a serious exception had occurred and proactively shut the TaskManager down.
Initial diagnosis
Since the job had crashed more than once, we checked the logs of the previous runs and saw plenty of ZooKeeper connection timeouts and errors there as well, so we started the investigation from the ZooKeeper server side.
The investigation showed that the ZooKeeper servers were perfectly healthy: there were no error logs and all metrics looked normal. Moreover, if the ZooKeeper servers had really failed, other jobs in the same cluster would most likely have been affected too, yet no errors were observed in any other job, so a server-side ZooKeeper failure was very unlikely.
Next, we analyzed the customer's JAR package and found that it did not bundle any incompatible ZooKeeper or Curator client libraries, which basically ruled out issues such as client version conflicts.
So the question was: what could cause the ZooKeeper connections to time out, and the retries to keep failing for so long?
Collecting more error information
Because the job was still crashing and restarting, the subsequent runs brought some new findings: Kafka and the Prometheus metrics-reporting path were timing out as well.
These errors pointed to a potential network problem: a failing network card on the host running the container, or widespread packet loss and congestion, could all produce such errors.
The strange thing, however, was that the failing Pods were spread across different host nodes, other Pods on those hosts were running normally, and the traffic metrics of every host were within the normal range. It did not look like a problem caused by a few faulty nodes.
So the network-failure hypothesis was rejected as well.
The direct cause emerges
Since an environmental problem was now unlikely, we turned to the job itself. Looking at the Full GC frequency of the TaskManagers, the numbers were obviously abnormal.
Under normal circumstances, the old-generation GC count of a TaskManager should be in the single or low double digits, but here it had reached the thousands. This indicates extremely high memory pressure, with almost no garbage reclaimable on each collection, so the JVM keeps doing GC.
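If the platform's monitoring charts are not at hand, the same counters can be cross-checked with jstat against the TaskManager PID, or from code running inside the affected JVM. Below is a minimal sketch using the standard GarbageCollectorMXBean API; the class name GcStats is just an illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Print the cumulative collection count and time of every collector
        // (young and old generation) of the JVM this code runs in.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```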
And as we know, when a GC happens the JVM has a stop-the-world pause during which all application threads are suspended. If the JVM keeps doing GC, normal threads are severely affected; eventually heartbeat packets cannot be sent out, or connections cannot be kept alive and time out.
So the question again: what in the job was creating so much memory pressure?
In-depth analysis
Since the direct cause of the problem was excessive heap memory pressure that GC could not relieve, a memory leak was the likely culprit. The classic memory-leak scenario is keeping too many objects in containers such as a List or Map: these objects are strongly referenced and can never be collected, yet they keep occupying heap space.
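For illustration, a typical shape of such a leak is a cache keyed by an ever-growing value that is only ever written to and never cleaned. A minimal hypothetical sketch (the class and field names are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class LeakyCache {
    // Held by a static field, so every entry stays strongly referenced forever.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    public static void onEvent(String eventId, byte[] payload) {
        // Each event adds an entry but nothing is ever removed:
        // the map grows until the heap fills up and the JVM is stuck in Full GC.
        CACHE.put(eventId, payload);
    }
}
```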
Since the job crashed frequently and the problem kept recurring, we entered the Pod when the problem occurred, took a heap dump (for example with the jmap command that ships with the JDK), and then analyzed the dump file.
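As a side note, if running jmap inside the Pod is inconvenient, a dump can also be triggered programmatically by code running inside the affected JVM (for example behind a debug-only switch). A minimal sketch using the HotSpot diagnostic MXBean, where the class name and output path are just examples:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public final class HeapDumper {
    // Writes an HPROF dump of the current JVM; "live = true" dumps only
    // reachable objects and forces a full GC first.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, true);  // e.g. "/tmp/taskmanager-heap.hprof"
    }
}
```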
The analysis showed that a single HashMap occupied about 96% of the space, essentially filling the whole heap. After communicating with the customer and being authorized to review the program source code, we found that this map was a cache pool implemented by the user: as data keeps flowing in and the Watermark advances, the contents of the cache are replaced dynamically. When data flows in too fast and stale cache entries are not cleaned up in time, this puts enormous pressure on GC.
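One common mitigation for such a hand-rolled cache is to bound its size with an eviction policy. A minimal sketch based on LinkedHashMap in access order; the capacity is something you would tune for the actual workload:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A size-bounded LRU cache: once the capacity is exceeded the eldest
// (least recently accessed) entry is evicted, so the map cannot grow forever.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

For example, new BoundedCache<String, byte[]>(10_000) keeps at most 10,000 entries regardless of how fast data flows in.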
It is not only ordinary container objects that have this problem. Flink's own managed state (MapState, ListState, etc.) can also become extraordinarily large when the Watermark advances too slowly, the windows are too large, or multiple streams in a JOIN are hard to align. Without safeguards such as a State TTL, this can also destabilize the JVM (especially with the Heap state backend). So when programming Flink jobs, be very careful with operations whose state may build up a large backlog.
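If the backlog lives in Flink managed state, a State TTL bounds how long each entry can survive. A minimal sketch using the StateTtlConfig API (available since Flink 1.6); the one-hour TTL and the descriptor name are illustrative:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

public class TtlStateExample {
    // Descriptor for a keyed MapState whose entries expire one hour after
    // creation or last update; expired entries are never returned to user code.
    public static MapStateDescriptor<String, Long> buildCacheDescriptor() {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        MapStateDescriptor<String, Long> descriptor =
                new MapStateDescriptor<>("user-cache", String.class, Long.class);
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}
```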
If the total amount of state is hard to reduce because of the business logic, we recommend using the RocksDB state backend (which is also the current default choice of the Tencent Cloud Oceanus platform). Of course, compared with the Heap state backend, the RocksDB state backend brings higher processing latency and lower throughput, so the choice should be made according to the actual scenario.
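A minimal sketch of switching a job to the RocksDB state backend in code (Flink 1.13+ API; older versions use RocksDBStateBackend instead, and the checkpoint path below is just an example):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep keyed state in RocksDB (off-heap, spilling to local disk)
        // instead of on the JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Checkpoints still need a durable storage location (path is an example).
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        // ... build the actual pipeline and call env.execute() here ...
    }
}
```

The same switch can usually be made without code changes by setting state.backend: rocksdb in the cluster configuration.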
Summary and reflections
In retrospect, locating this problem took a detour. The initial alert arrived in the form of logs, and because several consecutive failed instances were full of ZooKeeper errors, we took it for granted that the ZooKeeper-related components should be the focus. Later, on finding that other components were also timing out, we shifted the focus to a network failure, and only after looking at the monitoring did we realize that the pauses were caused by GC. If we had looked at the Flink monitoring data before starting to locate the problem, the root cause would have been found much more easily.
Therefore, when locating a problem, we must comprehensively analyze metrics, logs, environment data and so on, and first distinguish which errors and exceptions are the direct cause (usually the one that happened first) and which are indirect, secondary failures. If we are drawn to the latter from the very beginning, we are likely to miss the forest for the trees and spend a long time analyzing the wrong thing.
For example, errors commonly seen in Flink logs such as IllegalStateException: Buffer pool is destroyed or InterruptedException usually appear only after some other problem has already occurred. We can ignore them and keep searching earlier in the logs until the root cause is found.