After the active RM machine loses power, RM HA failover succeeds, but the YARN UI shows no cluster resources and the application stays in the ACCEPTED state
2022-06-22 04:05:00 【龟速扣代码】
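Symptoms:
1. The machine hosting the active RM lost power. Ambari showed the RM HA failover completing after about 1 minute, i.e. the standby RM became the active RM.
2. The cluster information on the YARN UI showed 0 for both memory and cores, although Ambari showed every machine in the cluster running normally.
3. The application stayed in the ACCEPTED state on the YARN UI, yet new data kept being written to the database.
4. After about 15 minutes, the application switched to the RUNNING state and the YARN UI displayed the cluster resources correctly again.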

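Troubleshooting:
First, determine whether the problem is caused by the application or by YARN.
Judging from symptom 3, the application processes and stores data normally, so the application can be ruled out for now. The RM failover succeeded, but the cluster resources were not displayed, which suggests a communication problem between the RM and the NMs.
Looking up the RM/NM communication parameters in the official documentation and setting them did not solve the problem:
yarn.nodemanager.resourcemanager.connect.max-wait.ms 900000 (default)
yarn.resourcemanager.connect.max-wait.ms 900000
yarn.resourcemanager.container.liveness-monitor.interval-ms 600000
Next, check the RM and NM logs at DEBUG level. They contain an IPC-related exception logged at INFO, but it is not clear whether it is an actual error. The application log, however, shows a socket connection timeout, so the IPC- and socket-related parameters are the next place to look.
RM log: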
2019-12-26 04:31:16,736 INFO amlauncher.AMLauncher (AMLauncher.java:run(320)) - Error cleaning master
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy97.stopContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:143)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:318)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy96.stopContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:142)
... 15 more
The application log reports a socket connection timeout, with a timeout of 20000 ms:
2019-12-26 09:23:47,051 DEBUG [main] RetryInvocationHandler:413 - org.apache.hadoop.net.ConnectTimeoutException: Call From xxxxxxxx-xxx-lab-vm-hdp-04/10.0.1.7 to xxxxxxxx-xxx-lab-vm-hdp-02:8030 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxxxxxx-xxx-lab-vm-hdp-02/10.0.1.5:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout, while invoking ApplicationMasterProtocolPBClientImpl.registerApplicationMaster over rm1. Trying to failover immediately.
org.apache.hadoop.net.ConnectTimeoutException: Call From xxxxxxxx-xxx-lab-vm-hdp-04/10.0.1.7 to xxxxxxxx-xxx-lab-vm-hdp-02:8030 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxxxxxx-xxx-lab-vm-hdp-02/10.0.1.5:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
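at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
Looking through the IPC-related parameters in the official Hadoop documentation, core-default.xml lists two that fit the symptoms:
ipc.client.connect.timeout 20000 (how long the client waits for the socket connection to be established, in ms)
ipc.client.connect.max.retries.on.timeouts 45 (how many times the client retries after a socket connection timeout)
After lowering both values, the YARN UI behaved as expected, which confirms that these two parameters were responsible. Along the way, similar problem descriptions turned up in the community: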
https://issues.apache.org/jira/browse/HADOOP-11252
https://issues.apache.org/jira/browse/YARN-2578
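The explanation given there is that when ipc.client.rpc-timeout.ms is set to 0, the network connection never times out at the RPC level, so recovery falls back to TCP-level connection retries.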
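The roughly 15-minute wait seen in the symptoms lines up with the two IPC defaults quoted above. The following is a back-of-the-envelope sketch, not Hadoop code: the function name is hypothetical, and it assumes every connection attempt waits out the full connect timeout, with retries running back to back and no extra sleep or failover time in between. The reduced values at the end are purely illustrative; the post does not say which values were actually used.

```python
def worst_case_connect_wait_ms(connect_timeout_ms: int, max_retries_on_timeouts: int) -> int:
    """Rough upper bound on how long a client keeps retrying a dead address.

    Assumption: each attempt blocks for the full connect timeout and the
    retries follow one another immediately.
    """
    return connect_timeout_ms * max_retries_on_timeouts


# Defaults quoted in the post:
# ipc.client.connect.timeout = 20000, ipc.client.connect.max.retries.on.timeouts = 45
default_wait_ms = worst_case_connect_wait_ms(20_000, 45)
print(default_wait_ms, "ms =", default_wait_ms / 60_000, "minutes")

# Hypothetical reduced values, chosen only to show the effect of lowering both settings
reduced_wait_ms = worst_case_connect_wait_ms(5_000, 3)
print(reduced_wait_ms, "ms =", reduced_wait_ms / 1_000, "seconds")
```

With the defaults the estimate is 900000 ms, i.e. 15 minutes, which is about how long the application stayed in ACCEPTED before recovering.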
Problem solved.