Flink on YARN HA mode setup problem
2022-07-16 07:33:00 【Vision-wang】
1. Problem description
Hadoop version: 2.6.5; Flink version: 1.11.6
Previously, the Standalone cluster was built and started without any problem, and building plain Flink on YARN also worked fine. But building Flink on YARN in HA mode failed.
2. Flink on YARN HA configuration
1. Add the following configuration to the yarn-site.xml file:
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>10</value>
</property>
This is the number of times YARN will automatically retry the ApplicationMaster after a task submitted to the ResourceManager fails.
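As a quick local sanity check (the /tmp path below is a hypothetical stand-in, not the real Hadoop config directory), you can write the fragment to a file and confirm the value reads back:

```shell
# Hypothetical stand-in file; the real property lives in $HADOOP_HOME/etc/hadoop/yarn-site.xml.
cat > /tmp/yarn-site-check.xml <<'EOF'
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>10</value>
</property>
EOF
# Print the property name line plus the value line that follows it.
grep -A1 'yarn.resourcemanager.am.max-attempts' /tmp/yarn-site-check.xml
```

After editing the real yarn-site.xml, the ResourceManager generally needs a restart for the new attempt limit to take effect.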
2. Modify the flink-conf.yaml file under FLINK_HOME/conf:
Be sure to read through to the end: this configuration turns out to have a problem!
high-availability: zookeeper
high-availability.storageDir: hdfs://node02:9000/flink/ha/    # node02 is what I'm using here
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
3. Download the Hadoop support plugin and copy it to the FLINK_HOME/lib directory on each node. This version of Flink no longer ships the jar for interacting with Hadoop by default, so you have to add it yourself.
Download address:
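The copy step might look like the sketch below. The paths are hypothetical stand-ins (/tmp instead of the real download location and FLINK_HOME); on a real cluster you would scp the jar to FLINK_HOME/lib on every node.

```shell
# Hypothetical stand-in paths; the jar name matches the one that appears in the stack traces.
JAR=flink-shaded-hadoop-2-uber-2.6.5-10.0.jar
FLINK_HOME=/tmp/flink-demo
mkdir -p "$FLINK_HOME/lib"
touch "/tmp/$JAR"                  # stands in for the downloaded plugin jar
cp "/tmp/$JAR" "$FLINK_HOME/lib/"  # on a real cluster: scp "/tmp/$JAR" nodeXX:"$FLINK_HOME/lib/"
ls "$FLINK_HOME/lib"
```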
3. Walkthrough of the troubleshooting process
First, check the command-line output:
2022-07-05 20:43:01,918 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli [] - Error while running the Flink session.
org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.YarnClusterDescriptor.deploySessionCluster(YarnClusterDescriptor.java:392) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:636) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$4(FlinkYarnSessionCli.java:895) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:895) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1657020068308_0004 failed 2 times due to AM Container for appattempt_1657020068308_0004_000004 exited with exitCode: 1
For more detailed output, check application tracking page:http://node05:8088/proxy/application_1657020068308_0004/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1657020068308_0004_04_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is the error from the command line. It roughly says that the container failed to start, but it does not clearly describe why; it only says that detailed information can be found at http://node05:8088/proxy/application_1657020068308_0004/
View the container log
That page gave no useful information either, just the same as the command line. But you can see that the container was retried several times on startup, so the log of each attempt can show why it failed each time.
Check the retry log

There is an ERROR here, and I instinctively checked this log first:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1657020068308_0004/filecache/15/log4j-slf4j-impl-2.16.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/bigdata/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
What this describes is an SLF4J binding version conflict. I located the two jars and went through a series of replace and rename operations, after which this error was gone, but the cluster still did not come up!
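The replace and rename step can be sketched as follows. This is a hypothetical local sketch: /tmp/demo-hadoop-lib stands in for $HADOOP_HOME/share/hadoop/common/lib, where one of the two bindings reported above lives. Renaming the jar so it no longer ends in .jar takes it off the classpath.

```shell
# Hypothetical stand-in for $HADOOP_HOME/share/hadoop/common/lib.
HADOOP_LIB=/tmp/demo-hadoop-lib
mkdir -p "$HADOOP_LIB"
touch "$HADOOP_LIB/slf4j-log4j12-1.7.5.jar"   # the conflicting binding named in the log above
# Rename it out of the classpath; keep the .bak so the change is easy to revert.
mv "$HADOOP_LIB/slf4j-log4j12-1.7.5.jar" "$HADOOP_LIB/slf4j-log4j12-1.7.5.jar.bak"
ls "$HADOOP_LIB"
```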
It took me a long time to realize that I had missed a log. That was careless: instinctive reactions can easily lead us astray.
View the full jobmanager.log
The full log is too long to show here, so only the key part is quoted; on the surface nothing looks wrong:
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnSessionClusterEntrypoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:577) [flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:82) [flink-dist_2.11-1.11.6.jar:1.11.6]
Caused by: java.net.ConnectException: Call From node04/192.168.76.96 to node03:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_152]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_152]
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1474) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.Client.call(Client.java:1401) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy26.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_152]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_152]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_152]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_152]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at com.sun.proxy.$Proxy27.mkdirs(Unknown Source) ~[?:?]
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2742) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2713) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:172) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.FileSystemBlobStore.<init>(FileSystemBlobStore.java:64) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createFileSystemBlobStore(BlobUtils.java:98) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.blob.BlobUtils.createBlobStoreFromConfig(BlobUtils.java:76) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.highavailability.HighAvailabilityServicesUtils.createHighAvailabilityServices(HighAvailabilityServicesUtils.java:115) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.createHaServices(ClusterEntrypoint.java:335) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:293) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:223) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:177) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_152]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_152]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) ~[flink-shaded-hadoop-2-uber-2.6.5-10.0.jar:2.6.5-10.0]
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:174) ~[flink-dist_2.11-1.11.6.jar:1.11.6]
... 2 more
Caused by: java.net.ConnectException: Connection refused
Found the problem!! It is caused by a failure to connect to HDFS, which traces back to the HDFS address (along with the ZK address, etc.) that we set in the HA configuration.
3. Solution
high-availability: zookeeper
high-availability.storageDir: hdfs://mycluster/flink/ha/
Note this line: with HDFS in HA mode, you cannot reach HDFS through a fixed IP + port.
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
Because the following configuration has already been made in hdfs-site.xml, the address must be written as mycluster; the cluster then automatically resolves the address of the active NameNode in HDFS for us.
<!-- Logical nameservice for the NameNodes, resolved via key-value mapping -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<!-- Concrete addresses of the NameNodes -->
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>node01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>node02:8020</value>
</property>
4. Summary
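As a local sanity check on the corrected keys (the /tmp path is a hypothetical stand-in for FLINK_HOME/conf/flink-conf.yaml), confirm all three HA keys are present and that the storage directory uses the mycluster nameservice rather than a fixed host and port:

```shell
# Hypothetical stand-in for FLINK_HOME/conf/flink-conf.yaml.
cat > /tmp/flink-conf-ha-check.yaml <<'EOF'
high-availability: zookeeper
high-availability.storageDir: hdfs://mycluster/flink/ha/
high-availability.zookeeper.quorum: node02:2181,node03:2181,node04:2181
EOF
grep -c '^high-availability' /tmp/flink-conf-ha-check.yaml   # count of HA keys
grep 'storageDir' /tmp/flink-conf-ha-check.yaml              # should show mycluster, not host:9000
```

On the real cluster, a command such as hdfs dfs -ls hdfs://mycluster/ succeeding is a good sign that the nameservice resolves correctly from Flink's side.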
There are many cluster configuration files and many logs, so troubleshooting is complicated. Don't give up: read slowly and you will always find the cause. This is especially true when learning to build clusters yourself, since many configuration guides online are wrong, or may simply conflict with the HDFS and YARN cluster configuration you built earlier. Learning to troubleshoot all the components as a whole is itself great progress!!!