
Common Spark Tuning Configuration Parameters

2022-06-25 11:21:00 pyiran

Recently I came across a good blog post about Spark memory tuning and would like to share it:
https://idk.dev/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The post mainly proposes several Spark memory tuning methods (summarized for Amazon EMR, but in my opinion they generalize well). I won't translate the whole thing here; instead, I'll give a summary based on my own experience, and you can read the original post yourself.

Basic Spark tuning configuration parameters

The following parameters are the most basic job tuning parameters; only after they are set properly does further optimization make sense. A minimal sketch of setting them programmatically follows the table below.

| Property name | Default | Meaning |
| --- | --- | --- |
| spark.executor.memory | 1g | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g). |
| spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m, 2g). Note: in client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-memory command line option or in your default properties file. |
| spark.executor.cores | 1 in YARN mode; all the available cores on the worker in standalone and Mesos coarse-grained modes | The number of cores to use on each executor. |
| spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
| spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD; for operations like parallelize with no parent RDDs, it depends on the cluster manager | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by the user. |
| spark.executor.instances | 2 in YARN mode | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
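
As a minimal Scala sketch (my own illustration, not from the original post; the resource values are placeholders, not recommendations), these parameters can be set on a SparkConf before building the session:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder values for illustration only; size them for your own cluster.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")      // heap per executor
  .set("spark.executor.cores", "2")        // cores per executor
  .set("spark.executor.instances", "10")   // executors under static allocation
  .set("spark.default.parallelism", "80")  // default partition count for shuffles
// spark.driver.memory and spark.driver.cores only take effect here in cluster mode;
// in client mode pass them on the spark-submit command line (e.g. --driver-memory).

val spark = SparkSession.builder()
  .config(conf)
  .appName("tuning-example")
  .getOrCreate()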

Setting timeouts and retries

Sometimes you will see keywords like timeout, lost node, or lost connection in the logs. In that case, consider raising the relevant timeout thresholds and adding some retries.
The most common parameters are listed below, followed by a short example of bumping them.

| Property name | Default | Meaning |
| --- | --- | --- |
| spark.network.timeout | 120s | Default timeout for all network interactions. This config will be used in place of spark.core.connection.ack.wait.timeout, spark.storage.blockManagerSlaveTimeoutMs, spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or spark.rpc.lookupTimeout if they are not configured. |
| spark.executor.heartbeatInterval | 10s | Interval between each executor's heartbeats to the driver. Heartbeats let the driver know that the executor is still alive and update it with metrics for in-progress tasks. spark.executor.heartbeatInterval should be significantly less than spark.network.timeout. |
| spark.task.maxFailures | 4 | Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1. |
| spark.shuffle.io.maxRetries | 3 | (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues. |
| spark.yarn.maxAppAttempts | yarn.resourcemanager.am.max-attempts in YARN | The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration. |
| spark.rpc.numRetries | 3 | Number of times to retry before an RPC task gives up. An RPC task will run at most this number of times. |
| spark.rpc.retry.wait | 3s | Duration for an RPC ask operation to wait before retrying. |
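
An illustrative Scala sketch of raising these values (the numbers are placeholders, not recommendations):

import org.apache.spark.SparkConf

// Placeholder values; tune them to your own workload and cluster.
val conf = new SparkConf()
  .set("spark.network.timeout", "300s")            // raise the global network timeout
  .set("spark.executor.heartbeatInterval", "30s")  // keep well below spark.network.timeout
  .set("spark.task.maxFailures", "8")              // allow more attempts per task
  .set("spark.shuffle.io.maxRetries", "6")         // retry shuffle fetches on IO errors
  .set("spark.rpc.numRetries", "5")
  .set("spark.rpc.retry.wait", "5s")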

Setting JVM parameters to help tuning

All JVM parameters can be set through spark.executor.extraJavaOptions / spark.driver.extraJavaOptions. In practice, when a job suddenly runs very slowly, you can check the resource allocation on YARN on one hand, and on the other hand check whether a lot of time is being spent on GC.
The example given here switches to the G1 collector for GC and prints some GC logs; a note on how to pass these options follows the example.

"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC",

spark.dynamicAllocation.enabled

This is a Spark feature that you would typically enable only after your Spark job is already well tuned and you want better resource utilization of your cluster; it is an advanced tuning option.
I will cover this feature in detail in a separate post later.
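
Although the details are left for that later post, here is a minimal sketch (my own, not the author's recipe) of what enabling it typically looks like; on YARN it traditionally also requires the external shuffle service, and the executor bounds below are placeholders:

import org.apache.spark.SparkConf

// Placeholder bounds; the external shuffle service lets executors be removed
// without losing their shuffle files.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.initialExecutors", "5")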


Copyright notice: this article was written by [pyiran]. Please include a link to the original when reposting:
https://yzsam.com/2022/02/202202200540009936.html