Common Spark tuning configuration parameters
2022-06-25 11:21:00 【pyiran】
I recently came across a good blog post on Spark memory tuning, and I'd like to share it:
https://idk.dev/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The post proposes several Spark memory tuning methods. Although the summary is based on Amazon EMR, I think it generalizes well to ordinary setups. I won't translate the whole article here; instead, I'll summarize from my own experience. You can read the original yourself.
Basic Spark tuning configuration parameters
The following are the most basic job tuning parameters. Only once these are set properly is further optimization worthwhile. A code sketch follows the table.
Property name | Default | Meaning |
---|---|---|
spark.executor.memory | 1g | Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m, 2g). |
spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix (“k”, “m”, “g” or “t”) (e.g. 512m, 2g). Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file. |
spark.executor.cores | 1 in YARN mode, all the available cores on the worker in standalone and Mesos coarse-grained modes. | The number of cores to use on each executor. In standalone and Mesos coarse-grained modes, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. |
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. |
spark.default.parallelism | For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager | Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user. |
spark.executor.instances | 2 in YARN mode | The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large. |
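To make this concrete, here is a minimal PySpark sketch of setting these basics when building the session. The app name and all values are illustrative assumptions, not recommendations; size them for your own cluster and workload.

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune to your cluster.
spark = (
    SparkSession.builder
    .appName("tuning-demo")  # hypothetical app name
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "10")
    .config("spark.default.parallelism", "40")  # often ~2x total executor cores
    # In client mode spark.driver.memory cannot be set here; pass
    # --driver-memory on the spark-submit command line instead.
    .getOrCreate()
)
```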
Setting timeouts and retries
Sometimes you will see keywords such as timeout, lost node, or lost connection in the logs. When that happens, consider raising the timeout thresholds and adding retries; a code sketch follows the table. The most common parameters:
Property name | Default | Meaning |
---|---|---|
spark.network.timeout | 120s | Default timeout for all network interactions. This config will be used in place of spark.core.connection.ack.wait.timeout, spark.storage.blockManagerSlaveTimeoutMs, spark.shuffle.io.connectionTimeout, spark.rpc.askTimeout or spark.rpc.lookupTimeout if they are not configured. |
spark.executor.heartbeatInterval | 10s | Interval between each executor’s heartbeats to the driver. Heartbeats let the driver know that the executor is still alive and update it with metrics for in-progress tasks. spark.executor.heartbeatInterval should be significantly less than spark.network.timeout |
spark.task.maxFailures | 4 | Number of failures of any particular task before giving up on the job. The total number of failures spread across different tasks will not cause the job to fail; a particular task has to fail this number of attempts. Should be greater than or equal to 1. Number of allowed retries = this value - 1. |
spark.shuffle.io.maxRetries | 3 | (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues. |
spark.yarn.maxAppAttempts | yarn.resourcemanager.am.max-attempts in YARN | The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration. |
spark.rpc.numRetries | 3 | Number of times to retry before an RPC task gives up. An RPC task will run at most this number of times. |
spark.rpc.retry.wait | 3s | Duration for an RPC ask operation to wait before retrying. |
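As a sketch of what this looks like in code, the snippet below raises the timeout and retry settings from the table; the values are illustrative, not recommendations. Keep spark.executor.heartbeatInterval well below spark.network.timeout.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("timeout-retry-demo")  # hypothetical app name
    .config("spark.network.timeout", "300s")            # global network timeout
    .config("spark.executor.heartbeatInterval", "30s")  # keep well below spark.network.timeout
    .config("spark.task.maxFailures", "8")              # tolerate more per-task failures
    .config("spark.shuffle.io.maxRetries", "6")         # more shuffle fetch retries
    .config("spark.rpc.numRetries", "5")                # more RPC retries
    .getOrCreate()
)
```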
Tuning with JVM parameters
All JVM parameters can be set through spark.executor.extraJavaOptions and spark.driver.extraJavaOptions. When a job suddenly runs very slowly, you can check the resource allocation on YARN on the one hand, and on the other hand check whether a large amount of time is being spent in GC.
The example given here switches the collector to G1 for GC and prints GC logs:
"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p' -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC",
spark.dynamicAllocation.enabled
This is a Spark feature that you generally enable as an advanced step, once your Spark job is already well tuned and you want the cluster to achieve better resource utilization. I will cover the feature in detail in a separate post later.
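Until then, here is a minimal configuration sketch, assuming YARN with the external shuffle service available (dynamic allocation normally requires it); the executor bounds are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")   # usually required on YARN
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```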