
Common Spark Interview Questions

2022-07-23 11:14:00 【I'm a girl, I don't program yuan】

Contents

Data skew
Spark runtime architecture
Wide and narrow dependencies

Data skew

What is data skew
In a parallel big data system, data skew means that one partition holds significantly more data than the others, so processing that partition becomes the bottleneck for the whole dataset.
Why data skew occurs
Tasks within the same stage process significantly different amounts of data: some tasks handle far more data than the others.
How to solve data skew
① Increase the parallelism of shuffle operations
With too few tasks, many keys are assigned to the same task and the data is distributed unevenly. Increasing the number of tasks appropriately spreads the keys out. (This does not help when a single key has far more data than the other keys.)
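The effect of raising parallelism, and its limit, can be sketched in plain Python. This is a simulation rather than Spark code: the `partition_sizes` helper, the CRC32-based partitioner, and the sample keys are all assumptions made for illustration.

```python
import zlib

def partition_sizes(keys, num_partitions):
    """Count records per partition under a simple hash partitioner."""
    sizes = [0] * num_partitions
    for key in keys:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        sizes[zlib.crc32(key.encode()) % num_partitions] += 1
    return sizes

# Hypothetical data: 100 distinct keys with 10 records each.
uniform_keys = [f"key{i}" for i in range(100) for _ in range(10)]
# The same data plus one hot key with 5000 records.
hot_keys = uniform_keys + ["hello"] * 5000

few = partition_sizes(uniform_keys, 4)
many = partition_sizes(uniform_keys, 16)
hot_few = partition_sizes(hot_keys, 4)
hot_many = partition_sizes(hot_keys, 16)

# More partitions shrink the largest partition when keys are spread out...
print("uniform keys, largest partition:", max(few), "->", max(many))
# ...but every record of the hot key still lands in a single partition.
print("hot key, largest partition:", max(hot_few), "->", max(hot_many))
```

Because all records with the same key hash to the same partition, the largest partition in the hot-key case never drops below the 5000 records of `hello`, no matter how many partitions are used.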

② Two-stage aggregation (add a random prefix or suffix to the key)
If one key has too much data, for example the key 'hello', add a random prefix to it so that it becomes 0_hello, 1_hello, 2_hello, ...
The prefixed keys are distributed to different tasks and aggregated locally. The prefix is then removed and a global aggregation is performed, which resolves the skew caused by a single key having far more data than the others.
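A minimal sketch of the two stages, again as a plain-Python simulation rather than Spark code. The function name `two_stage_count`, the number of salts, and the sample data are assumptions for illustration; in Spark the same idea would be expressed with two consecutive aggregations.

```python
import random
from collections import Counter

def two_stage_count(keys, num_salts, seed=0):
    """Count records per key using a salted local stage and a global stage."""
    rng = random.Random(seed)
    # Stage 1: prefix each key with a random salt and aggregate per salted key.
    salted = Counter(f"{rng.randrange(num_salts)}_{key}" for key in keys)
    # Stage 2: strip the salt and aggregate the partial counts globally.
    total = Counter()
    for salted_key, count in salted.items():
        _, original_key = salted_key.split("_", 1)
        total[original_key] += count
    return salted, total

# Hypothetical skewed data: 'hello' dominates.
keys = ["hello"] * 1000 + ["world"] * 10
salted, total = two_stage_count(keys, num_salts=4)

print(dict(total))
# The hot key is now split into several smaller groups in stage 1.
print({k: v for k, v in salted.items() if k.endswith("_hello")})
```

The first stage turns one large group into up to `num_salts` smaller groups that can be processed by different tasks; the second stage only has to merge a handful of partial results per key.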
③ Replace reduce join with map join to eliminate the shuffle and the data skew it causes
reduce join: suitable for joining two large tables. The map stage keeps all the records of both tables and uses the join column as the key; after the data is distributed by key, the join is performed in the shuffle and reduce stages.
In a reduce join the map stage does not reduce the data volume, and the reduce stage combines the matching rows of the two tables (a per-key cross product), so memory consumption is high.
map join: suitable for joining a small table with a large table. All the data of the small table is loaded into the memory of each node and the join is done in the map stage, so no reduce operation is needed afterwards.
A map join requires one table to be small enough for the join to be performed in memory.
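The idea behind a map join can be sketched in plain Python. This is a simulation, not Spark code: the `map_join` helper and the sample tables are hypothetical, and the small table is assumed to have unique keys.

```python
def map_join(big_partitions, small_table):
    """Inner-join each partition of the big table against an in-memory small table."""
    # Stands in for the copy of the small table held in memory on every node.
    lookup = dict(small_table)
    joined = []
    # Each partition is processed on its own; no records are moved by key.
    for partition in big_partitions:
        for key, big_value in partition:
            if key in lookup:
                joined.append((key, big_value, lookup[key]))
    return joined

# Hypothetical tables: a partitioned big table and a small lookup table.
orders = [
    [(1, "order-a"), (2, "order-b")],   # partition 0
    [(1, "order-c"), (3, "order-d")],   # partition 1
]
companies = [(1, "Acme"), (2, "Globex")]

result = map_join(orders, companies)
print(result)
```

Since every partition of the big table is joined where it already sits, a hot key in the big table does not pile up in one reduce task.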
④ Remove the key with too much data. For example, if company_id=0 corresponds to hundreds of millions of records while every other value has only single-digit counts, the node that receives company_id=0 runs out of memory and the program crashes. Removing this outlier lets the job run quickly.

Spark runtime architecture

A Spark application consists of a driver and executors (both are processes), which are responsible for using resources.
Driver node (driver):
The process that runs the main() method. It creates the SparkContext object, divides the user program into multiple stages (run serially), breaks each stage down into multiple tasks (run in parallel), assigns the tasks to the executors, and schedules their execution.
Once the driver terminates, the Spark application ends.
Executor node (executor):
Runs tasks, caches RDDs in the executor process, and returns results to the driver process.
Executors are independent of each other. If one executor node crashes, the Spark application can continue to run.
The master node (master) and worker nodes (worker) are physical nodes, responsible for resource management and allocation.
One machine can act as both master and worker.
Master node (master):
Manages the worker nodes. The driver requests the resources for its executors from the master.
Worker node (worker):
Manages the executor processes. By default one worker launches one executor; multiple executors per worker can also be configured.

Wide and narrow dependencies

1. Narrow dependency
In a narrow dependency, each partition of the parent RDD is used by only one partition of the child RDD. A child RDD partition usually depends on a constant number of parent RDD partitions.
2. Wide dependency
In a wide dependency, each partition of the parent RDD may be used by multiple child RDD partitions. A child RDD partition usually depends on all partitions of the parent RDD.

3. Differences between wide and narrow dependencies
A wide dependency usually corresponds to a shuffle: the data of one parent RDD partition has to be sent to different child RDD partitions, which may involve transferring data between nodes. In a narrow dependency, each parent RDD partition is passed to a single child RDD partition, which can usually be done within one node.
When an RDD partition is lost under a narrow dependency, each parent RDD partition corresponds to only one child RDD partition, so only the parent partitions of the lost partition need to be recomputed. 100% of the recomputed data is used.
When an RDD partition is lost under a wide dependency, only part of the data in each recomputed parent partition belongs to the lost child partition; the rest is redundant computation. Because a child partition in a wide dependency usually draws from multiple parent partitions, in the extreme case all parent RDD partitions have to be recomputed.
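The recovery cost described above can be illustrated with a small plain-Python model. It is not Spark code: the dependency maps, the helper names, and the assumption that each parent partition spreads its data evenly over its children are all hypothetical.

```python
def parents_to_recompute(dependencies, lost_child):
    """Return the parent partitions that feed the lost child partition."""
    return sorted(p for p, children in dependencies.items() if lost_child in children)

def useful_fraction(dependencies, lost_child):
    """Share of the recomputed parent data that the lost child actually needs,
    assuming each parent splits its data evenly across its children."""
    parents = parents_to_recompute(dependencies, lost_child)
    return sum(1 / len(dependencies[p]) for p in parents) / len(parents)

# parent partition -> child partitions that read from it (4 partitions each)
narrow = {0: [0], 1: [1], 2: [2], 3: [3]}        # e.g. map or filter
wide = {p: [0, 1, 2, 3] for p in range(4)}       # e.g. groupByKey

narrow_parents = parents_to_recompute(narrow, lost_child=2)
wide_parents = parents_to_recompute(wide, lost_child=2)

print("narrow:", narrow_parents, useful_fraction(narrow, 2))
print("wide:  ", wide_parents, useful_fraction(wide, 2))
```

In this model, losing one child partition under the narrow dependency recomputes a single parent partition and uses all of its output, while under the wide dependency every parent partition is recomputed and only a quarter of each one's output is needed.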
4. Corresponding functions
Functions with narrow dependencies include:
map, filter, union, join (parent RDDs are hash-partitioned), mapPartitions, mapValues
Functions with wide dependencies include:
groupByKey, join (parent RDDs are not hash-partitioned), partitionBy