Spark Learning: how to choose different association forms and mechanisms?
2022-07-24 09:35:00 【I love evening primrose a】
Spark SQL supports many kinds of join. This article looks at them from two angles: the association form (join type) and the implementation mechanism.
I. Data preparation
eg:
// Assumes an existing SparkSession named `spark`, as in spark-shell
import spark.implicits._
import org.apache.spark.sql.DataFrame
// Create an employee table
val seq = Seq((1,"li",20,"Male"),(2,"shi",22,"Female"),(3,"ming",24,"Female"))
val employees:DataFrame = seq.toDF("id","name","age","gender")
// Create salary table
val seq2 = Seq((1,20000),(2,26000),(3,24000),(4,30000))
val salaries:DataFrame = seq2.toDF("id","salary")
II. Association form (join type)
| Join type | Keyword |
|---|---|
| Inner join | inner |
| Left outer join | left/leftouter/left_outer |
| Right outer join | right/rightouter/right_outer |
| Full outer join | outer/full/fullouter/full_outer |
| Left semi join | leftsemi/left_semi |
| Left anti join | leftanti/left_anti |
Inner join, left outer join, right outer join and full outer join behave the same as in SQL, so they are not described further here.
1. Left semi join
The result set of a left semi join is a subset of the inner join result: it keeps only the records of the left table that satisfy the join condition.
eg:
val leftsemijoinDF:DataFrame = salaries.join(employees,salaries("id") === employees("id"),"leftsemi")
leftsemijoinDF.show
In the result, the three salaries records with ids 1, 2 and 3 match records in employees. A left semi join keeps only the left-table rows that have a match in the right table, and the output contains only the left table's columns.
2. Left anti join
Once the left semi join is understood, the left anti join is easy to follow: it also keeps only data from the left table, but only the rows that have no match. Here that is the salaries record with id 4.
eg:
val leftantijoinDF:DataFrame = salaries.join(employees,salaries("id") === employees("id"),"leftanti")
leftantijoinDF.show
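The semantics of the two joins above can be sketched without Spark, using plain Scala collections and the same sample rows. This is only an illustration of which rows survive, not how Spark executes the join; the object name `SemiAntiSketch` and the tuple representation are assumptions made for the example.

```scala
object SemiAntiSketch {
  // Same sample data as the article, as plain tuples instead of DataFrames.
  val employees: Seq[(Int, String, Int, String)] =
    Seq((1, "li", 20, "Male"), (2, "shi", 22, "Female"), (3, "ming", 24, "Female"))
  val salaries: Seq[(Int, Int)] =
    Seq((1, 20000), (2, 26000), (3, 24000), (4, 30000))

  // Ids present in the right-hand table (employees).
  private val employeeIds: Set[Int] = employees.map(_._1).toSet

  // Left semi: keep left rows that have a match; only left columns survive.
  val leftSemi: Seq[(Int, Int)] =
    salaries.filter { case (id, _) => employeeIds.contains(id) }

  // Left anti: keep left rows that have no match.
  val leftAnti: Seq[(Int, Int)] =
    salaries.filterNot { case (id, _) => employeeIds.contains(id) }

  def main(args: Array[String]): Unit = {
    println(leftSemi)
    println(leftAnti)
  }
}
```

With this data, `leftSemi` holds the salary rows for ids 1, 2 and 3, and `leftAnti` holds only the row for id 4.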
III. Implementation mechanism
1. NLJ (Nested Loop Join)
- For the two tables in a join, such as salaries and employees, the table that appears first in the code (salaries) is usually called the driving table and the other (employees) the base table. Under NLJ, the algorithm uses two nested for loops, an outer one and an inner one, to scan the records of the driving table and the base table in turn, checking the join condition as it scans. In the example above, the outer loop scans the driving table first. When it reaches the record with id 1, the inner loop scans the base table; when the inner loop also reaches id 1, the two records match and are output. The process continues in the same way, checking the id field of each pair of records against the condition.
Suppose the driving table has m records and the base table has n records. The computational complexity of NLJ is then O(mn), so its execution efficiency is poor.
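The two nested loops described above can be sketched in plain Scala as follows. This is a minimal single-machine illustration, not Spark's implementation; `NestedLoopJoinSketch`, `nestedLoopJoin` and the `cond` parameter are names chosen for the example.

```scala
object NestedLoopJoinSketch {
  // Generic nested loop join: `cond` may be any predicate, not just equality.
  def nestedLoopJoin[L, R](driving: Seq[L], base: Seq[R])(cond: (L, R) => Boolean): Seq[(L, R)] =
    for {
      l <- driving   // outer loop over the driving table
      r <- base      // inner loop over the base table
      if cond(l, r)  // the join condition is checked for every pair: O(m * n)
    } yield (l, r)

  def main(args: Array[String]): Unit = {
    val salaries = Seq((1, 20000), (2, 26000), (3, 24000), (4, 30000))
    val employees = Seq((1, "li"), (2, "shi"), (3, "ming"))
    // Equi-join on id.
    nestedLoopJoin(salaries, employees)((s, e) => s._1 == e._1).foreach(println)
  }
}
```

Because the condition is an arbitrary function evaluated on every pair, the same loop also works for conditions that are not equality tests.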
2. SMJ (Sort Merge Join)
- SMJ first sorts both tables on the join key, then uses a separate cursor on each sorted table to merge and join them. The specific process:
At the start, the cursors of the driving table and the base table each point at their table's first record. The id values of the records under the two cursors are then compared to decide the next step:
a. The join condition is satisfied, that is, the id values on both sides are equal. The records on both sides are concatenated and output, and the cursor of the driving table moves to the next record.
b. The join condition is not satisfied and the driving table's id value is less than the base table's id value. The cursor of the driving table moves to the next record.
c. The join condition is not satisfied and the driving table's id value is greater than the base table's id value. The cursor of the base table moves to the next record.
The merge step of SMJ therefore has a computational complexity of O(m+n). This reduction depends on both tables being sorted first, and sorting is itself a time-consuming operation; if the data is already ordered, for example through an index, that cost can be avoided.
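The cursor rules a, b and c above can be sketched in plain Scala. This is an illustrative single-machine version under one stated assumption: keys in the base table are unique, which is what makes it sufficient to advance only the driving cursor after a match. The names `SortMergeJoinSketch` and `sortMergeJoin` are hypothetical.

```scala
object SortMergeJoinSketch {
  // Sort merge join on an Int key.
  // Assumption: keys in the base table are unique, so after a match only the
  // driving cursor needs to advance (rule a).
  def sortMergeJoin[A, B](driving: Seq[(Int, A)], base: Seq[(Int, B)]): Seq[(Int, A, B)] = {
    // Sort phase: both inputs are ordered by the join key.
    val left = driving.sortBy(_._1).toVector
    val right = base.sortBy(_._1).toVector
    val out = Vector.newBuilder[(Int, A, B)]
    var i = 0 // cursor on the driving table
    var j = 0 // cursor on the base table
    // Merge phase: each step advances one cursor, so at most m + n steps.
    while (i < left.length && j < right.length) {
      val (lk, lv) = left(i)
      val (rk, rv) = right(j)
      if (lk == rk) { out += ((lk, lv, rv)); i += 1 } // rule a: match, output, advance driving
      else if (lk < rk) i += 1                        // rule b: driving key smaller
      else j += 1                                     // rule c: driving key larger
    }
    out.result()
  }
}
```

If the base table could contain duplicate keys, the merge step would need to revisit the run of equal keys on the base side, which this sketch deliberately leaves out.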
3. HJ (Hash Join): trading space for time
HJ computes the join in two stages: the Build phase and the Probe phase.
- a. In the Build phase, the algorithm applies a predetermined hash function to the base table to build a hash table. The key of the hash table is the hash value obtained by applying the hash function to the id field, and the value holds the original Join Key together with the Payload, as shown in the figure below.

- b. In the Probe phase, the algorithm traverses each record of the driving table in turn. It first applies the same hash function to compute the hash value of the Join Key on the fly, then uses that hash value to look up the hash table built in the Build phase. If the lookup fails, the record has no match in the base table. If the lookup succeeds, the algorithm goes on to compare the Join Keys on both sides; if they are equal, it concatenates the two records and outputs them, completing the join.
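The Build and Probe phases can be sketched in plain Scala as follows. This is a single-machine illustration, not Spark's implementation. The hash function (`floorMod` into four buckets) is an arbitrary choice made so that collisions can occur and the final Join Key comparison matters; `HashJoinSketch`, `hash` and `NumBuckets` are hypothetical names.

```scala
object HashJoinSketch {
  // Illustrative hash function: a small bucket count makes collisions likely.
  private val NumBuckets = 4
  def hash(key: Int): Int = Math.floorMod(key, NumBuckets)

  def hashJoin[A, B](driving: Seq[(Int, A)], base: Seq[(Int, B)]): Seq[(Int, A, B)] = {
    // Build phase: bucket the base table by hash value. Each bucket keeps the
    // original (join key, payload) pairs.
    val table: Map[Int, Seq[(Int, B)]] = base.groupBy { case (k, _) => hash(k) }

    // Probe phase: hash each driving record, look up its bucket, then compare
    // the join keys to rule out hash collisions.
    for {
      (k, a) <- driving
      (bk, b) <- table.getOrElse(hash(k), Seq.empty[(Int, B)])
      if bk == k
    } yield (k, a, b)
  }
}
```

The whole base table is held in the in-memory `table`, which is where the memory cost of this approach comes from.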
IV. Advantages and disadvantages of NLJ/SMJ/HJ
1. Hash join has the highest execution efficiency. However, before the Probe phase can deliver its performance benefit, the Build phase must construct a hash table in memory. Hash join therefore has high memory requirements and suits scenarios where the base table fits in memory.
2. Sort merge join has no such memory limitation: both sorting and merging can use disk. It is best suited to joins whose input tables are already sorted.
3. Nested loop join is inefficient, but it can handle non-equi joins, whereas hash join and sort merge join can only handle equi-joins.
V. Join forms in a distributed environment
See the separate article: Form of association in distributed environment