
Spark source code analysis (I): RDD collection data - partition data allocation

2022-06-26 05:53:00 Little five

What is an RDD? A Resilient Distributed Dataset.
The question to answer: how does an RDD distribute the data of a collection data source across its partitions?

For example, given the collection (1,2,3,4) and numSlices=3, how does the RDD lay the data out across its partitions?

Walking through the source code makes this clear quickly.
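Before stepping through the source, here is a minimal sketch to observe the result empirically (the object name and the local[*] master are illustrative, not from the original post). glom() gathers each partition into an array, which makes the layout visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PartitionDemo")
    val sc = new SparkContext(conf)

    // Build an RDD from a local collection, asking for 3 partitions.
    val rdd = sc.makeRDD(List(1, 2, 3, 4), numSlices = 3)

    // glom() turns each partition into an Array so we can print the layout.
    rdd.glom().collect().foreach(p => println(p.mkString("(", ",", ")")))
    // Prints: (1) (2) (3,4), one partition per line

    sc.stop()
  }
}
```

The rest of this post explains why the output comes out exactly this way.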

Stepping into the makeRDD function, we see that it simply calls the parallelize function, passing along the collection and the number of partitions.

parallelize, in turn, creates a ParallelCollectionRDD object.
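Lightly simplified from Spark's SparkContext (exact code varies by version), the two methods look like this:

```scala
// makeRDD is just an alias for parallelize.
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

// parallelize wraps the collection in a ParallelCollectionRDD.
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
```

Note that numSlices defaults to defaultParallelism when no partition count is given.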


Next, step into the ParallelCollectionRDD class. Its companion object of the same name defines a slice method, whose doc comment says (paraphrased):
Slice a collection into numSlices sub-collections. One extra thing done here is to treat Range collections specially, encoding the slices as other Ranges to minimize memory cost. This makes it efficient to run Spark over RDDs representing large sets of numbers. If the collection is an inclusive Range, an inclusive range is used for the last slice.

 

Inside slice, after the nested positions helper function is defined, the input sequence goes through pattern matching (see the sketch after this list).
case 1: a Range; slices are encoded as Ranges, and if the range is inclusive, an inclusive range is used for the last slice
case 2: a NumericRange, covering ranges of Long, Double, BigInt, and so on
case 3: anything else -> handled generically via the positions function
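Lightly simplified from Spark's source (details vary by version), the match sits inside slice[T: ClassTag](seq: Seq[T], numSlices: Int) and looks like this:

```scala
import scala.collection.immutable.NumericRange
import scala.collection.mutable.ArrayBuffer

seq match {
  case r: Range =>
    // Encode each slice as another Range so the numbers are never materialized;
    // if r is inclusive, the last slice is built as an inclusive Range.
    positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
      if (r.isInclusive && index == numSlices - 1) {
        new Range.Inclusive(r.start + start * r.step, r.end, r.step)
      } else {
        new Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }
    }.toSeq.asInstanceOf[Seq[Seq[T]]]

  case nr: NumericRange[_] =>
    // Ranges of Long, Double, BigInt, etc.: peel one slice off at a time.
    val slices = new ArrayBuffer[Seq[T]](numSlices)
    var r = nr
    for ((start, end) <- positions(nr.length, numSlices)) {
      val sliceSize = end - start
      slices += r.take(sliceSize).asInstanceOf[Seq[T]]
      r = r.drop(sliceSize)
    }
    slices

  case _ =>
    // Any other Seq: copy it to an array once (avoids O(n^2) access on List),
    // then cut the array at the index positions computed by positions.
    val array = seq.toArray
    positions(array.length, numSlices).map { case (start, end) =>
      array.slice(start, end).toSeq
    }.toSeq
}
```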


The positions function takes (the length of the collection, the number of partitions) as input and iterates over [0, numSlices) (until produces a left-closed, right-open range). For each index i it computes start and end according to a fixed rule, which yields the partition boundaries.
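positions itself is a small helper nested inside slice; lightly simplified from Spark's source:

```scala
// Maps (collection length, number of slices) to half-open index ranges.
// Slicing at the same index positions keeps operations like RDD.zip() aligned.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)  // element indices in [start, end)
  }
}
```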
For example, with (1,2,3,4,5) and numSlices=3, the loop iterates over i = 0, 1, 2 and generates the three index ranges [0,1), [1,3), [3,5). slice then cuts the array at those positions, giving the partitions (1), (2,3), (4,5).
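The same arithmetic answers the opening question for (1,2,3,4) with numSlices=3. A hand trace of positions(4, 3):

```scala
// i = 0 -> start = 0*4/3 = 0, end = 1*4/3 = 1  => indices [0,1) -> (1)
// i = 1 -> start = 1*4/3 = 1, end = 2*4/3 = 2  => indices [1,2) -> (2)
// i = 2 -> start = 2*4/3 = 2, end = 3*4/3 = 4  => indices [2,4) -> (3,4)
```

So (1,2,3,4) is stored as (1), (2), (3,4), which matches the glom() demo at the top. In other words, elements are not dealt out one by one: each partition i simply receives the contiguous index range [i*length/numSlices, (i+1)*length/numSlices).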

 
