
Spark source code analysis (I): RDD collection data - partition data allocation

2022-06-26 05:53:00 Little five

What is an RDD? A Resilient Distributed Dataset.
The question to answer: how does an RDD distribute the data of a collection data source across its partitions?

For example, given the collection (1,2,3,4) and numSlices=3, how does the RDD lay the data out across its partitions?

Walking through the source code makes this clear quickly.
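Before stepping through the source, here is a minimal sketch to observe the result empirically (the object name and the local[*] master are illustrative, not from the original post). glom() gathers each partition into an array, which makes the layout visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PartitionDemo")
    val sc = new SparkContext(conf)

    // Build an RDD from a local collection, asking for 3 partitions.
    val rdd = sc.makeRDD(List(1, 2, 3, 4), numSlices = 3)

    // glom() turns each partition into an Array so we can print the layout.
    rdd.glom().collect().foreach(p => println(p.mkString("(", ",", ")")))
    // Prints: (1) (2) (3,4), one partition per line

    sc.stop()
  }
}
```

The rest of this post explains why the output comes out exactly this way.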

Stepping into the makeRDD function, we see that it simply calls the parallelize function, passing along the collection and the number of partitions.

parallelize, in turn, creates a ParallelCollectionRDD object.
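Lightly simplified from Spark's SparkContext (exact code varies by version), the two methods look like this:

```scala
// makeRDD is just an alias for parallelize.
def makeRDD[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  parallelize(seq, numSlices)
}

// parallelize wraps the collection in a ParallelCollectionRDD.
def parallelize[T: ClassTag](
    seq: Seq[T],
    numSlices: Int = defaultParallelism): RDD[T] = withScope {
  assertNotStopped()
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
```

Note that numSlices defaults to defaultParallelism when no partition count is given.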


Next, step into the ParallelCollectionRDD class. Its companion object of the same name defines a slice method, whose doc comment says (paraphrased):
Slice a collection into numSlices sub-collections. One extra thing done here is to treat Range collections specially, encoding the slices as other Ranges to minimize memory cost. This makes it efficient to run Spark over RDDs representing large sets of numbers. If the collection is an inclusive Range, an inclusive range is used for the last slice.

 

Inside slice, after the nested positions helper function is defined, the input sequence goes through pattern matching (see the sketch after this list).
case 1: a Range; slices are encoded as Ranges, and if the range is inclusive, an inclusive range is used for the last slice
case 2: a NumericRange, covering ranges of Long, Double, BigInt, and so on
case 3: anything else -> handled generically via the positions function
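Lightly simplified from Spark's source (details vary by version), the match sits inside slice[T: ClassTag](seq: Seq[T], numSlices: Int) and looks like this:

```scala
import scala.collection.immutable.NumericRange
import scala.collection.mutable.ArrayBuffer

seq match {
  case r: Range =>
    // Encode each slice as another Range so the numbers are never materialized;
    // if r is inclusive, the last slice is built as an inclusive Range.
    positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
      if (r.isInclusive && index == numSlices - 1) {
        new Range.Inclusive(r.start + start * r.step, r.end, r.step)
      } else {
        new Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }
    }.toSeq.asInstanceOf[Seq[Seq[T]]]

  case nr: NumericRange[_] =>
    // Ranges of Long, Double, BigInt, etc.: peel one slice off at a time.
    val slices = new ArrayBuffer[Seq[T]](numSlices)
    var r = nr
    for ((start, end) <- positions(nr.length, numSlices)) {
      val sliceSize = end - start
      slices += r.take(sliceSize).asInstanceOf[Seq[T]]
      r = r.drop(sliceSize)
    }
    slices

  case _ =>
    // Any other Seq: copy it to an array once (avoids O(n^2) access on List),
    // then cut the array at the index positions computed by positions.
    val array = seq.toArray
    positions(array.length, numSlices).map { case (start, end) =>
      array.slice(start, end).toSeq
    }.toSeq
}
```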


The positions function takes (the length of the collection, the number of partitions) as input and iterates over [0, numSlices) (until produces a left-closed, right-open range). For each index i it computes start and end according to a fixed rule, which yields the partition boundaries.
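positions itself is a small helper nested inside slice; lightly simplified from Spark's source:

```scala
// Maps (collection length, number of slices) to half-open index ranges.
// Slicing at the same index positions keeps operations like RDD.zip() aligned.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)  // element indices in [start, end)
  }
}
```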
For example, with (1,2,3,4,5) and numSlices=3, the loop iterates over i = 0, 1, 2 and generates the three index ranges [0,1), [1,3), [3,5). slice then cuts the array at those positions, giving the partitions (1), (2,3), (4,5).
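The same arithmetic answers the opening question for (1,2,3,4) with numSlices=3. A hand trace of positions(4, 3):

```scala
// i = 0 -> start = 0*4/3 = 0, end = 1*4/3 = 1  => indices [0,1) -> (1)
// i = 1 -> start = 1*4/3 = 1, end = 2*4/3 = 2  => indices [1,2) -> (2)
// i = 2 -> start = 2*4/3 = 2, end = 3*4/3 = 4  => indices [2,4) -> (3,4)
```

So (1,2,3,4) is stored as (1), (2), (3,4), which matches the glom() demo at the top. In other words, elements are not dealt out one by one: each partition i simply receives the contiguous index range [i*length/numSlices, (i+1)*length/numSlices).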

 
