Spark source code analysis (I): RDD collection data - partition data allocation
2022-06-26 05:53:00 【Little five】
What is an RDD? A Resilient Distributed Dataset.
Question to answer: when an RDD is built from a collection, how is the data allocated across its partitions?
For example, given (1,2,3,4) and numSlices=3, which elements end up in which partition?
Reading the source code answers this quickly.
Step into the makeRDD function: internally it calls parallelize, passing in the collection and the number of partitions.
parallelize creates a ParallelCollectionRDD object.
Next, step into the ParallelCollectionRDD class.
The object of the same name contains a slice method, whose doc comment reads:
Slice a collection into numSlices sub-collections. One extra thing done here is to treat Range collections specially, encoding the slices as other Ranges to minimize memory cost. This makes it efficient to run Spark over RDDs representing large sets of numbers. If the collection is an inclusive Range, an inclusive range is used for the last slice.
Inside slice, after the positions helper function is defined, the collection is pattern matched:
case 1: Range. If the range is inclusive, the last slice is also an inclusive range, so the final element is covered.
case 2: NumericRange, e.g. ranges of Long, Double, BigInteger, etc.
case 3: any other collection -> sliced using the positions function
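To make case 1 concrete, here is a minimal sketch of how a Range can be cut into sub-Ranges without materializing its elements. It is a simplified illustration, not Spark's code verbatim: the object name RangeSliceSketch and the function sliceRange are hypothetical, and the start/end rule is the integer-division rule as I understand it from ParallelCollectionRDD, so check it against your Spark version.

```scala
object RangeSliceSketch {
  // Hypothetical helper: split a Range into numSlices sub-Ranges.
  // Each result is still a Range, so no element array is allocated.
  def sliceRange(r: Range, numSlices: Int): Seq[Range] = {
    require(numSlices >= 1, "need at least one slice")
    (0 until numSlices).map { i =>
      // Index bounds of slice i: start inclusive, end exclusive.
      val start = ((i * r.length.toLong) / numSlices).toInt
      val end = (((i + 1) * r.length.toLong) / numSlices).toInt
      if (r.isInclusive && i == numSlices - 1) {
        // Last slice of an inclusive range: keep the original end inclusive.
        Range.inclusive(r.start + start * r.step, r.end, r.step)
      } else {
        Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // 1 to 5 in 3 slices: elements (1), (2,3), (4,5)
    sliceRange(1 to 5, 3).foreach(s => println(s.toList))
    // 1 until 5 in 3 slices: elements (1), (2), (3,4)
    sliceRange(1 until 5, 3).foreach(s => println(s.toList))
  }
}
```

Because every slice is described only by its start, end and step, a range of a billion numbers costs no more memory to partition than a range of ten.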
The positions function takes (length of the collection, number of partitions) and iterates over [0, numSlices) (until is left-closed, right-open).
For each index it computes start and end by a fixed rule, which yields the index range of each partition.
// for example (1,2,3,4,5) numSlices=3 -> iterate over 0,1,2
// Generates three index ranges (start inclusive, end exclusive): [0,1) [1,3) [3,5)
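The rule can be sketched in a few lines. This is written from my reading of the Spark source rather than copied from it, so treat the exact expressions as an assumption; the wrapper object PositionsSketch is hypothetical and exists only to make the snippet runnable. It reproduces the three ranges from the example above.

```scala
object PositionsSketch {
  // For partition i: start = i * length / numSlices,
  //                  end   = (i + 1) * length / numSlices  (integer division).
  // Each pair is a half-open index range: start inclusive, end exclusive.
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  def main(args: Array[String]): Unit = {
    println(positions(5, 3).toList) // List((0,1), (1,3), (3,5))
    println(positions(4, 3).toList) // List((0,1), (1,2), (2,4))
  }
}
```

Since the end of one range equals the start of the next, every element lands in exactly one partition, and any remainder from the division falls to the later partitions.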
These ranges are then passed to the array's slice method to cut up the data.
That is, the partitions are (1)(2,3)(4,5).
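Putting the pieces together for case 3, the sketch below cuts an ordinary collection into partitions and also answers the opening question about (1,2,3,4) with numSlices=3. The names SliceSketch and sliceSeq are hypothetical and the code is a simplified stand-in for what Spark does, assuming the same integer-division rule for start and end.

```scala
object SliceSketch {
  // Hypothetical helper: cut any sequence into numSlices consecutive pieces.
  def sliceSeq[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "need at least one slice")
    val items = seq.toVector // materialize once so slicing by index is cheap
    (0 until numSlices).map { i =>
      val start = ((i * items.length.toLong) / numSlices).toInt
      val end = (((i + 1) * items.length.toLong) / numSlices).toInt
      items.slice(start, end) // start inclusive, end exclusive
    }
  }

  def main(args: Array[String]): Unit = {
    // (1,2,3,4,5) with 3 slices: partitions (1), (2,3), (4,5)
    sliceSeq(Seq(1, 2, 3, 4, 5), 3).foreach(p => println(p.toList))
    // (1,2,3,4) with 3 slices: partitions (1), (2), (3,4)
    sliceSeq(Seq(1, 2, 3, 4), 3).foreach(p => println(p.toList))
  }
}
```

So for the opening example, the three partitions hold (1), (2) and (3,4): the extra element goes to the last partition.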