32 - Introduction to Spark partition operators, broadcast variables, and accumulators
2022-07-23 10:36:00 【Portrait people under big data】
17.11 Operators (partitioning)
17.11.1 Transformation operators
mapPartitionsWithIndex
- Similar to mapPartitions, but the function also receives the index of the partition it is processing.
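As a minimal sketch of mapPartitionsWithIndex, the snippet below tags every element with the index of the partition that holds it. The `local[2]` master, the app name, and the sample data are assumptions made for illustration; they do not come from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("mapPartitionsWithIndexSketch")
val sc = new SparkContext(conf)

// Four elements split evenly across two partitions.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// The function receives (partition index, iterator over that partition's elements).
val tagged = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => (index, x))
}.collect()

tagged.foreach(println)
sc.stop()
```

With this input, "a" and "b" land in partition 0 and "c" and "d" in partition 1, which makes the operator handy for checking how data is distributed.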
repartition
- Increases or decreases the number of partitions. This operator always performs a shuffle.
coalesce
- coalesce is typically used to reduce the number of partitions. Its second parameter controls whether the operation performs a shuffle.
- true performs a shuffle; false does not. The default is false.
- If coalesce requests more partitions than the original RDD has and the second parameter is false, the call has no effect and the partition count stays the same. If it is set to true, the effect is the same as repartition.
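The partition-count behaviour described above can be checked with a small sketch. The RDD contents, the partition numbers (4, 2, 8), and the `local[2]` master are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("coalesceSketch")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100, numSlices = 4)

val fewer         = rdd.coalesce(2)                 // shrink without a shuffle
val unchanged     = rdd.coalesce(8)                 // shuffle = false cannot grow the count
val grown         = rdd.coalesce(8, shuffle = true) // growing requires a shuffle
val repartitioned = rdd.repartition(8)              // same as coalesce(8, shuffle = true)

val counts = Seq(fewer, unchanged, grown, repartitioned).map(_.getNumPartitions)
println(counts)
sc.stop()
```

Only the second call leaves the RDD at its original four partitions, because growing the partition count is not possible without a shuffle.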
repartition(numPartitions) = coalesce(numPartitions, true)
groupByKey
- Operates on an RDD of (K, V) pairs and groups the values by key, returning an RDD of (K, Iterable<V>).
- Differences between groupByKey and reduceByKey:
  - reduceByKey is a grouping-and-aggregating operator. It combines values on the map side by default, and the map-side logic is the same as the reduce-side logic: both use the aggregation function f that is passed in.
  - groupByKey is a grouping-and-collecting operator. It performs no map-side combine(); it only brings the values with the same key together, and it takes no aggregation function such as f.
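To make the difference concrete, here is a hedged sketch that computes per-key sums both ways. The sample pairs and the `local[2]` master are assumptions for illustration. Both approaches return the same result; reduceByKey combines values inside each partition before the shuffle, while groupByKey ships every value and sums afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("groupVsReduceSketch")
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), numSlices = 2)

// reduceByKey: takes the aggregation function and applies it on both sides of the shuffle.
val reduced = pairs.reduceByKey(_ + _).collectAsMap()

// groupByKey: takes no function; values are collected per key, then summed in a second step.
val grouped = pairs.groupByKey().mapValues(_.sum).collectAsMap()

println(reduced)
println(grouped)
sc.stop()
```

When the goal is an aggregate such as a sum or a count, reduceByKey is usually the better fit because less data crosses the network.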
zip
- Combines the elements of two RDDs (whether or not they are in KV format) into a single RDD of (K, V) pairs. The two RDDs must have the same number of partitions and the same number of elements in each partition.
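A small sketch of zip, assuming two hand-made RDDs with identical partitioning (the names `letters` and `numbers` and the `local[2]` master are illustrative, not from the article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("zipSketch")
val sc = new SparkContext(conf)

// Both RDDs have 2 partitions with 2 elements each, as zip requires.
val letters = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
val numbers = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

// Elements are paired position by position.
val zipped = letters.zip(numbers).collect()

zipped.foreach(println)
sc.stop()
```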
zipWithIndex
- Pairs each element of the RDD with its index in the RDD (starting from 0), producing (K, V) pairs in which the element is the key and the index is the value.
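A brief sketch of zipWithIndex on assumed sample data (the `local[2]` master and the element values are illustrative). Note that the index is a Long.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("zipWithIndexSketch")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// Each element becomes (element, index), with indices following partition order.
val indexed = rdd.zipWithIndex().collect()

indexed.foreach(println)
sc.stop()
```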
17.11.2 Action operators
- countByKey
  - Operates on an RDD of (K, V) pairs and counts the number of elements for each key.
- countByValue
  - Counts how many times each distinct element occurs in the dataset and returns the count for each distinct value.
- reduce
  - Aggregates all elements of the dataset using the supplied aggregation function.
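The three actions above can be tried together in one short sketch. The sample data and the `local[2]` master are assumptions for illustration. All three return plain values to the Driver rather than a new RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("actionSketch")
val sc = new SparkContext(conf)

// countByKey: number of elements per key in a (K, V) RDD.
val byKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).countByKey()

// countByValue: number of occurrences of each distinct element.
val byValue = sc.parallelize(Seq("x", "y", "x")).countByValue()

// reduce: fold all elements together with the given function.
val total = sc.parallelize(1 to 5).reduce(_ + _)

println(byKey)
println(byValue)
println(total)
sc.stop()
```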
17.12 Worked examples
17.12.1 PV & UV
17.12.2 Secondary sort
17.12.3 Top N per group
17.13 Broadcast variables and accumulators
17.13.1 Broadcast variables
Illustration:

Using broadcast variables

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
    conf.setMaster("local").setAppName("brocast")
    val sc = new SparkContext(conf)
    // Broadcast the list once from the Driver; each Executor gets a read-only copy.
    val list = List("hello yjx")
    val broadCast = sc.broadcast(list)
    val lineRDD = sc.textFile("./words.txt")
    // Keep only the lines that appear in the broadcast list, then print them.
    lineRDD.filter { x => broadCast.value.contains(x) }.foreach { println }
    sc.stop()

Notes
- Broadcast variables can only be defined on the Driver, not on an Executor.
- The value of a broadcast variable can be changed only from the Driver (in practice, by broadcasting a new value); on an Executor the value is read-only and cannot be modified.
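One way to "change" a broadcast value from the Driver is to release the old broadcast and create a new one. The sketch below is an assumption about how that might look, not the article's own method; the names `allowedV1` and `allowedV2`, the sample words, and the `local[2]` master are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("rebroadcastSketch")
val sc = new SparkContext(conf)

val words = sc.parallelize(Seq("hello", "yjx", "spark"))

// First version of the lookup set, broadcast from the Driver.
val allowedV1 = sc.broadcast(Set("hello"))
val first = words.filter(w => allowedV1.value.contains(w)).collect()

// Drop the cached copies on the Executors, then broadcast an updated set.
allowedV1.unpersist()
val allowedV2 = sc.broadcast(Set("hello", "yjx"))
val second = words.filter(w => allowedV2.value.contains(w)).collect()

println(first.mkString(", "))
println(second.mkString(", "))
sc.stop()
```

Executors only ever read `value`; every change goes through a new `sc.broadcast` call on the Driver.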
17.13.2 Accumulators
Illustration:

Using accumulators

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
    conf.setMaster("local").setAppName("accumulator")
    val sc = new SparkContext(conf)
    // Define the accumulator on the Driver; its initial value is 0.
    val accumulator = sc.longAccumulator
    // Each Executor task adds 1 per line read.
    sc.textFile("./words.txt").foreach { x => accumulator.add(1) }
    // Read the final value on the Driver.
    println(accumulator.value)
    sc.stop()

Notes
- An accumulator is defined and given its initial value on the Driver. Executors update it, and only the Driver can read its value.
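The sketch below illustrates the Driver-reads, Executor-updates split on in-memory data, and adds one detail that is easy to miss: updates made inside a lazy transformation such as map are applied only once an action runs. The accumulator names, the sample lines, and the `local[2]` master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("accumulatorSketch")
val sc = new SparkContext(conf)

val lines = sc.parallelize(Seq("hello yjx", "", "hello spark", ""), numSlices = 2)

// Updated on Executors inside an action, read on the Driver afterwards.
val blankLines = sc.longAccumulator("blankLines")
lines.foreach { line => if (line.isEmpty) blankLines.add(1) }
val blanks: Long = blankLines.value

// Updated inside a transformation: nothing happens until an action triggers the job.
val seen = sc.longAccumulator("seen")
val mapped = lines.map { line => seen.add(1); line }
val beforeAction: Long = seen.value
mapped.count()
val afterAction: Long = seen.value

println(s"blank lines: $blanks, seen before: $beforeAction, seen after: $afterAction")
sc.stop()
```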