Summary of Lin Ziyu's Spark Scala programming
2022-07-16 09:00:00 【The wind of loneliness.】
A DataFrame's first() returns the first row of data, head(n) returns the first n rows, and take(n) also returns the first n rows.
Among RDD actions, take(n) returns the first n elements, and top(n) returns the n largest elements in descending order.
On a DataFrame, count() returns the number of rows.
On a DataFrame, distinct() returns a new DataFrame with duplicate rows removed.
Among RDD transformations, distinct() is likewise used for de-duplication.
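The differences between these calls are easiest to see on a tiny in-memory dataset. The following is a minimal sketch, assuming Spark 2.x or later, where `SparkSession` is available (in spark-shell, `getOrCreate()` simply returns the existing session). The sample values and the names `nums` and `people` are made up for illustration and do not come from the article's files.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+; in spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("take-vs-top").getOrCreate()
import spark.implicits._

// Hypothetical sample data, kept in a single partition so the element order is obvious.
val nums = spark.sparkContext.parallelize(Seq(3, 1, 4, 1, 5), numSlices = 1)
val firstTwo = nums.take(2)               // first two elements in RDD order
val topTwo = nums.top(2)                  // two largest elements, in descending order
val uniqueCount = nums.distinct().count() // distinct is a transformation, count is an action

// Hypothetical rows with one duplicate.
val people = Seq((36, 1, "Ella"), (29, 2, "Bob"), (29, 2, "Bob")).toDF("age", "id", "name")
val firstRow = people.first()                // a single Row
val headRows = people.head(2)                // Array[Row] holding the first two rows
val takeRows = people.take(2)                // same result as head(2)
val rowCount = people.count()                // number of rows, duplicates included
val distinctRows = people.distinct().count() // number of rows left after de-duplication
```

Note that `take` and `head` bring rows back to the driver as an array, while `distinct` only describes a new dataset and does no work until an action such as `count` runs.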
DataFrame operations
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Initialize an SQLContext object named sqlContext; it is the entry point to Spark SQL.
val df = sqlContext.read.format("json").load("D:\\Long\\Spark\\employee.json") // Create a DataFrame from a JSON file
//1. Query all data
df.show
+----+---+-----+
| age| id| name|
+----+---+-----+
|  36|  1| Ella|
|  29|  2|  Bob|
|  29|  3| Jack|
|  28|  4|  Jim|
|  28|  4|  Jim|
|null|  5|Damon|
|null|  5|Damon|
+----+---+-----+
//2. Query all data and remove duplicate rows
df.distinct().show
+----+---+-----+
| age| id| name|
+----+---+-----+
|  36|  1| Ella|
|  29|  3| Jack|
|null|  5|Damon|
|  29|  2|  Bob|
|  28|  4|  Jim|
+----+---+-----+
//2. Query all data, leaving out the id field when printing
df.select("age","name").show
+----+-----+
| age| name|
+----+-----+
|  36| Ella|
|  29|  Bob|
|  29| Jack|
|  28|  Jim|
|  28|  Jim|
|null|Damon|
|null|Damon|
+----+-----+
// Alternatively, use df.drop("id").show; in Spark 1.x, drop removes only one column per call (Spark 2.0+ also accepts several column names)
//3. Select the records where age > 30
df.where("age>30").show // Note that the condition is passed to where as a SQL string, e.g. "age > 30"
+---+---+-----+
|age| id| name|
+---+---+-----+
| 36|  1| Ella|
+---+---+-----+
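The string form of `where` is not the only option. As a minimal sketch, assuming Spark 2.x or later, the same condition can be written as a Column expression, and `filter` can be used in place of `where`. The DataFrame `staff` below is hypothetical sample data that mirrors a few of the rows above, with `None` standing in for a null age.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+; in spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("where-demo").getOrCreate()
import spark.implicits._

// Hypothetical rows; Option[Int] becomes a nullable integer column.
val rows: Seq[(Option[Int], Int, String)] = Seq(
  (Some(36), 1, "Ella"), (Some(29), 2, "Bob"), (Some(28), 4, "Jim"), (None, 5, "Damon"))
val staff = rows.toDF("age", "id", "name")

val bySqlString = staff.where("age > 30")    // condition given as SQL text
val byColumn = staff.where(col("age") > 30)  // the same condition as a Column expression
val byFilter = staff.filter($"age" > 30)     // filter is an alias of where

// Rows with a null age match neither form, because null > 30 does not evaluate to true.
bySqlString.show()
```

The Column form lets the compiler catch some mistakes, such as unbalanced parentheses, that a SQL string only reveals at run time.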
//4. Group the data by age
df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|  29|    2|
|null|    2|
|  28|    2|
|  36|    1|
+----+-----+
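//5. Sort the data by name in ascending order (" Ella" comes first, apparently because the stored value starts with a space, as in the take(3) output in step 6)
df.sort("name").show
+----+---+-----+
| age| id| name|
+----+---+-----+
|  36|  1| Ella|
|  29|  2|  Bob|
|null|  5|Damon|
|null|  5|Damon|
|  29|  3| Jack|
|  28|  4|  Jim|
|  28|  4|  Jim|
+----+---+-----+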
//6. Take the first 3 rows
df.take(3) // take returns an array containing the first n rows
Array[org.apache.spark.sql.Row] = Array([36,1, Ella], [29,2,Bob], [29,3,Jack])
df.limit(3).show // limit instead returns a new DataFrame made up of the first n rows, which can be viewed with show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 36|  1| Ella|
| 29|  2|  Bob|
| 29|  3| Jack|
+---+---+-----+
//7. Query the name column of all records and give it the alias username
df.select(df("name").as("username")).show // as is a method on Column, so the column is selected with df("name") before the alias is applied
+--------+
|username|
+--------+
|    Ella|
|     Bob|
|    Jack|
|     Jim|
|     Jim|
|   Damon|
|   Damon|
+--------+
//8. Compute the average of age
df.agg(avg("age")).show // mean is equivalent: df.agg(mean("age")).show (avg and mean come from org.apache.spark.sql.functions)
+--------+
|avg(age)|
+--------+
|    30.0|
+--------+
//9. Compute the minimum of age
df.agg(min("age")).show
+--------+
|min(age)|
+--------+
|      28|
+--------+
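The average of 30.0 above comes from five non-null ages (36 + 29 + 29 + 28 + 28 = 150, divided by 5), even though the DataFrame has seven rows, because the aggregate functions skip nulls. The sketch below, assuming Spark 2.x or later, rebuilds the same rows by hand (the name `staff` and the use of `Option[Int]` for the age column are choices made for this example) and computes several aggregates in one `agg` call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

// Assumption: Spark 2.x+; in spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("agg-demo").getOrCreate()
import spark.implicits._

// Rows re-typed by hand from the df.show output above; None stands for a null age.
val rows: Seq[(Option[Int], Int, String)] = Seq(
  (Some(36), 1, "Ella"), (Some(29), 2, "Bob"), (Some(29), 3, "Jack"),
  (Some(28), 4, "Jim"), (Some(28), 4, "Jim"), (None, 5, "Damon"), (None, 5, "Damon"))
val staff = rows.toDF("age", "id", "name")

// Several aggregates in one pass; null ages are ignored by avg, min and max.
val stats = staff.agg(avg("age"), min("age"), max("age")).first()
val avgAge = stats.getDouble(0) // avg always yields a Double
val minAge = stats.getInt(1)    // min and max keep the column's integer type
val maxAge = stats.getInt(2)
```

When the data is read from JSON, as in the article, the age column is typed as a long, so `getLong` would be the matching accessor there.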
RDD operations
Dataset
Aaron,OperatingSystem,100
Aaron,Python,50
Aaron,ComputerNetwork,30
....
More than 1,000 records in total
How many students are there in the department?
val rdd = sc.textFile("D:\\Long\\Spark\\chat4.txt") // Create an RDD from a text file
val per = rdd.map(x => (x.split(",")(0))) // Index 0 of the split array is the student name
per.distinct().count() // distinct is a transformation that removes duplicates; count is an action that counts the elements and returns the result
// Alternative: first parse every line into a (name, course, score) tuple
val tem = rdd.map{ row => val splits = row.split(","); (splits(0), splits(1), splits(2).toInt) }
tem.map(x => x._1).distinct().count
How many courses does the department offer?
val per = rdd.map(x => (x.split(",")(1))) // Index 1 is the course name
per.distinct().count()
// Alternative, using the parsed tuples in tem (the course is the second field, x._2)
tem.map(x => x._2).distinct().count
What is Tom's average score?
val name = rdd.filter(row => row.split(",")(0) == "Tom") // Keep only Tom's records (assumed definition; name is not defined in the original snippet)
name.map(row => (row.split(",")(0), row.split(",")(2).toInt)).mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => x._1.toDouble / x._2).collect // toDouble avoids truncating integer division
// Alternative, using the parsed tuples in tem
tem.filter(x => x._1 == "Tom").map(x => (x._1, x._3)).mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => x._1.toDouble / x._2).foreach(println)
How many courses does each student take?
rdd.map(row => (row.split(",")(0), row.split(",")(1))).mapValues(x => (x, 1)).reduceByKey((x, y) => ("", x._2 + y._2)).mapValues(x => x._2).foreach(println) // The "" in ("", x._2 + y._2) is required so that the value stays a (String, Int) pair; without it a type mismatch error is reported
// Alternative, using the parsed tuples in tem
tem.map(x => (x._1, x._2)).mapValues(x => (x, 1)).reduceByKey((x, y) => ("", x._2 + y._2)).mapValues(x => x._2).foreach(println)
How many students take the DataBase course?
val total = rdd.filter(row => row.split(",")(1) == "DataBase") // Keep only the DataBase records
total.count()
//total.map(row => (row.split(",")(1),row.split(",")(0))).mapValues(x => (x,1)).reduceByKey((x,y) => ("",x._2+y._2)).mapValues(x =>x._2).foreach(println)
// Alternative, using the parsed tuples in tem
tem.filter(x => x._2 == "DataBase").map(x => x._1).distinct().count
What is the average score of each course?
rdd.map(row => (row.split(",")(1), row.split(",")(2).toInt)).mapValues(x => (x, 1)).reduceByKey((x1, x2) => (x1._1 + x2._1, x1._2 + x2._2)).mapValues(x => x._1.toDouble / x._2).foreach(println) // toDouble avoids truncating integer division
// Alternative, using the parsed tuples in tem
tem.map(x => (x._2, x._3)).mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => x._1.toDouble / x._2).foreach(println)
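The per-key average used in the last few answers follows one pattern: pair every score with a count of 1, add up sums and counts per key with reduceByKey, then divide. The following is a minimal sketch of that pattern, assuming Spark 2.x or later. The four input records are hypothetical and only imitate the "name,course,score" layout of the dataset, and the names `lines`, `parsed` and `avgByCourse` are chosen for this example.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+; in spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("avg-by-key").getOrCreate()

// Hypothetical records in the "name,course,score" layout.
val lines = spark.sparkContext.parallelize(Seq(
  "Aaron,OperatingSystem,100", "Aaron,Python,50", "Tom,Python,81", "Tom,DataBase,90"))

// Split each line once and keep a typed (name, course, score) tuple.
val parsed = lines.map { row =>
  val parts = row.split(",")
  (parts(0), parts(1), parts(2).toInt)
}

val avgByCourse = parsed
  .map(t => (t._2, (t._3, 1)))                          // course -> (score, 1)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))    // course -> (sum of scores, number of scores)
  .mapValues(p => p._1.toDouble / p._2)                 // Double division keeps the fractional part
  .collect()
  .toMap

avgByCourse.foreach(println)
```

With Int division the Python average in this sample would be truncated to 65; converting the sum to Double first keeps 65.5.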
Word frequency statistics
val rdd = sc.textFile("D:\\Long\\Spark\\word.txt")
val ree = rdd.flatMap(row => row.split(" ")) // flatMap is used instead of map because every line holds several words and there are many lines; flatMap merges the per-line arrays into a single collection of words
val ree1 = ree.map(word => (word, 1)) // map is used instead of mapValues because the elements are not key-value pairs yet, so mapValues cannot be applied
ree1.reduceByKey((x, y) => x + y).foreach(println)
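The same word count can be checked on a couple of in-memory lines and ranked by frequency. This is a minimal sketch, assuming Spark 2.x or later. The two input lines are hypothetical, and sorting with `sortBy` is an addition for illustration rather than part of the original exercise.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+; in spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[1]").appName("word-count").getOrCreate()

// Hypothetical input standing in for the lines of word.txt.
val text = spark.sparkContext.parallelize(Seq("a b a", "b c a"))

val counts = text
  .flatMap(_.split(" "))   // one element per word, across all lines
  .map(word => (word, 1))  // turn each word into a key-value pair
  .reduceByKey(_ + _)      // add up the 1s for every distinct word

// Sort by count, highest first, and bring the result back to the driver.
val ranked = counts.sortBy(_._2, ascending = false).collect()
ranked.foreach(println)
```

Collecting before printing keeps the output on the driver; `foreach(println)` directly on an RDD prints wherever the tasks run, which is the local console only when Spark runs in local mode.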