R for Data Science (notes) -- data transformation (basic use of select)
2022-06-24 19:23:00 【Shengxin Xiaopeng】

Tidy-style data processing is widely used in scientific research. I think this has a lot to do with the pipe operator %>% and the data-processing verbs.
Solving the most important and most common problems in the least amount of time is what I call efficiency; working through the remaining difficulties is what I call improvement.
Using the select verb
The first thing to be clear about is this:
filter operates on rows, while select operates on columns.
The previous note covered filter; this note covers select.
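The row/column distinction can be checked on a small example. The sketch below uses a hypothetical toy data frame (`toy` and its columns are made up for illustration, not taken from nycflights13) and compares the dimensions of the two results.

```r
library(dplyr)

# Hypothetical toy data, not from nycflights13
toy <- data.frame(
  id    = 1:4,
  month = c(1, 1, 2, 2),
  delay = c(5, -3, 12, 0)
)

# filter() keeps the rows that satisfy a condition; every column is retained
rows_kept <- filter(toy, month == 1)

# select() keeps the named columns; every row is retained
cols_kept <- select(toy, id, delay)

dim(rows_kept)  # 2 rows, 3 columns
dim(cols_kept)  # 4 rows, 2 columns
```

filter changes only the number of rows and select changes only the number of columns, so the two verbs can be chained in either order.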
### Hands-on
Again, select picks columns by name, and the column names do not need quotation marks.
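As a minimal sketch of the point about quotation marks, the example below selects the same columns twice: once with bare names, and once from a character vector wrapped in `all_of()`. The data frame `toy` and the variable `wanted` are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))

# Bare column names: no quotation marks needed
bare <- select(toy, a, c)

# Column names stored in a character vector: wrap the vector in all_of()
wanted <- c("a", "c")
from_vector <- select(toy, all_of(wanted))

# Both calls pick the same columns
identical(names(bare), names(from_vector))
```

Bare names are convenient for interactive work; `all_of()` is useful when the column names are only known at run time, and it raises an error if any listed column is missing.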
###1. The data
We again use the flights data from the nycflights13 package.
flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
###2. Selecting columns
select can pick columns by individual column names, by a range written with ":", or by exclusion written with "-".
# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 517 515 2 830 819 11 UA
#> 2 533 529 4 850 830 20 UA
#> 3 542 540 2 923 850 33 AA
#> 4 544 545 -1 1004 1022 -18 B6
#> 5 554 600 -6 812 837 -25 DL
#> 6 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
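One detail worth making explicit is that a range such as year:day follows the order of the columns in the data, not alphabetical order. The sketch below illustrates this on a hypothetical five-column data frame (`toy` is made up, with column names borrowed from flights) and also shows that "-" and "!" give the same result when a single range is excluded.

```r
library(dplyr)

# Hypothetical toy data with a few flights-like column names
toy <- data.frame(
  year      = 2013,
  month     = 1,
  day       = 1:3,
  dep_delay = c(2, 4, -1),
  carrier   = c("UA", "UA", "AA")
)

# A range covers every column positioned between its two endpoints
in_range <- names(select(toy, month:dep_delay))

# Excluding a range with "-" or with "!" leaves the remaining columns
minus_form <- names(select(toy, -(year:day)))
bang_form  <- names(select(toy, !(year:day)))

in_range
minus_form
identical(minus_form, bang_form)
```

The parentheses around year:day keep the whole range together before it is negated.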
###3. Extension 1 (Boolean operators)
“:” selects a range of consecutive variables.
“!” takes the complement of a set of variables.
“&” and “|” select the intersection or the union of two sets of variables.
“c()” combines selections.
The starwars and iris datasets are used for demonstration here. The iris output below is printed as a tibble, which assumes iris was first converted with as_tibble(iris).
starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows
The “!” operator negates a selection:
starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # ... with 83 more rows
iris %>% select(!c(Sepal.Length, Petal.Length))
#> # A tibble: 150 x 3
#> Sepal.Width Petal.Width Species
#> <dbl> <dbl> <fct>
#> 1 3.5 0.2 setosa
#> 2 3 0.2 setosa
#> 3 3.2 0.2 setosa
#> 4 3.1 0.2 setosa
#> # ... with 146 more rows
iris %>% select(!ends_with("Width"))
#> # A tibble: 150 x 3
#> Sepal.Length Petal.Length Species
#> <dbl> <dbl> <fct>
#> 1 5.1 1.4 setosa
#> 2 4.9 1.4 setosa
#> 3 4.7 1.3 setosa
#> 4 4.6 1.5 setosa
#> # ... with 146 more rows
“&” and “|” take the intersection or the union of two selections:
iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Width
#> <dbl>
#> 1 0.2
#> 2 0.2
#> 3 0.2
#> 4 0.2
#> # ... with 146 more rows
iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#> Petal.Length Petal.Width Sepal.Width
#> <dbl> <dbl> <dbl>
#> 1 1.4 0.2 3.5
#> 2 1.4 0.2 3
#> 3 1.3 0.2 3.2
#> 4 1.5 0.2 3.1
#> # ... with 146 more rows
The operators can also be combined:
iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#> Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.4
#> 3 1.3
#> 4 1.5
#> # ... with 146 more rows
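The Boolean operators above behave like set operations on column names. As a rough check, the sketch below runs the same selections on the built-in iris data and compares them with base R's `intersect()`, `union()` and `setdiff()`. The variable names are chosen here for illustration only.

```r
library(dplyr)

# names() makes each selection easy to compare at a glance
petal      <- names(select(iris, starts_with("Petal")))
width      <- names(select(iris, ends_with("Width")))
both       <- names(select(iris, starts_with("Petal") & ends_with("Width")))
either     <- names(select(iris, starts_with("Petal") | ends_with("Width")))
petal_only <- names(select(iris, starts_with("Petal") & !ends_with("Width")))

# Each operator matches the corresponding set operation
setequal(both, intersect(petal, width))
setequal(either, union(petal, width))
setequal(petal_only, setdiff(petal, width))
```

All three comparisons return TRUE, which matches the printed results above: “&” is an intersection, “|” is a union, and “& !” is a set difference.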
select becomes much more powerful when combined with other functions; that will be covered in another note.