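Features of the dplyr package

Its core functions share the following characteristics:

The first argument is a data frame
They return a data frame
They never modify data in place

These properties are what make the %>% operator usable, so that operations can be chained in a logical order.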
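
As a minimal sketch of these three properties, the snippet below uses the built-in mtcars data set. That data set and the names cars, heavy and heavy_piped are assumptions chosen for illustration; they are not part of the walkthrough that follows.

```r
library(dplyr)

# mtcars ships with base R; it stands in for any data frame here
cars <- mtcars

# the data frame is the first argument, and a data frame comes back
heavy <- filter(cars, wt > 4)

# because input and output are both data frames, calls can be chained
heavy_piped <- cars %>%
    filter(wt > 4) %>%
    select(mpg, wt)

# the original object is unchanged: nothing was modified in place
identical(cars, mtcars)
```

To keep a result, it has to be assigned to a name, as is done for heavy above.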

Loading the data

# load packages

suppressMessages(library(dplyr))

# install hflights once if it is not already available

# install.packages("hflights")

library(hflights)

# explore data

data(hflights)

head(hflights)

# convert to local data frame (tbl_df is deprecated in recent dplyr releases; as_tibble is the current equivalent)

flights <- tbl_df(hflights)

# printing only shows 10 rows and as many columns as can fit on your screen

flights

# you can specify that you want to see more rows

print(flights, n=20)

# convert to a normal data frame to see all of the columns

data.frame(head(flights))

filter

keep rows matching criteria

# base R approach to view all flights on January 1

flights[flights$Month==1 & flights$DayofMonth==1, ]

# dplyr approach

# note: you can use comma or ampersand to represent AND condition

filter(flights, Month==1, DayofMonth==1)

# use the vertical bar (|) for OR condition

filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")

# you can also use %in% operator

filter(flights, UniqueCarrier %in% c("AA", "UA"))

select

pick columns by name

# base R approach to select DepTime, ArrTime, and FlightNum columns

flights[, c("DepTime", "ArrTime", "FlightNum")]

# dplyr approach

select(flights, DepTime, ArrTime, FlightNum)

# use colon to select multiple contiguous columns, and use `contains` to match columns by name

# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name

select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

“chaining” or “pipelining”

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes

filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# chaining method

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    filter(DepDelay > 60)

# create two vectors and calculate Euclidean distance between them

x1 <- 1:5; x2 <- 2:6

sqrt(sum((x1-x2)^2))

# chaining method

(x1-x2)^2 %>% sum() %>% sqrt()

arrange

reorder rows

# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay

flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]

# dplyr approach

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(DepDelay)

# use `desc` for descending

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

mutate

add new variables
create new variables that are functions of existing variables
this differs from base R's transform, because mutate can refer to variables created earlier in the same call
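
The difference from transform can be sketched with a small example. It uses the built-in mtcars data set, and the column names kpl and kpl_per_wt are hypothetical, picked only to show one new column depending on another.

```r
library(dplyr)

# mtcars is a built-in data set used here only for illustration
cars <- mtcars %>% select(mpg, wt)

# mutate() can use kpl in the same call that creates it
with_mutate <- cars %>%
    mutate(kpl = mpg * 0.425,
           kpl_per_wt = kpl / wt)

# transform() evaluates every expression against the original columns,
# so the dependent column has to be added in a second call
with_transform <- transform(cars, kpl = mpg * 0.425)
with_transform <- transform(with_transform, kpl_per_wt = kpl / wt)
```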

# base R approach to create a new variable Speed (in mph)

flights$Speed <- flights$Distance / flights$AirTime*60

flights[, c("Distance", "AirTime", "Speed")]

# dplyr approach (prints the new variable but does not store it)

flights %>%

    select(Distance, AirTime) %>%

    mutate(Speed = Distance/AirTime*60)

# store the new variable

flights <- flights %>% mutate(Speed = Distance/AirTime*60)

summarise

reduce variables to values

# base R approaches to calculate the average arrival delay to each destination

head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))

head(aggregate(ArrDelay ~ Dest, flights, mean))

# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay

flights %>%

    group_by(Dest) %>%

    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))

# summarise_each allows you to apply the same summary function to multiple columns at once

# Note: mutate_each is also available; recent dplyr releases deprecate both in favour of across()

# for each carrier, calculate the percentage of flights cancelled or diverted

flights %>%

    group_by(UniqueCarrier) %>%

    summarise_each(funs(mean), Cancelled, Diverted)

# for each carrier, calculate the minimum and maximum arrival and departure delays

flights %>%

    group_by(UniqueCarrier) %>%

    summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
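
The two summarise_each calls above rely on funs, which newer dplyr releases have deprecated. A possible equivalent using across (available from dplyr 1.0.0) is sketched below. It runs on the built-in mtcars data set rather than flights so that it is self-contained, and the names means and ranges are illustrative assumptions.

```r
library(dplyr)

# mtcars is a built-in data set used here only for illustration

# one function applied to several named columns
means <- mtcars %>%
    group_by(cyl) %>%
    summarise(across(c(mpg, hp), mean))

# several functions applied to every column matched by a helper;
# the result columns are named <column>_<function>
ranges <- mtcars %>%
    group_by(cyl) %>%
    summarise(across(starts_with("d"),
                     list(min = ~ min(.x, na.rm = TRUE),
                          max = ~ max(.x, na.rm = TRUE))))
```

With the flights data, the same pattern would take c(Cancelled, Diverted) or matches("Delay") as the column selection.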

# Helper function n() counts the number of rows in a group

# Helper function n_distinct(vector) counts the number of unique items in that vector

# for each day of the year, count the total number of flights and sort in descending order

flights %>%

    group_by(Month, DayofMonth) %>%

    summarise(flight_count = n()) %>%

    arrange(desc(flight_count))

# rewrite more simply with the `tally` function

flights %>%

    group_by(Month, DayofMonth) %>%

    tally(sort = TRUE)

# for each destination, count the total number of flights and the number of distinct planes that flew there

flights %>%

    group_by(Dest) %>%

    summarise(flight_count = n(), plane_count = n_distinct(TailNum))

# Grouping can sometimes be useful without summarising

# for each destination, show the number of cancelled and not cancelled flights

flights %>%

    group_by(Dest) %>%

    select(Cancelled) %>%

    table() %>%

    head()

Window Functions

Aggregation function (like mean) takes n inputs and returns 1 value
Window function takes n inputs and returns n values
Includes ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).
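
The contrast between the two kinds of function can be seen on a three-element vector. This is a small illustrative sketch; the vector x is made up for the purpose.

```r
library(dplyr)

x <- c(10, 30, 20)

# aggregation: n inputs, one output
mean(x)

# window functions: n inputs, n outputs
min_rank(x)        # ranks, smallest value ranked 1
min_rank(desc(x))  # ranks, largest value ranked 1
lag(x)             # previous value, NA for the first element
lead(x)            # next value, NA for the last element
cummean(x)         # running mean
```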

# for each carrier, calculate which two days of the year they had their longest departure delays

# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value

flights %>%

    group_by(UniqueCarrier) %>%

    select(Month, DayofMonth, DepDelay) %>%

    filter(min_rank(desc(DepDelay)) <= 2) %>%

    arrange(UniqueCarrier, desc(DepDelay))

# rewrite more simply with the `top_n` function

flights %>%

    group_by(UniqueCarrier) %>%

    select(Month, DayofMonth, DepDelay) %>%

    top_n(2, DepDelay) %>%

    arrange(UniqueCarrier, desc(DepDelay))

# for each month, calculate the number of flights and the change from the previous month

flights %>%

    group_by(Month) %>%

    summarise(flight_count = n()) %>%

    mutate(change = flight_count - lag(flight_count))

# rewrite more simply with the `tally` function

flights %>%

    group_by(Month) %>%

    tally() %>%

    mutate(change = n - lag(n))

Other functions

# randomly sample a fixed number of rows, without replacement

flights %>% sample_n(5)

# randomly sample a fraction of rows, with replacement

flights %>% sample_frac(0.25, replace=TRUE)

# base R approach to view the structure of an object

str(flights)

# dplyr approach: better formatting, and adapts to your screen width

glimpse(flights)

Connecting Databases

dplyr can connect to a database as if the data was loaded into a data frame
Use the same syntax for local data frames and databases
Only generates SELECT statements
Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
Example below is based upon an SQLite database containing the hflights data
Instructions for creating this database are in the databases vignette
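
For readers without that SQLite file, the idea can be tried with a throwaway in-memory database. The sketch below is not the setup used in this post: it assumes the DBI, RSQLite and dbplyr packages are installed, connects through DBI::dbConnect instead of src_sqlite, and copies the built-in mtcars data set in place of hflights.

```r
library(dplyr)

# assumption: DBI, RSQLite and dbplyr are installed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# copy a built-in data frame into the database as table "mtcars"
copy_to(con, mtcars, "mtcars")

# reference the remote table; no rows are pulled into R yet
cars_tbl <- tbl(con, "mtcars")

# the same verbs as for a local data frame
query <- cars_tbl %>%
    select(mpg, cyl) %>%
    filter(cyl == 4) %>%
    arrange(desc(mpg))

# show the SQL that dplyr generated, then fetch the result into R
show_query(query)
result <- collect(query)

DBI::dbDisconnect(con)
```

Nothing is sent to the database until show_query or collect is called, which is why the pipeline can be built up step by step at no cost.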

# connect to an SQLite database containing the hflights data

my_db <- src_sqlite("my_db.sqlite3")

# connect to the "hflights" table in that database

flights_tbl <- tbl(my_db, "hflights")

# example query with our data frame

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

# identical query using the database

flights_tbl %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

You can write the SQL commands yourself
dplyr can tell you the SQL it plans to run and the query execution plan

# send SQL commands to the database

tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))

# ask dplyr for the SQL commands

flights_tbl %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay)) %>%

    explain()
