概述

aggregate函数应该是数据处理中常用到的函数,简单说有点类似sql语言中的group by,可以按照要求把数据打组聚合,然后对聚合以后的数据进行加和、求平均等各种操作。

x=data.frame(name=c("张三","李四","王五","赵六"),sex=c("M","M","F","F"),age=c(20,40,22,30),height=c(166,170,150,155))

构造一个很简单的数据,一组人的性别、年龄和身高,可以用aggregate函数来求不同性别的平均年龄和身高

aggregate(x[,3:4],by=list(sex=x$sex),FUN=mean)

几个注意点:

  • 字符或者factor类型的列不要一起加入计算,会报错
  • by参数要构造成list,如果有多个字段,by就对应队列,和group by多个字段是同样的道理

这个函数的功能比较强大,它首先将数据进行分组(按行),然后对每一组数据进行函数统计,最后把结果组合成一个比较nice的表格返回。根据数据对象不同它有三种用法,分别应用于数据框(data.frame)、公式(formula)和时间序列(ts):

aggregate(x, by, FUN, ..., simplify = TRUE)
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)

语法

aggregate(x, ...)

## S3 method for class 'default':
aggregate((x, ...)) ## S3 method for class 'data.frame':
aggregate((x, by, FUN, ..., simplify = TRUE)) ## S3 method for class 'formula':
aggregate((formula, data, FUN, ...,
subset, na.action = na.omit)) ## S3 method for class 'ts':
aggregate((x, nfrequency = 1, FUN = sum, ndeltat = 1,
ts.eps = getOption("ts.eps"), ...)) ###细节查看 ?aggregate

Example1

我们通过 mtcars 数据集的操作对这个函数进行简单了解。mtcars 是不同类型汽车道路测试的数据框类型数据:

> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

先用attach函数把mtcars的列变量名称加入到变量搜索范围内,然后使用aggregate函数按cyl(汽缸数)进行分类计算平均值:

> attach(mtcars)
> aggregate(mtcars, by=list(cyl), FUN=mean)
Group.1 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 26.66364 4 105.1364 82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2 6 19.74286 6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3 8 15.10000 8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000

by参数也可以包含多个类型的因子,得到的就是每个不同因子组合的统计结果:

> aggregate(mtcars, by=list(cyl, gear), FUN=mean)

Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 3 21.500 4 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 3 1.000000
2 6 3 19.750 6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 3 1.000000
3 8 3 15.050 8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3 3.083333
4 4 4 26.925 4 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 4 1.500000
5 6 4 19.750 6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4 4.000000
6 4 5 28.200 4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 5 2.000000
7 6 5 19.700 6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 5 6.000000
8 8 5 15.400 8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 5 6.000000

公式(formula)是一种特殊的R数据对象,在aggregate函数中使用公式参数可以对数据框的部分指标进行统计:

> aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)
cyl gear mpg hp
1 4 3 21.500 97.0000
2 6 3 19.750 107.5000
3 8 3 15.050 194.1667
4 4 4 26.925 76.0000
5 6 4 19.750 116.5000
6 4 5 28.200 102.0000
7 6 5 19.700 175.0000
8 8 5 15.400 299.5000

上面的公式 cbind(mpg,hp) ~ cyl+gear 表示使用 cyl 和 gear 的因子组合对 cbind(mpg,hp) 数据进行操作。aggregate在时间序列数据上的应用请参考R的函数说明文档。

Example2


## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean) ## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
list(Region = state.region,
Cold = state.x77[,"Frost"] > 130),
mean)
## (Note that no state in 'South' is THAT cold.) ## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
aggregate(x = testDF, by = list(by1, by2), FUN = "mean") # and if you want to treat NAs as a group
fby1 <- factor(by1, exclude = "")
fby2 <- factor(by2, exclude = "")
aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean") ## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum) ## Dot notation:
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean) ## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
xtabs(len ~ ., data = ag) ## Compute the average annual approval ratings for American presidents.
aggregate(presidents, nfrequency = 1, FUN = mean)
## Give the summer less weight.
aggregate(presidents, nfrequency = 1,
FUN = weighted.mean, w = c(1, 1, 0.5, 1))

Example3

#load data
data <- ChickWeight
head(data)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1 #dimension of the data
dim(data)
[1] 578 4 #how many chickens
unique(data$Chick)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48 #how many diets
unique(data$Diet)
[1] 1 2 3 4
Levels: 1 2 3 4 #how many time points
unique(data$Time)
[1] 0 2 4 6 8 10 12 14 16 18 20 21 library(ggplot2)
ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +
geom_line() +
geom_point() ------------------------------------------------------ ## S3 method for class 'data.frame'
## aggregate(x, by, FUN, ..., simplify = TRUE) #find the mean weight depending on diet
aggregate(data$weight, list(diet = data$Diet), mean)
diet x
1 1 102.6455
2 2 122.6167
3 3 142.9500
4 4 135.2627 #aggregate on time
aggregate(data$weight, list(time=data$Time), mean)
time x
1 0 41.06000
2 2 49.22000
3 4 59.95918
4 6 74.30612
5 8 91.24490
6 10 107.83673
7 12 129.24490
8 14 143.81250
9 16 168.08511
10 18 190.19149
11 20 209.71739
12 21 218.68889 #use a different function
aggregate(data$weight, list(time=data$Time), sd)
time x
1 0 1.132272
2 2 3.688316
3 4 4.495179
4 6 9.012038
5 8 16.239780
6 10 23.987277
7 12 34.119600
8 14 38.300412
9 16 46.904079
10 18 57.394757
11 20 66.511708
12 21 71.510273 #we could also aggregate on time and diet
head(aggregate(data$weight,
list(time = data$Time, diet = data$Diet),
mean
)
)
time diet x
1 0 1 41.40000
2 2 1 47.25000
3 4 1 56.47368
4 6 1 66.78947
5 8 1 79.68421
6 10 1 93.05263
tail(aggregate(data$weight,
list(time = data$Time, diet = data$Diet),
mean
)
)
time diet x
43 12 4 151.4000
44 14 4 161.8000
45 16 4 182.0000
46 18 4 202.9000
47 20 4 233.8889
48 21 4 238.5556 #to see the weights over time across different diets
ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +
facet_wrap(~Diet) +
guides(col=guide_legend(ncol=3))

Example4

The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.

# Get a count of number of subjects in each category (sex*condition)
cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
cdata
#> sex condition subject
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4 # Rename "subject" column to "N"
names(cdata)[names(cdata)=="subject"] <- "N"
cdata
#> sex condition N
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4 # Sort by sex first
cdata <- cdata[order(cdata$sex),]
cdata
#> sex condition N
#> 1 F aspirin 5
#> 3 F placebo 12
#> 2 M aspirin 9
#> 4 M placebo 4 # We also keep the __before__ and __after__ columns:
# Get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")], FUN=mean)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06000 7.640000 -3.420000
#> 2 M aspirin 11.26667 5.855556 -5.411111
#> 3 F placebo 10.13333 8.075000 -2.058333
#> 4 M placebo 11.47500 10.500000 -0.975000 # Merge the data frames
cdata <- merge(cdata, cdata.means)
cdata
#> sex condition N before after change
#> 1 F aspirin 5 11.06000 7.640000 -3.420000
#> 2 F placebo 12 10.13333 8.075000 -2.058333
#> 3 M aspirin 9 11.26667 5.855556 -5.411111
#> 4 M placebo 4 11.47500 10.500000 -0.975000 # Get the sample (n-1) standard deviation for "change"
cdata.sd <- aggregate(data["change"],
by = data[c("sex","condition")], FUN=sd)
# Rename the column to change.sd
names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
cdata.sd
#> sex condition change.sd
#> 1 F aspirin 0.8642916
#> 2 M aspirin 1.1307569
#> 3 F placebo 0.5247655
#> 4 M placebo 0.7804913 # Merge
cdata <- merge(cdata, cdata.sd)
cdata
#> sex condition N before after change change.sd
#> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916
#> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655
#> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569
#> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 # Calculate standard error of the mean
cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
cdata
#> sex condition N before after change change.sd change.se
#> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 0.3865230
#> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 0.1514867
#> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 0.3769190
#> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 0.3902456

If you have NA’s in your data and wish to skip them, use na.rm=TRUE:

cdata.means <- aggregate(data[c("before","after","change")],
by = data[c("sex","condition")],
FUN=mean, na.rm=TRUE)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06000 7.640000 -3.420000
#> 2 M aspirin 11.26667 5.855556 -5.411111
#> 3 F placebo 10.13333 8.075000 -2.058333
#> 4 M placebo 11.47500 10.500000 -0.975000

R-aggregate()的更多相关文章

  1. pandas聚合aggregate

    #!/usr/bin/env python # -*- coding: utf-8 -*- # @Time : 2018/5/24 15:03 # @Author : zhang chao # @Fi ...

  2. quotas and disk replace on netapp

    ==================================================================================================== ...

  3. 【原创】JQWidgets-TreeGrid 2、初探源码

    已知JQWidgets的TreeGrid组件依赖于jqxcore.js.jqxtreegrid.js,实际上它还依赖于jqxdatatable.js.我们先通过一个例子,来探索本次的话题. 需求: 图 ...

  4. Python数据分析_Pandas_窗函数

    窗函数(window function)经常用在频域信号分析中.我其实不咋个懂,大概是从无限长的信号中截一段出来,然后把这一段做延拓变成一个虚拟的无限长的信号.用来截取的函数就叫窗函数,窗函数又分很多 ...

  5. Flink – WindowedStream

    在WindowedStream上可以执行,如reduce,aggregate,min,max等操作 关键是要理解windowOperator对KVState的运用,因为window是用它来存储wind ...

  6. Pandas | 16 聚合

    当有了滚动,扩展和ewm对象创建了以后,就有几种方法可以对数据执行聚合. DataFrame应用聚合 可以通过向整个DataFrame传递一个函数来进行聚合,或者通过标准的获取项目方法来选择一个列. ...

  7. pandas tutorial

    目录 Series 利用dict来创建series 利用标量创建series 取 Dataframe 利用dict创建dataframe 选择 添加列 列移除 行的选择, 添加, 移除 Panel B ...

  8. [原]CentOS7安装Rancher2.1并部署kubernetes (二)---部署kubernetes

    ##################    Rancher v2.1.7  +    Kubernetes 1.13.4  ################ ##################### ...

  9. 利用python进行数据分析2_数据采集与操作

    txt_filename = './files/python_baidu.txt' # 打开文件 file_obj = open(txt_filename, 'r', encoding='utf-8' ...

  10. Django项目:CRM(客户关系管理系统)--81--71PerfectCRM实现CRM项目首页

    {#portal.html#} {## ————————46PerfectCRM实现登陆后页面才能访问————————#} {#{% extends 'king_admin/table_index.h ...

随机推荐

  1. luogu题解 P2860[USACO冗余路径Redundant Paths] 缩点+桥

    题目链接 https://www.luogu.org/problemnew/show/P2860 https://www.lydsy.com/JudgeOnline/problem.php?id=17 ...

  2. XVS 操作

    1. xvs安装   rpm -i  ***.rpm 2.获取license root@ubuntu:/usr/local/xvs# ./xvs -L .Host ID: 16b3d720584704 ...

  3. 微信小程序常用事件

    bind bind事件绑定不会阻止冒泡事件向上冒泡,catch事件绑定可以阻止冒泡事件向上冒泡. bindtap  跳转页面 bindchange  .value 改变时触发 change 事件 bi ...

  4. IIS 6.0 PUT上传 任意文件创建漏洞

    IIS 6.0 PUT上传 任意文件创建漏洞 require 1.IIS Server在Web服务扩展中开启了WebDAV. 2.IIS配置了可以写入的权限,包括网站 1.根目录 2.子文件夹 3.i ...

  5. 在线程中使用ClientQuery注意的问题

    今天遇到奇怪的问题,在线程中建立一个TkbmMWClientQuery的临时对象q,及一个TkbmMWBinaryStreamFormat的临时对象bsf,第一次执行正常,再次执行时一直等待,也不产生 ...

  6. 02.Zabbix⾃定义监控项

    1.zabbix⾃定义监控初试 如何获取系统中想监控对象的值,获取后⼜如何将该值传递给Zabbix-Server 1.1.监控系统中的对象 #(系统监控命令 + awk + 筛选条件 = 监控的状态值 ...

  7. kubernetes资源清单之pod

    什么是pod? Pod是一组一个或多个容器(例如Docker容器),具有共享的存储/网络,以及有关如何运行这些容器的规范. Pod的内容始终位于同一地点,并在同一时间安排,并在共享上下文中运行. Po ...

  8. Django的Auth模块

    1 Auth模块是什么 Auth模块是Django自带的用户认证模块: 我们在开发一个网站的时候,无可避免的需要设计实现网站的用户系统.此时我们需要实现包括用户注册.用户登录.用户认证.注销.修改密码 ...

  9. (十二)Linux Kernel suspend and resume

    一.对于休眠(suspend)的简单介绍   在Linux中,休眠主要分三个主要的步骤:   1) 冻结用户态进程和内核态任务   2) 调用注册的设备的suspend的回调函数, 顺序是按照注册顺序 ...

  10. Vue快速学习_第四节

    获取原生的DOM方式($.refs) 给标签或者组件 添加ref <div ref = 'liu'>test</div> <Home ref = 'home'>&l ...