R-aggregate()

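Overview

aggregate() is one of the most commonly used functions in data processing. Roughly speaking, it works like GROUP BY in SQL: it splits the data into groups as specified and then applies an operation, such as a sum or a mean, to each group.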

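x <- data.frame(name=c("Zhang San","Li Si","Wang Wu","Zhao Liu"),sex=c("M","M","F","F"),age=c(20,40,22,30),height=c(166,170,150,155))

This builds a very simple data set holding the sex, age, and height of a few people. aggregate() can then compute the mean age and height for each sex: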

aggregate(x[,3:4],by=list(sex=x$sex),FUN=mean)

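A few things to note:

Do not include character or factor columns among the columns being aggregated. They do not give a meaningful result: mean() returns NA with a warning for non-numeric input, and functions such as sum() stop with an error.
The by argument must be a list. To group by several fields, put one element per field in the list, just as GROUP BY accepts several columns.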
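As a minimal sketch of the two notes above, the snippet below selects the numeric columns by name and then groups by two fields at once. The city column is a hypothetical addition made up for this illustration; it is not part of the original example.

```r
# Toy data mirroring the example above
x <- data.frame(
  name   = c("Zhang San", "Li Si", "Wang Wu", "Zhao Liu"),
  sex    = c("M", "M", "F", "F"),
  age    = c(20, 40, 22, 30),
  height = c(166, 170, 150, 155)
)
# Hypothetical extra grouping column, added only for this illustration
x$city <- c("Beijing", "Shanghai", "Beijing", "Beijing")

# Select the numeric columns by name so that 'name' never reaches FUN
by_sex <- aggregate(x[, c("age", "height")],
                    by = list(sex = x$sex),
                    FUN = mean)

# Two grouping fields: one list element per field,
# comparable to GROUP BY sex, city in SQL
by_sex_city <- aggregate(x[, c("age", "height")],
                         by = list(sex = x$sex, city = x$city),
                         FUN = mean)

print(by_sex)
print(by_sex_city)
```

Naming the list elements (sex = ..., city = ...) also gives the grouping columns readable names instead of the default Group.1, Group.2.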

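The function is quite powerful: it first splits the data into groups (by row), then applies a summary function to each group, and finally combines the results into a tidy table. It has three forms depending on the type of input, one each for data frames (data.frame), formulas (formula), and time series (ts):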

aggregate(x, by, FUN, ..., simplify = TRUE)

aggregate(formula, data, FUN, ..., subset, na.action = na.omit)

aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)
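To make the three forms concrete, here is a small sketch that calls each of them once. It uses the built-in mtcars data for the first two; the quarterly series q is made up purely for illustration.

```r
# Data frame method: x is a data frame, by is a list of grouping vectors
df_res <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)

# Formula method: response ~ grouping variables, both looked up in 'data'
fm_res <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Time series method: collapse a quarterly series into yearly totals
# (q is a made-up series used only for this illustration)
q <- ts(1:8, frequency = 4, start = c(2000, 1))
ts_res <- aggregate(q, nfrequency = 1, FUN = sum)

print(df_res)
print(fm_res)
print(ts_res)
```

The data frame and formula calls compute the same group means; which one to use is mostly a matter of taste, although the two differ in how they treat missing values by default.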

Syntax

aggregate(x, ...)

## S3 method for class 'default':

aggregate(x, ...)

## S3 method for class 'data.frame':

aggregate(x, by, FUN, ..., simplify = TRUE)

## S3 method for class 'formula':

aggregate(formula, data, FUN, ...,
          subset, na.action = na.omit)

## S3 method for class 'ts':

aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,
          ts.eps = getOption("ts.eps"), ...)

### For details, see ?aggregate

Example1

We will get a feel for the function by working with the mtcars data set, a data frame of road test results for different car models:

> str(mtcars)

'data.frame': 32 obs. of 11 variables:

$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...

$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...

$ disp: num 160 160 108 258 360 ...

$ hp : num 110 110 93 110 175 105 245 62 95 123 ...

$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...

$ wt : num 2.62 2.88 2.32 3.21 3.44 ...

$ qsec: num 16.5 17 18.6 19.4 17 ...

$ vs : num 0 0 1 1 0 1 0 1 1 1 ...

$ am : num 1 1 1 0 0 0 0 0 0 0 ...

$ gear: num 4 4 4 3 3 3 3 4 4 4 ...

$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

First use attach() to add the columns of mtcars to the search path, then use aggregate() to compute the mean of every column, grouped by cyl (number of cylinders):

> attach(mtcars)

> aggregate(mtcars, by=list(cyl), FUN=mean)

  Group.1      mpg cyl     disp        hp     drat       wt     qsec        vs        am     gear     carb
1       4 26.66364   4 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909 0.7272727 4.090909 1.545455
2       6 19.74286   6 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286 0.4285714 3.857143 3.428571
3       8 15.10000   8 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000 0.1428571 3.285714 3.500000
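The call above depends on attach() and repeats cyl both as Group.1 and as an averaged column. As a sketch of an alternative (not the approach used in this post), the grouping column can be referenced explicitly and left out of the columns being averaged:

```r
# Refer to the grouping column explicitly instead of relying on attach()
value_cols <- setdiff(names(mtcars), "cyl")

# Naming the list element labels the group column 'cyl' rather than 'Group.1'
res <- aggregate(mtcars[value_cols],
                 by = list(cyl = mtcars$cyl),
                 FUN = mean)

print(res)
```

This avoids adding names to the search path, which makes it harder to pick up a stale or masked variable by accident.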

The by argument can also contain several factors, in which case the result holds the statistics for each combination of factor levels:

> aggregate(mtcars, by=list(cyl, gear), FUN=mean)

  Group.1 Group.2    mpg cyl     disp       hp     drat       wt    qsec  vs   am gear     carb
1       4       3 21.500   4 120.1000  97.0000 3.700000 2.465000 20.0100 1.0 0.00    3 1.000000
2       6       3 19.750   6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00    3 1.000000
3       8       3 15.050   8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00    3 3.083333
4       4       4 26.925   4 102.6250  76.0000 4.110000 2.378125 19.6125 1.0 0.75    4 1.500000
5       6       4 19.750   6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50    4 4.000000
6       4       5 28.200   4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00    5 2.000000
7       6       5 19.700   6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00    5 6.000000
8       8       5 15.400   8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00    5 6.000000

A formula is a special kind of R object. Passing a formula to aggregate() lets you summarise only selected columns of a data frame:

> aggregate(cbind(mpg,hp) ~ cyl+gear, FUN=mean)

  cyl gear    mpg       hp
1   4    3 21.500  97.0000
2   6    3 19.750 107.5000
3   8    3 15.050 194.1667
4   4    4 26.925  76.0000
5   6    4 19.750 116.5000
6   4    5 28.200 102.0000
7   6    5 19.700 175.0000
8   8    5 15.400 299.5000

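The formula cbind(mpg,hp) ~ cyl+gear above means: apply the function to the columns in cbind(mpg,hp), grouped by each combination of cyl and gear. For using aggregate() on time series data, see the R help page for the function.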
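The formula call shown earlier finds mpg, hp, cyl and gear only because mtcars was attached. A sketch of the same call with an explicit data argument, plus the subset argument from the formula method's signature, might look like this:

```r
# Same grouping as above, but the variables are looked up in 'data'
all_cars <- aggregate(cbind(mpg, hp) ~ cyl + gear,
                      data = mtcars, FUN = mean)

# 'subset' restricts the rows before grouping; here only manual cars (am == 1)
manual_only <- aggregate(cbind(mpg, hp) ~ cyl + gear,
                         data = mtcars, FUN = mean,
                         subset = am == 1)

print(all_cars)
print(manual_only)
```

Groups that have no rows left after subsetting simply do not appear in the result.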

Example2



## Compute the averages for the variables in 'state.x77', grouped

## according to the region (Northeast, South, North Central, West) that

## each state belongs to.

aggregate(state.x77, list(Region = state.region), mean)

## Compute the averages according to region and the occurrence of more

## than 130 days of frost.

aggregate(state.x77,

          list(Region = state.region,

               Cold = state.x77[,"Frost"] > 130),

          mean)

## (Note that no state in 'South' is THAT cold.)

## example with character variables and NAs

testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),

                     v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )

by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)

by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)

aggregate(x = testDF, by = list(by1, by2), FUN = "mean")

# and if you want to treat NAs as a group

fby1 <- factor(by1, exclude = "")

fby2 <- factor(by2, exclude = "")

aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")

## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:

aggregate(weight ~ feed, data = chickwts, mean)

aggregate(breaks ~ wool + tension, data = warpbreaks, mean)

aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

## Dot notation:

aggregate(. ~ Species, data = iris, mean)

aggregate(len ~ ., data = ToothGrowth, mean)

## Often followed by xtabs():

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

xtabs(len ~ ., data = ag)

## Compute the average annual approval ratings for American presidents.

aggregate(presidents, nfrequency = 1, FUN = mean)

## Give the summer less weight.

aggregate(presidents, nfrequency = 1,

          FUN = weighted.mean, w = c(1, 1, 0.5, 1))
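The NA example in the listing above is easier to follow on a tiny data set. The sketch below uses made-up values (df and g are hypothetical) to show that rows whose grouping value is NA are dropped by default and that turning NA into a factor level keeps them as a group of their own. It uses exclude = NULL, which is assumed here to have the same effect as the exclude = "" used above.

```r
# Made-up data for illustration only
df <- data.frame(v = c(1, 2, 3, 4))
g  <- c("a", "a", NA, "b")

# Default: the row whose group is NA is left out of the result
without_na <- aggregate(df, by = list(g = g), FUN = sum)

# Keep NA as a factor level so it forms its own group
fg <- factor(g, exclude = NULL)
with_na <- aggregate(df, by = list(g = fg), FUN = sum)

print(without_na)
print(with_na)
```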

Example3

#load data

data <- ChickWeight

head(data)

  weight Time Chick Diet

1     42    0     1    1

2     51    2     1    1

3     59    4     1    1

4     64    6     1    1

5     76    8     1    1

6     93   10     1    1

#dimension of the data

dim(data)

[1] 578   4

#how many chickens

unique(data$Chick)

 [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

[31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 < 3 < 1 < 12 < ... < 48

#how many diets

unique(data$Diet)

[1] 1 2 3 4

Levels: 1 2 3 4

#how many time points

unique(data$Time)

 [1]  0  2  4  6  8 10 12 14 16 18 20 21

library(ggplot2)

ggplot(data=data, aes(x=Time, y=weight, group=Chick, colour=Chick)) +

       geom_line() +

       geom_point()

------------------------------------------------------

## S3 method for class 'data.frame'

## aggregate(x, by, FUN, ..., simplify = TRUE)

#find the mean weight depending on diet

aggregate(data$weight, list(diet = data$Diet), mean)

  diet        x

1    1 102.6455

2    2 122.6167

3    3 142.9500

4    4 135.2627
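The result column is called x because a bare vector (data$weight) was passed in. As a small sketch, either of the following keeps a meaningful column name; both use the built-in ChickWeight data directly.

```r
# Formula interface: column names are taken from the formula
by_formula <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)

# Data frame interface: wrap the vector in a one-column data frame
by_df <- aggregate(data.frame(weight = ChickWeight$weight),
                   by = list(diet = ChickWeight$Diet),
                   FUN = mean)

print(by_formula)
print(by_df)
```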

#aggregate on time

aggregate(data$weight, list(time=data$Time), mean)

   time         x

1     0  41.06000

2     2  49.22000

3     4  59.95918

4     6  74.30612

5     8  91.24490

6    10 107.83673

7    12 129.24490

8    14 143.81250

9    16 168.08511

10   18 190.19149

11   20 209.71739

12   21 218.68889

#use a different function

aggregate(data$weight, list(time=data$Time), sd)

   time         x

1     0  1.132272

2     2  3.688316

3     4  4.495179

4     6  9.012038

5     8 16.239780

6    10 23.987277

7    12 34.119600

8    14 38.300412

9    16 46.904079

10   18 57.394757

11   20 66.511708

12   21 71.510273
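Instead of calling aggregate() once per statistic, FUN can return a named vector. The following sketch computes the mean and standard deviation per time point in one call. Note that the statistics come back as a matrix column, which the last step flattens into ordinary columns.

```r
# One call, two statistics per group
stats <- aggregate(weight ~ Time, data = ChickWeight,
                   FUN = function(w) c(mean = mean(w), sd = sd(w)))

# 'weight' is a matrix column with one column per statistic
print(colnames(stats$weight))

# Flatten the matrix column into ordinary data frame columns
flat <- do.call(data.frame, stats)
print(names(flat))
print(head(flat))
```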

#we could also aggregate on time and diet

head(aggregate(data$weight,

               list(time = data$Time, diet = data$Diet),

               mean

              )

    )

  time diet        x

1    0    1 41.40000

2    2    1 47.25000

3    4    1 56.47368

4    6    1 66.78947

5    8    1 79.68421

6   10    1 93.05263

tail(aggregate(data$weight,

               list(time = data$Time, diet = data$Diet),

               mean

              )

    )

   time diet        x

43   12    4 151.4000

44   14    4 161.8000

45   16    4 182.0000

46   18    4 202.9000

47   20    4 233.8889

48   21    4 238.5556

#to see the weights over time across different diets

ggplot(data) + geom_line(aes(x=Time, y=weight, colour=Chick)) +

             facet_wrap(~Diet) +

             guides(col=guide_legend(ncol=3))

Example4

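The aggregate function can be harder to use than helpers from add-on packages, but it is included in the base R installation and does not require installing anything else. In this example, data is assumed to be a data frame with the columns subject, sex, condition, before, after, and change.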

# Get a count of number of subjects in each category (sex*condition)

cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)

cdata

#>   sex condition subject

#> 1   F   aspirin       5

#> 2   M   aspirin       9

#> 3   F   placebo      12

#> 4   M   placebo       4

# Rename "subject" column to "N"

names(cdata)[names(cdata)=="subject"] <- "N"

cdata

#>   sex condition  N

#> 1   F   aspirin  5

#> 2   M   aspirin  9

#> 3   F   placebo 12

#> 4   M   placebo  4

# Sort by sex first

cdata <- cdata[order(cdata$sex),]

cdata

#>   sex condition  N

#> 1   F   aspirin  5

#> 3   F   placebo 12

#> 2   M   aspirin  9

#> 4   M   placebo  4

# We also keep the before and after columns:

# Get the average effect size by sex and condition

cdata.means <- aggregate(data[c("before","after","change")],

                         by = data[c("sex","condition")], FUN=mean)

cdata.means

#>   sex condition   before     after    change

#> 1   F   aspirin 11.06000  7.640000 -3.420000

#> 2   M   aspirin 11.26667  5.855556 -5.411111

#> 3   F   placebo 10.13333  8.075000 -2.058333

#> 4   M   placebo 11.47500 10.500000 -0.975000

# Merge the data frames

cdata <- merge(cdata, cdata.means)

cdata

#>   sex condition  N   before     after    change

#> 1   F   aspirin  5 11.06000  7.640000 -3.420000

#> 2   F   placebo 12 10.13333  8.075000 -2.058333

#> 3   M   aspirin  9 11.26667  5.855556 -5.411111

#> 4   M   placebo  4 11.47500 10.500000 -0.975000

# Get the sample (n-1) standard deviation for "change"

cdata.sd <- aggregate(data["change"],

                      by = data[c("sex","condition")], FUN=sd)

# Rename the column to change.sd

names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"

cdata.sd

#>   sex condition change.sd

#> 1   F   aspirin 0.8642916

#> 2   M   aspirin 1.1307569

#> 3   F   placebo 0.5247655

#> 4   M   placebo 0.7804913

# Merge

cdata <- merge(cdata, cdata.sd)

cdata

#>   sex condition  N   before     after    change change.sd

#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916

#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655

#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569

#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913

# Calculate standard error of the mean

cdata$change.se <- cdata$change.sd / sqrt(cdata$N)

cdata

#>   sex condition  N   before     after    change change.sd change.se

#> 1   F   aspirin  5 11.06000  7.640000 -3.420000 0.8642916 0.3865230

#> 2   F   placebo 12 10.13333  8.075000 -2.058333 0.5247655 0.1514867

#> 3   M   aspirin  9 11.26667  5.855556 -5.411111 1.1307569 0.3769190

#> 4   M   placebo  4 11.47500 10.500000 -0.975000 0.7804913 0.3902456
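Because the data frame used in this example is not shown, the steps are hard to run as they stand. Below is a compact, self-contained sketch of the same count / mean / sd / standard error workflow. The data frame d and all of its values are invented for illustration and have nothing to do with the numbers printed above.

```r
# Invented data for illustration only: two subjects per sex/condition cell
d <- data.frame(
  subject   = 1:8,
  sex       = c("F", "F", "F", "F", "M", "M", "M", "M"),
  condition = c("aspirin", "aspirin", "placebo", "placebo",
                "aspirin", "aspirin", "placebo", "placebo"),
  change    = c(-3, -5, -2, -2, -6, -4, -1, -1)
)
groups <- d[c("sex", "condition")]

# Count, mean and sample standard deviation per group
n <- aggregate(d["subject"], by = groups, FUN = length)
names(n)[names(n) == "subject"] <- "N"
m <- aggregate(d["change"], by = groups, FUN = mean)
s <- aggregate(d["change"], by = groups, FUN = sd)
names(s)[names(s) == "change"] <- "change.sd"

# merge() joins on the shared columns sex and condition
out <- Reduce(merge, list(n, m, s))

# Standard error of the mean
out$change.se <- out$change.sd / sqrt(out$N)
print(out)
```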

If you have NA’s in your data and wish to skip them, use na.rm=TRUE:

cdata.means <- aggregate(data[c("before","after","change")],

                         by = data[c("sex","condition")],

                         FUN=mean, na.rm=TRUE)

cdata.means

#>   sex condition   before     after    change

#> 1   F   aspirin 11.06000  7.640000 -3.420000

#> 2   M   aspirin 11.26667  5.855556 -5.411111

#> 3   F   placebo 10.13333  8.075000 -2.058333

#> 4   M   placebo 11.47500 10.500000 -0.975000
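The output above is the same with and without na.rm, presumably because the data contain no missing values. A tiny sketch with made-up data (d, g and v are hypothetical) shows what the argument changes, and how the formula interface behaves with its default na.action = na.omit:

```r
# Made-up data with one missing value in group "a"
d <- data.frame(g = c("a", "a", "b", "b"), v = c(1, NA, 3, 5))

# Without na.rm the mean of group "a" is NA
plain <- aggregate(d["v"], by = d["g"], FUN = mean)

# na.rm = TRUE is passed through ... to mean()
skipped <- aggregate(d["v"], by = d["g"], FUN = mean, na.rm = TRUE)

# The formula method drops incomplete rows first (na.action = na.omit)
via_formula <- aggregate(v ~ g, data = d, FUN = mean)

print(plain)
print(skipped)
print(via_formula)
```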
