pandas groupby aggregation functions
import numpy as np
import pandas as pd
Aggregation functions
Aggregations refer to any data transformation that produces scalar values from arrays (arrays in, scalar values out). The preceding examples have used several of them, including mean, count, min, and sum. You may wonder what is going on when you invoke mean() on a GroupBy object. Many common aggregations, such as those found in Table 10-1, have optimized implementations. However, you are not limited to only this set of methods:
- count
- sum
- mean
- median
- std, var
- min, max
- prod
- first, last
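The optimized methods above can be exercised on any GroupBy object. A minimal sketch with made-up data (the frame, key, and column names here are invented for illustration):

```python
import pandas as pd

# Made-up frame to exercise a few of the optimized aggregations
frame = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})
g = frame.groupby('key')['val']

g.sum()    # a: 4.0, b: 5.0
g.mean()   # a: 2.0, b: 5.0
g.prod()   # a: 3.0, b: 5.0
g.first()  # a: 1.0, b: 5.0
```

Each call produces one scalar per group, which is exactly the aggregation contract described above.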
You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.
For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame.
While quantile is not explicitly implemented for GroupBy, it is a Series method and thus available for use. Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) for each piece, and then assembles those results together into the result object:
df = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
df
|   | key1 | key2 | data1 | data2 |
|---|------|------|-------|-------|
| 0 | a | one | 1.296733 | -0.756093 |
| 1 | a | two | -1.389859 | -1.027718 |
| 2 | b | one | -0.846801 | -0.802681 |
| 3 | b | two | 1.200620 | -1.328187 |
| 4 | a | one | 0.002991 | -1.223807 |
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)  # the 0.9 quantile within each group
key1
a 1.037985
b 0.995878
Name: data1, dtype: float64
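What GroupBy does internally here can be imitated by hand. A sketch on a tiny made-up Series (the values and keys are invented; the point is only the slice-then-reassemble pattern):

```python
import pandas as pd

# A tiny Series with the group key as its index (made-up data)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'a', 'b', 'b'])

# Conceptually, grouped.quantile(0.9) slices per group,
# calls quantile on each piece, then reassembles the results
pieces = {key: piece.quantile(0.9) for key, piece in s.groupby(level=0)}
manual = pd.Series(pieces)

auto = s.groupby(level=0).quantile(0.9)
assert (manual.values == auto.values).all()
```

The one-liner and the manual loop agree; GroupBy just does the slicing far more efficiently.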
To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
def peak_to_peak(arr):
    """Compute the range (max - min) of an array."""
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)  # the range within each group, similar in spirit to apply
| key1 | data1 | data2 |
|------|-------|-------|
| a | 2.686592 | 0.467714 |
| b | 2.047421 | 0.525506 |
You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking (duck typing?):
grouped.describe()  # summary statistics per group (numeric columns only)
| key1 | data1 count | data1 mean | data1 std | data1 min | data1 25% | data1 50% | data1 75% | data1 max | data2 count | data2 mean | data2 std | data2 min | data2 25% | data2 50% | data2 75% | data2 max |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| a | 3.0 | -0.030045 | 1.343601 | -1.389859 | -0.693434 | 0.002991 | 0.649862 | 1.296733 | 3.0 | -1.002539 | 0.234871 | -1.223807 | -1.125762 | -1.027718 | -0.891905 | -0.756093 |
| b | 2.0 | 0.176910 | 1.447745 | -0.846801 | -0.334946 | 0.176910 | 0.688765 | 1.200620 | 2.0 | -1.065434 | 0.371589 | -1.328187 | -1.196811 | -1.065434 | -0.934057 | -0.802681 |
I will explain in more detail what has happened here in Section 10.3, Apply: General split-apply-combine, on page 302.
Column-wise and multiple function application
Let's return to the tipping dataset from earlier examples. After loading it with read_csv, we add a tipping percentage column, tip_pct:
Add a new column: the tip as a percentage of the total bill.
tips = pd.read_csv('../examples/tips.csv')
tips.info()
tips.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 6 columns):
total_bill 244 non-null float64
tip 244 non-null float64
smoker 244 non-null object
day 244 non-null object
time 244 non-null object
size 244 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 11.5+ KB
|   | total_bill | tip | smoker | day | time | size |
|---|------------|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
"Add a new column tip_pct"
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
'Add a new column tip_pct'
|   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|------------|-----|--------|-----|------|------|---------|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
| 5 | 25.29 | 4.71 | No | Sun | Dinner | 4 | 0.186240 |
As you've already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once. Fortunately, this is possible to do, which I'll illustrate through a number of examples. First, I'll group the tips by day and smoker:
grouped = tips.groupby(['day', 'smoker'])
Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function as a string:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
# count of observations per group
grouped_pct.agg('count')
day smoker
Fri No 4
Yes 15
Sat No 45
Yes 42
Sun No 57
Yes 19
Thur No 45
Yes 17
Name: tip_pct, dtype: int64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
"Apply one or more aggregations to one or more columns, laid out side by side"
grouped_pct.agg(['mean', 'std', peak_to_peak])
'Apply one or more aggregations to one or more columns, laid out side by side'
| day | smoker | mean | std | peak_to_peak |
|-----|--------|------|-----|--------------|
| Fri | No | 0.151650 | 0.028123 | 0.067349 |
| Fri | Yes | 0.174783 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.039767 | 0.235193 |
| Sat | Yes | 0.147906 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.042347 | 0.193226 |
| Sun | Yes | 0.187250 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.038774 | 0.193350 |
| Thur | Yes | 0.163863 | 0.039389 | 0.151240 |
Here we passed a list of aggregation functions to agg to evaluate independently on the data groups.
You don't need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name '<lambda>', which makes them hard to identify (you can see for yourself by looking at a function's __name__ attribute). Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names.
(You can think of a list of 2-tuples as an ordered mapping.)
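The naming problem is easy to see from the __name__ attribute. A small standalone check (span is a made-up function name used only for contrast):

```python
# Lambdas all report the name '<lambda>', so columns produced from
# several lambdas are hard to tell apart; named functions keep their names
f = lambda x: max(x) - min(x)
assert f.__name__ == '<lambda>'

def span(values):
    return max(values) - min(values)
assert span.__name__ == 'span'
```

This is why (name, function) tuples are the reliable way to label lambda-based aggregations.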
"Give the aggregated columns custom names"
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
'Give the aggregated columns custom names'
| day | smoker | foo | bar |
|-----|--------|-----|-----|
| Fri | No | 0.151650 | 0.028123 |
| Fri | Yes | 0.174783 | 0.051293 |
| Sat | No | 0.158048 | 0.039767 |
| Sat | Yes | 0.147906 | 0.061375 |
| Sun | No | 0.160113 | 0.042347 |
| Sun | Yes | 0.187250 | 0.154134 |
| Thur | No | 0.160298 | 0.038774 |
| Thur | Yes | 0.163863 | 0.039389 |
With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
functions = ['count', 'mean', 'max']
"Apply chosen aggregations to chosen columns, each independently"
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
'Apply chosen aggregations to chosen columns, each independently'
| day | smoker | tip_pct count | tip_pct mean | tip_pct max | total_bill count | total_bill mean | total_bill max |
|-----|--------|------|------|------|------|------|------|
| Fri | No | 4 | 0.151650 | 0.187735 | 4 | 18.420000 | 22.75 |
| Fri | Yes | 15 | 0.174783 | 0.263480 | 15 | 16.813333 | 40.17 |
| Sat | No | 45 | 0.158048 | 0.291990 | 45 | 19.661778 | 48.33 |
| Sat | Yes | 42 | 0.147906 | 0.325733 | 42 | 21.276667 | 50.81 |
| Sun | No | 57 | 0.160113 | 0.252672 | 57 | 20.506667 | 48.17 |
| Sun | Yes | 19 | 0.187250 | 0.710345 | 19 | 24.120000 | 45.35 |
| Thur | No | 45 | 0.160298 | 0.266312 | 45 | 17.113111 | 41.19 |
| Thur | Yes | 17 | 0.163863 | 0.241255 | 17 | 19.190588 | 43.11 |
As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together, with the column names as the keys argument.
result['tip_pct']  # selecting from the hierarchical columns
| day | smoker | count | mean | max |
|-----|--------|-------|------|-----|
| Fri | No | 4 | 0.151650 | 0.187735 |
| Fri | Yes | 15 | 0.174783 | 0.263480 |
| Sat | No | 45 | 0.158048 | 0.291990 |
| Sat | Yes | 42 | 0.147906 | 0.325733 |
| Sun | No | 57 | 0.160113 | 0.252672 |
| Sun | Yes | 19 | 0.187250 | 0.710345 |
| Thur | No | 45 | 0.160298 | 0.266312 |
| Thur | Yes | 17 | 0.163863 | 0.241255 |
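The concat equivalence described above can be checked directly. A sketch on a made-up miniature of the tips data (the values are invented; only the structural equality matters):

```python
import pandas as pd

# A made-up mini version of the tips data
mini = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'No', 'Yes'],
    'tip_pct': [0.15, 0.17, 0.16, 0.15],
    'total_bill': [18.0, 17.0, 20.0, 21.0],
})
g = mini.groupby(['day', 'smoker'])
functions = ['count', 'mean', 'max']

agg_result = g[['tip_pct', 'total_bill']].agg(functions)

# Aggregate each column separately, then glue with concat using keys
pieces = [g[col].agg(functions) for col in ['tip_pct', 'total_bill']]
concat_result = pd.concat(pieces, axis=1, keys=['tip_pct', 'total_bill'])

assert agg_result.equals(concat_result)
```

Both routes yield the same hierarchically-columned DataFrame; agg just does it in one call.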
As before, a list of tuples with custom names can be passed:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
| day | smoker | tip_pct Durchschnitt | tip_pct Abweichung | total_bill Durchschnitt | total_bill Abweichung |
|-----|--------|------|------|------|------|
| Fri | No | 0.151650 | 0.000791 | 18.420000 | 25.596333 |
| Fri | Yes | 0.174783 | 0.002631 | 16.813333 | 82.562438 |
| Sat | No | 0.158048 | 0.001581 | 19.661778 | 79.908965 |
| Sat | Yes | 0.147906 | 0.003767 | 21.276667 | 101.387535 |
| Sun | No | 0.160113 | 0.001793 | 20.506667 | 66.099980 |
| Sun | Yes | 0.187250 | 0.023757 | 24.120000 | 109.046044 |
| Thur | No | 0.160298 | 0.001503 | 17.113111 | 59.625081 |
| Thur | Yes | 0.163863 | 0.001551 | 19.190588 | 69.808518 |
Now suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:
grouped.agg({'tip': np.max, 'size': 'sum'})
| day | smoker | tip | size |
|-----|--------|-----|------|
| Fri | No | 3.50 | 9 |
| Fri | Yes | 4.73 | 31 |
| Sat | No | 9.00 | 115 |
| Sat | Yes | 10.00 | 104 |
| Sun | No | 6.00 | 167 |
| Sun | Yes | 6.50 | 49 |
| Thur | No | 6.70 | 112 |
| Thur | Yes | 5.00 | 40 |
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std', 'sum'],
             'size': 'sum'})
| day | smoker | tip_pct min | tip_pct max | tip_pct mean | tip_pct std | tip_pct sum | size sum |
|-----|--------|------|------|------|------|------|------|
| Fri | No | 0.120385 | 0.187735 | 0.151650 | 0.028123 | 0.606602 | 9 |
| Fri | Yes | 0.103555 | 0.263480 | 0.174783 | 0.051293 | 2.621746 | 31 |
| Sat | No | 0.056797 | 0.291990 | 0.158048 | 0.039767 | 7.112145 | 115 |
| Sat | Yes | 0.035638 | 0.325733 | 0.147906 | 0.061375 | 6.212055 | 104 |
| Sun | No | 0.059447 | 0.252672 | 0.160113 | 0.042347 | 9.126438 | 167 |
| Sun | Yes | 0.065660 | 0.710345 | 0.187250 | 0.154134 | 3.557756 | 49 |
| Thur | No | 0.072961 | 0.266312 | 0.160298 | 0.038774 | 7.213414 | 112 |
| Thur | Yes | 0.090014 | 0.241255 | 0.163863 | 0.039389 | 2.785676 | 40 |
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
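This rule is easy to verify on a tiny made-up frame (the column names k, x, y are invented for the check):

```python
import pandas as pd

d = pd.DataFrame({'k': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4, 5, 6]})
g = d.groupby('k')

flat = g.agg({'x': 'sum', 'y': 'max'})              # one function per column
nested = g.agg({'x': ['sum', 'mean'], 'y': 'max'})  # a list for at least one column

assert flat.columns.nlevels == 1    # plain columns
assert nested.columns.nlevels == 2  # hierarchical columns
```

As soon as any column receives a list of functions, all result columns become a two-level MultiIndex.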
Dropping the group keys from the row index
- as_index=False
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn't always desirable, you can disable this behavior in most cases by passing as_index=False to groupby:
tips.groupby(['day', 'smoker'], as_index=False).mean()
|   | day | smoker | total_bill | tip | size | tip_pct |
|---|-----|--------|------------|-----|------|---------|
| 0 | Fri | No | 18.420000 | 2.812500 | 2.250000 | 0.151650 |
| 1 | Fri | Yes | 16.813333 | 2.714000 | 2.066667 | 0.174783 |
| 2 | Sat | No | 19.661778 | 3.102889 | 2.555556 | 0.158048 |
| 3 | Sat | Yes | 21.276667 | 2.875476 | 2.476190 | 0.147906 |
| 4 | Sun | No | 20.506667 | 3.167895 | 2.929825 | 0.160113 |
| 5 | Sun | Yes | 24.120000 | 3.516842 | 2.578947 | 0.187250 |
| 6 | Thur | No | 17.113111 | 2.673778 | 2.488889 | 0.160298 |
| 7 | Thur | Yes | 19.190588 | 3.030000 | 2.352941 | 0.163863 |
Of course, it's always possible to obtain the result in this format by calling reset_index on the result. Using as_index=False avoids some unnecessary computations.
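The reset_index equivalence can be checked on a made-up miniature of the tips data (the values are invented; only the equality of the two routes matters):

```python
import pandas as pd

mini = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat'],
    'smoker': ['No', 'Yes', 'No'],
    'tip': [2.0, 3.0, 4.0],
})
# Route 1: keep group keys as regular columns from the start
flat = mini.groupby(['day', 'smoker'], as_index=False).mean()
# Route 2: aggregate with an index, then move the keys back into columns
via_reset = mini.groupby(['day', 'smoker']).mean().reset_index()

assert flat.equals(via_reset)
```

Both produce the same flat DataFrame; as_index=False simply skips building the hierarchical index in the first place.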