pandas groupby aggregation functions
import numpy as np
import pandas as pd
Aggregation functions
Aggregations refer to any data transformation that produces scalar values from arrays (arrays in, scalar values out). The preceding examples have used several of them, including mean, count, min, and sum. You may wonder what is going on when you invoke mean() on a GroupBy object. Many common aggregations, such as those found in Table 10-1, have optimized implementations. However, you are not limited to only this set of methods:
- count
- sum
- mean
- median
- std, var
- min, max
- prod
- first, last
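The optimized methods above can be exercised on any GroupBy object. A minimal sketch with made-up data (the frame, key, and column names here are invented for illustration):

```python
import pandas as pd

# Made-up frame to exercise a few of the optimized aggregations
frame = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})
g = frame.groupby('key')['val']

g.sum()    # a: 4.0, b: 5.0
g.mean()   # a: 2.0, b: 5.0
g.prod()   # a: 3.0, b: 5.0
g.first()  # a: 1.0, b: 5.0
```

Each call produces one scalar per group, which is exactly the aggregation contract described above.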
You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.
For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame.
While quantile is not explicitly implemented for GroupBy, it is a Series method and thus available for use. Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) for each piece, and then assembles those results together into the result object:
df = pd.DataFrame({
    'key1': 'a a b b a'.split(),
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': np.random.randn(5),
    'data2': np.random.randn(5)
})
df
|   | key1 | key2 | data1 | data2 |
|---|------|------|-------|-------|
| 0 | a | one | 1.296733 | -0.756093 |
| 1 | a | two | -1.389859 | -1.027718 |
| 2 | b | one | -0.846801 | -0.802681 |
| 3 | b | two | 1.200620 | -1.328187 |
| 4 | a | one | 0.002991 | -1.223807 |
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)  # the 0.9 quantile within each group
key1
a 1.037985
b 0.995878
Name: data1, dtype: float64
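What GroupBy does internally here can be imitated by hand. A sketch on a tiny made-up Series (the values and keys are invented; the point is only the slice-then-reassemble pattern):

```python
import pandas as pd

# A tiny Series with the group key as its index (made-up data)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'a', 'b', 'b'])

# Conceptually, grouped.quantile(0.9) slices per group,
# calls quantile on each piece, then reassembles the results
pieces = {key: piece.quantile(0.9) for key, piece in s.groupby(level=0)}
manual = pd.Series(pieces)

auto = s.groupby(level=0).quantile(0.9)
assert (manual.values == auto.values).all()
```

The one-liner and the manual loop agree; GroupBy just does the slicing far more efficiently.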
To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
def peak_to_peak(arr):
    """Compute the range (max - min) of an array."""
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)  # the range within each group, similar in spirit to apply
| key1 | data1 | data2 |
|------|-------|-------|
| a | 2.686592 | 0.467714 |
| b | 2.047421 | 0.525506 |
You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking (duck typing?):
grouped.describe()  # summary statistics per group (numeric columns only)
| key1 | data1 count | data1 mean | data1 std | data1 min | data1 25% | data1 50% | data1 75% | data1 max | data2 count | data2 mean | data2 std | data2 min | data2 25% | data2 50% | data2 75% | data2 max |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| a | 3.0 | -0.030045 | 1.343601 | -1.389859 | -0.693434 | 0.002991 | 0.649862 | 1.296733 | 3.0 | -1.002539 | 0.234871 | -1.223807 | -1.125762 | -1.027718 | -0.891905 | -0.756093 |
| b | 2.0 | 0.176910 | 1.447745 | -0.846801 | -0.334946 | 0.176910 | 0.688765 | 1.200620 | 2.0 | -1.065434 | 0.371589 | -1.328187 | -1.196811 | -1.065434 | -0.934057 | -0.802681 |
I will explain in more detail what has happened here in Section 10.3, Apply: General split-apply-combine, on page 302.
Column-wise and multiple function application
Let's return to the tipping dataset from earlier examples. After loading it with read_csv, we add a tipping percentage column, tip_pct:
Add a new column: the tip as a percentage of the total bill.
tips = pd.read_csv('../examples/tips.csv')
tips.info()
tips.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 6 columns):
total_bill 244 non-null float64
tip 244 non-null float64
smoker 244 non-null object
day 244 non-null object
time 244 non-null object
size 244 non-null int64
dtypes: float64(2), int64(1), object(3)
memory usage: 11.5+ KB
|   | total_bill | tip | smoker | day | time | size |
|---|------------|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |
"Add a new column tip_pct"
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
'Add a new column tip_pct'
|   | total_bill | tip | smoker | day | time | size | tip_pct |
|---|------------|-----|--------|-----|------|------|---------|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |
| 5 | 25.29 | 4.71 | No | Sun | Dinner | 4 | 0.186240 |
As you've already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once. Fortunately, this is possible to do, which I'll illustrate through a number of examples. First, I'll group the tips by day and smoker:
grouped = tips.groupby(['day', 'smoker'])
Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function as a string:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
# count of observations per group
grouped_pct.agg('count')
day smoker
Fri No 4
Yes 15
Sat No 45
Yes 42
Sun No 57
Yes 19
Thur No 45
Yes 17
Name: tip_pct, dtype: int64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
"Apply one or more aggregations to one or more columns, laid out side by side"
grouped_pct.agg(['mean', 'std', peak_to_peak])
'Apply one or more aggregations to one or more columns, laid out side by side'
| day | smoker | mean | std | peak_to_peak |
|-----|--------|------|-----|--------------|
| Fri | No | 0.151650 | 0.028123 | 0.067349 |
| Fri | Yes | 0.174783 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.039767 | 0.235193 |
| Sat | Yes | 0.147906 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.042347 | 0.193226 |
| Sun | Yes | 0.187250 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.038774 | 0.193350 |
| Thur | Yes | 0.163863 | 0.039389 | 0.151240 |
Here we passed a list of aggregation functions to agg to evaluate independently on the data groups.
You don't need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name '<lambda>', which makes them hard to identify (you can see for yourself by looking at a function's __name__ attribute). Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names.
(You can think of a list of 2-tuples as an ordered mapping.)
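The naming problem is easy to see from the __name__ attribute. A small standalone check (span is a made-up function name used only for contrast):

```python
# Lambdas all report the name '<lambda>', so columns produced from
# several lambdas are hard to tell apart; named functions keep their names
f = lambda x: max(x) - min(x)
assert f.__name__ == '<lambda>'

def span(values):
    return max(values) - min(values)
assert span.__name__ == 'span'
```

This is why (name, function) tuples are the reliable way to label lambda-based aggregations.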
"Give the aggregated columns custom names"
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
'Give the aggregated columns custom names'
| day | smoker | foo | bar |
|-----|--------|-----|-----|
| Fri | No | 0.151650 | 0.028123 |
| Fri | Yes | 0.174783 | 0.051293 |
| Sat | No | 0.158048 | 0.039767 |
| Sat | Yes | 0.147906 | 0.061375 |
| Sun | No | 0.160113 | 0.042347 |
| Sun | Yes | 0.187250 | 0.154134 |
| Thur | No | 0.160298 | 0.038774 |
| Thur | Yes | 0.163863 | 0.039389 |
With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:
functions = ['count', 'mean', 'max']
"Apply chosen aggregations to chosen columns, each independently"
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
'Apply chosen aggregations to chosen columns, each independently'
| day | smoker | tip_pct count | tip_pct mean | tip_pct max | total_bill count | total_bill mean | total_bill max |
|-----|--------|------|------|------|------|------|------|
| Fri | No | 4 | 0.151650 | 0.187735 | 4 | 18.420000 | 22.75 |
| Fri | Yes | 15 | 0.174783 | 0.263480 | 15 | 16.813333 | 40.17 |
| Sat | No | 45 | 0.158048 | 0.291990 | 45 | 19.661778 | 48.33 |
| Sat | Yes | 42 | 0.147906 | 0.325733 | 42 | 21.276667 | 50.81 |
| Sun | No | 57 | 0.160113 | 0.252672 | 57 | 20.506667 | 48.17 |
| Sun | Yes | 19 | 0.187250 | 0.710345 | 19 | 24.120000 | 45.35 |
| Thur | No | 45 | 0.160298 | 0.266312 | 45 | 17.113111 | 41.19 |
| Thur | Yes | 17 | 0.163863 | 0.241255 | 17 | 19.190588 | 43.11 |
As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together, with the column names as the keys argument.
result['tip_pct']  # selecting from the hierarchical columns
| day | smoker | count | mean | max |
|-----|--------|-------|------|-----|
| Fri | No | 4 | 0.151650 | 0.187735 |
| Fri | Yes | 15 | 0.174783 | 0.263480 |
| Sat | No | 45 | 0.158048 | 0.291990 |
| Sat | Yes | 42 | 0.147906 | 0.325733 |
| Sun | No | 57 | 0.160113 | 0.252672 |
| Sun | Yes | 19 | 0.187250 | 0.710345 |
| Thur | No | 45 | 0.160298 | 0.266312 |
| Thur | Yes | 17 | 0.163863 | 0.241255 |
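The concat equivalence described above can be checked directly. A sketch on a made-up miniature of the tips data (the values are invented; only the structural equality matters):

```python
import pandas as pd

# A made-up mini version of the tips data
mini = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'No', 'Yes'],
    'tip_pct': [0.15, 0.17, 0.16, 0.15],
    'total_bill': [18.0, 17.0, 20.0, 21.0],
})
g = mini.groupby(['day', 'smoker'])
functions = ['count', 'mean', 'max']

agg_result = g[['tip_pct', 'total_bill']].agg(functions)

# Aggregate each column separately, then glue with concat using keys
pieces = [g[col].agg(functions) for col in ['tip_pct', 'total_bill']]
concat_result = pd.concat(pieces, axis=1, keys=['tip_pct', 'total_bill'])

assert agg_result.equals(concat_result)
```

Both routes yield the same hierarchically-columned DataFrame; agg just does it in one call.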
As before, a list of tuples with custom names can be passed:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
| day | smoker | tip_pct Durchschnitt | tip_pct Abweichung | total_bill Durchschnitt | total_bill Abweichung |
|-----|--------|------|------|------|------|
| Fri | No | 0.151650 | 0.000791 | 18.420000 | 25.596333 |
| Fri | Yes | 0.174783 | 0.002631 | 16.813333 | 82.562438 |
| Sat | No | 0.158048 | 0.001581 | 19.661778 | 79.908965 |
| Sat | Yes | 0.147906 | 0.003767 | 21.276667 | 101.387535 |
| Sun | No | 0.160113 | 0.001793 | 20.506667 | 66.099980 |
| Sun | Yes | 0.187250 | 0.023757 | 24.120000 | 109.046044 |
| Thur | No | 0.160298 | 0.001503 | 17.113111 | 59.625081 |
| Thur | Yes | 0.163863 | 0.001551 | 19.190588 | 69.808518 |
Now suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:
grouped.agg({'tip': np.max, 'size': 'sum'})
| day | smoker | tip | size |
|-----|--------|-----|------|
| Fri | No | 3.50 | 9 |
| Fri | Yes | 4.73 | 31 |
| Sat | No | 9.00 | 115 |
| Sat | Yes | 10.00 | 104 |
| Sun | No | 6.00 | 167 |
| Sun | Yes | 6.50 | 49 |
| Thur | No | 6.70 | 112 |
| Thur | Yes | 5.00 | 40 |
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std', 'sum'],
             'size': 'sum'})
| day | smoker | tip_pct min | tip_pct max | tip_pct mean | tip_pct std | tip_pct sum | size sum |
|-----|--------|------|------|------|------|------|------|
| Fri | No | 0.120385 | 0.187735 | 0.151650 | 0.028123 | 0.606602 | 9 |
| Fri | Yes | 0.103555 | 0.263480 | 0.174783 | 0.051293 | 2.621746 | 31 |
| Sat | No | 0.056797 | 0.291990 | 0.158048 | 0.039767 | 7.112145 | 115 |
| Sat | Yes | 0.035638 | 0.325733 | 0.147906 | 0.061375 | 6.212055 | 104 |
| Sun | No | 0.059447 | 0.252672 | 0.160113 | 0.042347 | 9.126438 | 167 |
| Sun | Yes | 0.065660 | 0.710345 | 0.187250 | 0.154134 | 3.557756 | 49 |
| Thur | No | 0.072961 | 0.266312 | 0.160298 | 0.038774 | 7.213414 | 112 |
| Thur | Yes | 0.090014 | 0.241255 | 0.163863 | 0.039389 | 2.785676 | 40 |
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
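This rule is easy to verify on a tiny made-up frame (the column names k, x, y are invented for the check):

```python
import pandas as pd

d = pd.DataFrame({'k': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4, 5, 6]})
g = d.groupby('k')

flat = g.agg({'x': 'sum', 'y': 'max'})              # one function per column
nested = g.agg({'x': ['sum', 'mean'], 'y': 'max'})  # a list for at least one column

assert flat.columns.nlevels == 1    # plain columns
assert nested.columns.nlevels == 2  # hierarchical columns
```

As soon as any column receives a list of functions, all result columns become a two-level MultiIndex.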
Dropping the group keys from the row index
- as_index=False
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn't always desirable, you can disable this behavior in most cases by passing as_index=False to groupby:
tips.groupby(['day', 'smoker'], as_index=False).mean()
|   | day | smoker | total_bill | tip | size | tip_pct |
|---|-----|--------|------------|-----|------|---------|
| 0 | Fri | No | 18.420000 | 2.812500 | 2.250000 | 0.151650 |
| 1 | Fri | Yes | 16.813333 | 2.714000 | 2.066667 | 0.174783 |
| 2 | Sat | No | 19.661778 | 3.102889 | 2.555556 | 0.158048 |
| 3 | Sat | Yes | 21.276667 | 2.875476 | 2.476190 | 0.147906 |
| 4 | Sun | No | 20.506667 | 3.167895 | 2.929825 | 0.160113 |
| 5 | Sun | Yes | 24.120000 | 3.516842 | 2.578947 | 0.187250 |
| 6 | Thur | No | 17.113111 | 2.673778 | 2.488889 | 0.160298 |
| 7 | Thur | Yes | 19.190588 | 3.030000 | 2.352941 | 0.163863 |
Of course, it's always possible to obtain the result in this format by calling reset_index on the result. Using as_index=False avoids some unnecessary computations.
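The reset_index equivalence can be checked on a made-up miniature of the tips data (the values are invented; only the equality of the two routes matters):

```python
import pandas as pd

mini = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat'],
    'smoker': ['No', 'Yes', 'No'],
    'tip': [2.0, 3.0, 4.0],
})
# Route 1: keep group keys as regular columns from the start
flat = mini.groupby(['day', 'smoker'], as_index=False).mean()
# Route 2: aggregate with an index, then move the keys back into columns
via_reset = mini.groupby(['day', 'smoker']).mean().reset_index()

assert flat.equals(via_reset)
```

Both produce the same flat DataFrame; as_index=False simply skips building the hierarchical index in the first place.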