    import numpy as np
    import pandas as pd

Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow.

After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.

pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

One reason for the popularity of relational databases and SQL is the ease with which data can be joined, filtered, transformed, and aggregated.

However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. As you will see, with the expressiveness of Python and pandas, we can perform quite complex group operations by utilizing any function that accepts a pandas object or NumPy array. In this chapter, you will learn how to:

  • Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
  • Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
  • Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
  • Compute pivot tables and cross-tabulations
  • Perform quantile analysis and other statistical group analyses

Aggregation of time series data, a special use case of groupby, is referred to as resampling in this book and will receive separate treatment in Chapter 11.

The GroupBy process

  • key -> data -> split -> apply -> combine
  • cj's note: this reminds me of MapReduce in big data processing

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations.

In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

Once this is done, a function is applied to each group, producing a new value.

Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data.

key -> data -> split -> apply -> combine
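
The three stages can be spelled out by hand. The following is only a sketch on a made-up DataFrame (the column names key and value are assumptions for illustration, not part of the chapter's dataset); it shows that a manual split, apply, and combine gives the same result as a single groupby call.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to check by hand.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 8.0],
})

# Split: collect the values belonging to each distinct key.
pieces = {k: df.loc[df["key"] == k, "value"] for k in sorted(df["key"].unique())}

# Apply: reduce each piece to a single number.
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied, name="value")
manual.index.name = "key"

# groupby performs the same three steps in one call.
builtin = df.groupby("key")["value"].mean()

print(manual)
print(builtin)
```

The manual version is far slower on real data; it is here only to make the stages visible.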

Each grouping key can take many forms, and the keys do not have to be all of the same type:

  • A list or array of values that is the same length as the axis being grouped
  • A value indicating a column name in a DataFrame
  • A dict or Series giving a correspondence between the values on the axis being grouped and the group names
  • A function to be invoked on the axis index or the individual labels in the index
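
As a rough illustration of the four key forms listed above, the sketch below groups the same made-up column in four ways and arrives at the same sums each time. The DataFrame, the row labels r0 to r4, and the helper group_of are all hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "a"], "x": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["r0", "r1", "r2", "r3", "r4"],
)

# 1. An array with the same length as the axis being grouped.
labels = np.array(["a", "a", "b", "b", "a"])
by_array = df["x"].groupby(labels).sum()

# 2. A column name in the DataFrame.
by_column = df.groupby("key")["x"].sum()

# 3. A dict mapping index labels to group names.
row_to_group = {"r0": "a", "r1": "a", "r2": "b", "r3": "b", "r4": "a"}
by_dict = df["x"].groupby(row_to_group).sum()

# 4. A function that is called once per index label.
def group_of(label):
    return "b" if label in ("r2", "r3") else "a"

by_function = df["x"].groupby(group_of).sum()

for result in (by_array, by_column, by_dict, by_function):
    print(result.to_dict())
```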

Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object. Don't worry if this all seems abstract. Throughout this chapter, I will give many examples of all these methods. To get started, here is a small tabular dataset as a DataFrame:

    df = pd.DataFrame({
        'key1': 'a a b b a'.split(),
        'key2': ['one', 'two', 'one', 'two', 'one'],
        'data1': np.random.randn(5),
        'data2': np.random.randn(5)
    })
    df

      key1 key2     data1     data2
    0    a  one -2.043830  0.364327
    1    a  two -0.595880  1.066501
    2    b  one -0.706536  0.936099
    3    b  two -1.444520 -0.561796
    4    a  one -1.632010 -0.188685

Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:

    grouped = df['data1'].groupby(df['key1'])
    grouped

    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EDE9DDCB00>

This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy's mean method:

    grouped.mean()

    key1
    a   -1.423906
    b   -1.075528
    Name: data1, dtype: float64
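
To see what a GroupBy object holds before any aggregation runs, you can inspect its ngroups and groups attributes. This is a small sketch with made-up numbers rather than the random values above.

```python
import pandas as pd

# Hypothetical data with the same key layout as the example DataFrame.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df["data1"].groupby(df["key1"])

# No means have been computed yet; the object only knows how rows map to groups.
n_groups = grouped.ngroups
row_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(n_groups)
print(row_labels)

# The aggregation happens only when a method such as mean() is called.
means = grouped.mean()
print(means)
```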

Later, I'll explain more about what happens when you call .mean(). The important thing here is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.

The result index has the name 'key1' because the DataFrame column df['key1'] did.

If instead we had passed multiple arrays as a list, we'd get something different:

    # group by multiple keys
    means = df['data1'].groupby([df['key1'], df['key2']]).mean()
    means

    key1  key2
    a     one    -1.837920
          two    -0.595880
    b     one    -0.706536
          two    -1.444520
    Name: data1, dtype: float64

Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

    # unstack: pivot the inner index level into columns (Series -> DataFrame)
    means.unstack()

    key2       one      two
    key1
    a    -1.837920 -0.59588
    b    -0.706536 -1.44452

In this example, the group keys are all Series, though they could be any arrays of the right length.

    # group by custom arrays that are not columns of df
    states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
    years = np.array([2005, 2005, 2006, 2005, 2006])
    df['data1'].groupby([states, years]).mean()

    California  2005   -0.595880
                2006   -0.706536
    Ohio        2005   -1.744175
                2006   -1.632010
    Name: data1, dtype: float64

Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:

    df.groupby('key1').mean()

             data1     data2
    key1
    a    -1.423906  0.414048
    b    -1.075528  0.187151

    df.groupby(['key1', 'key2']).mean()

                  data1     data2
    key1 key2
    a    one  -1.837920  0.087821
         two  -0.595880  1.066501
    b    one  -0.706536  0.936099
         two  -1.444520 -0.561796

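You may have noticed in the first case, df.groupby('key1').mean(), that there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column, which is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it's possible to filter down to a subset, as you'll see soon. Note that this describes the older pandas release used for these notes; from pandas 2.0 on, nuisance columns are no longer dropped silently, and mean() raises a TypeError unless you pass numeric_only=True or select the numeric columns first.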
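
How non-numeric columns are handled depends on the pandas version, so it can be safer to be explicit. The sketch below uses made-up numbers and shows two ways that should work on both older and newer releases: passing numeric_only=True, or selecting the columns to aggregate before calling mean().

```python
import pandas as pd

# Hypothetical data with the same column layout as the example DataFrame.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Option 1: ask mean() to skip non-numeric columns such as key2.
means_numeric = df.groupby("key1").mean(numeric_only=True)

# Option 2: select the columns to aggregate before calling mean().
means_selected = df.groupby("key1")[["data1", "data2"]].mean()

print(means_numeric)
print(means_selected)
```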

Regardless of the objective in using groupby, a generally useful GroupBy method is size, which returns a Series containing group sizes:

    df.groupby(['key1', 'key2']).size()  # number of rows in each group

    key1  key2
    a     one     2
          two     1
    b     one     1
          two     1
    dtype: int64
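
One detail worth knowing is that size counts every row in a group, while count skips missing values. A minimal sketch on made-up data (the column names key and x are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, 3.0, 4.0],
})

# size(): number of rows per group, missing values included.
sizes = df.groupby("key").size()

# count(): number of non-missing values per group.
counts = df.groupby("key")["x"].count()

print(sizes)
print(counts)
```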

Iterating over groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

    for name, group in df.groupby('key1'):
        print(name)
        print(group)

    a
      key1 key2    data1     data2
    0    a  one -2.04383  0.364327
    1    a  two -0.59588  1.066501
    4    a  one -1.63201 -0.188685
    b
      key1 key2     data1     data2
    2    b  one -0.706536  0.936099
    3    b  two -1.444520 -0.561796

In the case of multiple keys, the first element in the tuple will be a tuple of key values:

    for (k1, k2), group in df.groupby(['key1', 'key2']):
        print(k1, k2)
        print(group)

    a one
      key1 key2    data1     data2
    0    a  one -2.04383  0.364327
    4    a  one -1.63201 -0.188685
    a two
      key1 key2    data1     data2
    1    a  two -0.59588  1.066501
    b one
      key1 key2     data1     data2
    2    b  one -0.706536  0.936099
    b two
      key1 key2    data1     data2
    3    b  two -1.44452 -0.561796

Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces as a one-liner:

    pieces = dict(list(df.groupby('key1')))
    pieces['b']

      key1 key2     data1     data2
    2    b  one -0.706536  0.936099
    3    b  two -1.444520 -0.561796
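
If only one group is needed, building the whole dict is not necessary: the GroupBy method get_group returns a single piece directly. A short sketch on made-up numbers, comparing it with the dict recipe:

```python
import pandas as pd

# Hypothetical data with the same key layout as the example DataFrame.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df.groupby("key1")

# The dict recipe materializes every group at once.
pieces = dict(list(grouped))

# get_group fetches just the rows for one key.
only_b = grouped.get_group("b")

print(pieces["b"])
print(only_b)
```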

By default groupby groups on axis=0, but you can group on any of the other axes. For example, we could group the columns of our example df here by dtype like so (note that axis=1 in groupby is deprecated in recent pandas releases, starting with 2.1, so this form applies to older versions):

    grouped = df.groupby(df.dtypes, axis=1)

    for dtype, group in grouped:
        print(dtype)
        print(group)

    float64
          data1     data2
    0 -2.043830  0.364327
    1 -0.595880  1.066501
    2 -0.706536  0.936099
    3 -1.444520 -0.561796
    4 -1.632010 -0.188685
    object
      key1 key2
    0    a  one
    1    a  two
    2    b  one
    3    b  two
    4    a  one
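
Since grouping with axis=1 is deprecated in recent pandas releases (worth checking against your installed version), here is one possible way to split columns by dtype without it. This is a plain-Python sketch, not the book's method, and the names columns_by_dtype and pieces are made up for the example.

```python
import pandas as pd

# Hypothetical data with the same column layout as the example DataFrame.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Collect the column names that share each dtype.
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Build one sub-DataFrame per dtype.
pieces = {dtype: df[cols] for dtype, cols in columns_by_dtype.items()}

for dtype, piece in pieces.items():
    print(dtype)
    print(piece)
```

The dtype name reported for the string columns can differ between pandas versions, so the sketch does not rely on it.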

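Selecting a column or subset of columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that the following two expressions produce a grouped Series and a grouped DataFrame, respectively, each restricted to the selected columns: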

    df.groupby('key1')['data1']
    df.groupby('key1')[['data2']]

    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EDEEF248D0>
    <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001EDEEF24080>
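
Column selection on a GroupBy object can be read as shorthand for selecting the column first and then grouping by the key Series. The sketch below, on made-up numbers, checks that both spellings give the same means.

```python
import pandas as pd

# Hypothetical data with the same column layout as the example DataFrame.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Scalar column name: a grouped Series either way.
sugar_series = df.groupby("key1")["data1"].mean()
explicit_series = df["data1"].groupby(df["key1"]).mean()

# List of column names: a grouped DataFrame either way.
sugar_frame = df.groupby("key1")[["data2"]].mean()
explicit_frame = df[["data2"]].groupby(df["key1"]).mean()

print(sugar_series.equals(explicit_series))
print(sugar_frame.equals(explicit_frame))
```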

Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the data2 column and get the result as a DataFrame, we could write:

    df.groupby(['key1', 'key2'])[['data2']].mean()

                  data2
    key1 key2
    a    one   0.087821
         two   1.066501
    b    one   0.936099
         two  -0.561796

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed or a grouped Series if only a single column name is passed as a scalar:

    s_grouped = df.groupby(['key1', 'key2'])['data2']
    s_grouped

    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000001EDEEBD4320>

    s_grouped.mean()

    key1  key2
    a     one     0.087821
          two     1.066501
    b     one     0.936099
          two    -0.561796
    Name: data2, dtype: float64

Grouping with dicts and Series

Grouping information may exist in a form other than an array. Let's consider another example DataFrame:

    people = pd.DataFrame(np.random.randn(5, 5),
                          columns=['a', 'b', 'c', 'd', 'e'],
                          index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
    people.iloc[2:3, [1, 2]] = np.nan  # third row (Wes), second and third columns (b, c)
    people  # iloc slices are still half-open: 2:3 selects only row 2

                   a         b         c         d         e
    Joe    -0.629941 -0.537660  0.392397 -0.489149 -1.668533
    Steve  -0.941174  0.926352  0.858621 -0.005732  0.289938
    Wes    -2.316073       NaN       NaN  0.765298  0.107972
    Jim     0.169023 -0.859168 -0.408575  0.928599 -1.519773
    Travis  0.913253  0.851410  0.672238 -0.040455  0.722729

Now, suppose I have a group correspondence for the columns and want to sum together the columns by group:

    mapping = {
        'a': 'red', 'b': 'red', 'c': 'blue',
        'd': 'blue', 'e': 'red', 'f': 'orange'
    }

Now, you could construct an array from this dict to pass to groupby, but instead we can just pass the dict. The key 'f' is included to show that unused grouping keys are fine:

    by_column = people.groupby(mapping, axis=1)
    by_column.sum()

                blue       red
    Joe    -0.096752 -2.836134
    Steve   0.852889  0.275115
    Wes     0.765298 -2.208101
    Jim     0.520024 -2.209918
    Travis  0.631783  2.487391
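
On pandas versions where axis=1 is deprecated or unavailable in groupby, one workaround is to transpose, group the rows, and transpose back. The sketch below uses a small made-up frame so the sums can be checked by hand; it is an alternative spelling, not the code used for the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row version of the people DataFrame.
people = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [6.0, np.nan, np.nan, 9.0, 10.0]],
    columns=["a", "b", "c", "d", "e"],
    index=["Joe", "Wes"],
)

# The unused key 'f' is ignored.
mapping = {"a": "red", "b": "red", "c": "blue",
           "d": "blue", "e": "red", "f": "orange"}

# Transpose so the columns become rows, group them, then transpose back.
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

Transposing copies the data, so for very wide or mixed-dtype frames this may be slower than it looks.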

The same functionality holds for Series, which can be viewed as a fixed-size mapping:

    map_series = pd.Series(mapping)
    map_series

    a       red
    b       red
    c      blue
    d      blue
    e       red
    f    orange
    dtype: object

    people.groupby(map_series, axis=1).count()

            blue  red
    Joe        2    3
    Steve      2    3
    Wes        1    2
    Jim        2    3
    Travis     2    3

Grouping with functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names. More concretely, consider the example DataFrame from the previous section, which has people's first names as index values. Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the len function:

    people.groupby(len).sum()

              a         b         c         d         e
    3 -2.776991 -1.396828 -0.016179  1.204748 -3.080333
    5 -0.941174  0.926352  0.858621 -0.005732  0.289938
    6  0.913253  0.851410  0.672238 -0.040455  0.722729
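
Passing len is shorthand for computing the array of name lengths yourself. A minimal sketch with a one-column, made-up version of people shows that both routes give the same sums.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame with the same index as people.
people = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# The function is called once per index label.
by_function = people.groupby(len).sum()

# Equivalent: build the array of lengths explicitly and group by it.
name_lengths = np.array([len(name) for name in people.index])
by_array = people.groupby(name_lengths).sum()

print(by_function)
print(by_array)
```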

Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally:

    key_list = ['one', 'one', 'one', 'two', 'two']
    people.groupby([len, key_list]).min()

                  a         b         c         d         e
    3 one -2.316073 -0.537660  0.392397 -0.489149 -1.668533
      two  0.169023 -0.859168 -0.408575  0.928599 -1.519773
    5 one -0.941174  0.926352  0.858621 -0.005732  0.289938
    6 two  0.913253  0.851410  0.672238 -0.040455  0.722729

Grouping by index levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let's look at an example:

    # build a DataFrame with hierarchical (multi-level) columns
    columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                         [1, 3, 5, 1, 3]],
                                        names=['cty', 'tenor'])
    hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
    hier_df

    cty          US                            JP
    tenor         1         3         5         1         3
    0     -1.215596  1.387763  2.339534  1.593265 -1.508100
    1      1.519981  0.756772  0.855458  0.403545  0.538324
    2     -2.079601 -0.487087  2.686048  1.471465  0.206721
    3     -0.165285  0.374494 -1.994196 -0.345347  2.343612

To group by level, pass the level number or name using the level keyword:

    hier_df.groupby(level='cty', axis=1).count()

    cty  JP  US
    0     2   3
    1     2   3
    2     2   3
    3     2   3
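
The level keyword also works on the row axis, so a version of this that avoids axis=1 is to transpose first and transpose back afterwards. The sketch below assumes fixed values instead of random ones so the counts are easy to verify.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)

# Hypothetical values; only the column structure matters for count().
hier_df = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transpose so the column MultiIndex becomes the row index,
# group by its 'cty' level, then transpose back.
counts = hier_df.T.groupby(level="cty").count().T
print(counts)
```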
