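Advanced Pandas Tutorial: Using GroupBy

Introduction

A pandas DataFrame (DF) supports groupby operations much like a database table does. A groupby operation generally has three steps: splitting the data into groups, applying a function to each group, and combining the results.

This article walks through the groupby operations in pandas in detail.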
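
The three steps can be sketched by hand and compared with a single groupby call. This is only an illustration under assumed data: the `sales` table and its `city` and `amount` columns are hypothetical and do not come from the article.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article.
sales = pd.DataFrame(
    {
        "city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo"],
        "amount": [10, 20, 30, 40, 50],
    }
)

# Split: one sub-frame per distinct key.
pieces = {key: sales[sales["city"] == key] for key in sales["city"].unique()}

# Apply: reduce each piece to a single number.
totals = {key: piece["amount"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(totals, name="amount").sort_index()

# groupby performs the same three steps in one call.
grouped = sales.groupby("city")["amount"].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

Both printed dictionaries should hold the same per-city totals, which is the point of the comparison.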

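Splitting the data

The goal of splitting is to divide the DF into groups. To run a groupby, the DF needs labeled columns to group on, which are specified when the DF is created: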

In [60]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

In [61]: df

Out[61]:

     A      B         C         D

0  foo    one -0.490565 -0.233106

1  bar    one  0.430089  1.040789

2  foo    two  0.653449 -1.155530

3  bar  three -0.610380 -0.447735

4  foo    two -0.934961  0.256358

5  bar    two -0.256263 -0.661954

6  foo    one -1.132186 -0.304330

7  foo  three  2.129757  0.445744

By default, groupby splits along axis 0, that is, it groups the rows. You can group by a single column or by several columns:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

MultiIndex

Since version 0.24, if the DF has a MultiIndex, we can group by selected index levels. The example below groups by every level except B:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()

Out[12]:

            C         D

A

bar -1.591710 -1.739537

foo -0.752861 -1.402938
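
A deterministic version of the same idea, as a minimal sketch: the values below are fixed, made-up numbers rather than the random data above, and they show that excluding level `B` via `difference` is the same as naming level `A` directly.

```python
import pandas as pd

# Hypothetical fixed values so the result is reproducible.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)
df2 = df.set_index(["A", "B"])

# All level names except "B"; for this frame that leaves only "A".
levels = df2.index.names.difference(["B"])
by_difference = df2.groupby(level=levels).sum()

# Naming the remaining level directly produces the same sums.
by_name = df2.groupby(level="A").sum()

print(by_difference)
print(by_name)
```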

get_group

get_group returns the rows that belong to a single group:

In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby("X").get_group("A")

Out[25]:

   X  Y

0  A  1

2  A  3

In [26]: df3.groupby("X").get_group("B")

Out[26]:

   X  Y

1  B  4

3  B  2
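
When grouping by several columns, the group key passed to get_group is a tuple with one entry per column. A minimal sketch, assuming an extra column `Z` that is not in the article's `df3`:

```python
import pandas as pd

# "Z" is a hypothetical extra column added for illustration.
df3 = pd.DataFrame(
    {
        "X": ["A", "B", "A", "B"],
        "Z": ["u", "u", "v", "v"],
        "Y": [1, 4, 3, 2],
    }
)

# With two grouping columns, the key is a 2-tuple.
group = df3.groupby(["X", "Z"]).get_group(("A", "v"))

print(group)
```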

dropna

By default, rows whose group key is NaN are excluded from the groupby result. Setting dropna=False keeps them as a group of their own:

In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna

Out[29]:

   a    b  c

0  1  2.0  3

1  1  NaN  4

2  2  1.0  3

3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys

In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()

Out[30]:

     a  c

b

1.0  2  3

2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False

In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()

Out[31]:

     a  c

b

1.0  2  3

2.0  2  5

NaN  1  4
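
One practical consequence, sketched here with the same `df_dropna` data: with the default setting, the group totals no longer add up to the column total, because the row with a NaN key is silently left out.

```python
import pandas as pd

df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

# Default behaviour: the row whose key "b" is NaN is excluded.
dropped = df_dropna.groupby("b")["c"].sum()

# dropna=False keeps that row in a separate NaN group.
kept = df_dropna.groupby("b", dropna=False)["c"].sum()

# Pick out the total of the NaN group.
nan_total = kept[kept.index.isna()].iloc[0]

print(dropped.sum(), kept.sum(), df_dropna["c"].sum())
print(nan_total)
```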

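The groups attribute

A groupby object has a groups attribute. It is a dictionary whose keys are the group keys and whose values are the index labels of the rows that belong to each group.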

In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups

Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)

Out[36]: 6
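
Because the values in groups are index labels, they can be passed straight to `.loc`. The sketch below uses a made-up frame with string index labels to make the difference between labels (`groups`) and integer positions (`indices`) visible.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)
grouped = df.groupby("A")

# groups maps each key to the index labels of its rows.
labels = {key: list(idx) for key, idx in grouped.groups.items()}

# indices maps each key to integer positions instead.
positions = {key: pos.tolist() for key, pos in grouped.indices.items()}

# The labels can be used to select the rows of one group.
foo_rows = df.loc[grouped.groups["foo"]]

print(labels)
print(positions)
print(foo_rows)
```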

Index levels

For an object with a MultiIndex, groupby can group on a specific index level:

In [40]: arrays = [

   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],

   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],

   ....: ]

   ....: 

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s

Out[43]:

first  second

bar    one      -0.919854

       two      -0.042379

baz    one       1.247642

       two      -0.009920

foo    one       0.290213

       two       0.495767

qux    one       0.362949

       two       1.548106

dtype: float64

Group by the first level:

In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()

Out[45]:

first

bar   -0.962232

baz    1.237723

foo    0.785980

qux    1.911055

dtype: float64

Group by the second level:

In [46]: s.groupby(level="second").sum()

Out[46]:

second

one    0.980950

two    1.991575

dtype: float64
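
Since the series above is random, here is a small reproducible sketch of the same two calls. The values are made up; a level can be given either by position or by name.

```python
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz"],
    ["one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

# Fixed, hypothetical values instead of np.random.randn.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Level given by position.
by_first = s.groupby(level=0).sum()

# Level given by name.
by_second = s.groupby(level="second").sum()

print(by_first)
print(by_second)
```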

Iterating over groups

Once we have the grouped object, we can iterate over the groups with a for loop:

In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:

   ....:     print(name)

   ....:     print(group)

   ....:

bar

     A      B         C         D

1  bar    one  0.254161  1.511763

3  bar  three  0.215897 -0.990582

5  bar    two -0.077118  1.211526

foo

     A      B         C         D

0  foo    one -0.575247  1.346061

2  foo    two -1.143704  1.627081

4  foo    two  1.193555 -0.441652

6  foo    one -0.408530  0.268520

7  foo  three -0.862495  0.024580

When grouping by multiple columns, the group name is a tuple:

In [64]: for name, group in df.groupby(['A', 'B']):

   ....:     print(name)

   ....:     print(group)

   ....:

('bar', 'one')

     A    B         C         D

1  bar  one  0.254161  1.511763

('bar', 'three')

     A      B         C         D

3  bar  three  0.215897 -0.990582

('bar', 'two')

     A    B         C         D

5  bar  two -0.077118  1.211526

('foo', 'one')

     A    B         C         D

0  foo  one -0.575247  1.346061

6  foo  one -0.408530  0.268520

('foo', 'three')

     A      B         C        D

7  foo  three -0.862495  0.02458

('foo', 'two')

     A    B         C         D

2  foo  two -1.143704  1.627081

4  foo  two  1.193555 -0.441652
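
Iteration also works inside comprehensions, which is a compact way to collect something per group. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo"],
        "B": ["one", "one", "two"],
        "C": [1, 2, 3],
    }
)

# Single key: each name is a scalar.
single = {name: len(group) for name, group in df.groupby("A")}

# Two keys: each name is a tuple.
multi = {name: group["C"].sum() for name, group in df.groupby(["A", "B"])}

print(single)
print(multi)
```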

Aggregation

After grouping, we can aggregate each group:

In [67]: grouped = df.groupby("A")

In [68]: grouped[["C", "D"]].aggregate("sum")

Out[68]:

            C         D

A

bar  0.392940  1.732707

foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)

Out[70]:

                  C         D

A   B

bar one    0.254161  1.511763

    three  0.215897 -0.990582

    two   -0.077118  1.211526

foo one   -0.983776  1.614581

    three -0.862495  0.024580

    two    0.049851  1.185429

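When grouping by multiple keys, the result is indexed by a MultiIndex by default. To keep the group keys as ordinary columns with a fresh integer index instead, pass as_index=False: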

In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)

Out[72]:

     A      B         C         D

0  bar    one  0.254161  1.511763

1  bar  three  0.215897 -0.990582

2  bar    two -0.077118  1.211526

3  foo    one -0.983776  1.614581

4  foo  three -0.862495  0.024580

5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False)[["C", "D"]].sum()

Out[73]:

     A         C         D

0  bar  0.392940  1.732707

1  foo -1.796421  2.824590

The effect is the same as calling reset_index on the default result:

In [74]: df.groupby(["A", "B"]).sum().reset_index()
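
The equivalence can be checked directly. The sketch below uses fixed, made-up values and compares the two results:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "one", "two"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# Group keys kept as columns.
flat = df.groupby(["A", "B"], as_index=False).sum()

# Group keys moved from the index back into columns.
reset = df.groupby(["A", "B"]).sum().reset_index()

print(flat)
print(flat.equals(reset))
```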

grouped.size() returns the number of rows in each group:

In [75]: grouped.size()

Out[75]:

     A      B  size

0  bar    one     1

1  bar  three     1

2  bar    two     1

3  foo    one     2

4  foo  three     1

5  foo    two     2

grouped.describe() returns summary statistics for each group:

In [76]: grouped.describe()

Out[76]:

      C                                                    ...         D

  count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max

0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763

1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582

2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526

3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061

4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580

5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]

Common aggregation methods

The commonly used aggregation methods are listed below:

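Function	Description
`mean()`	Mean of each group
`sum()`	Sum of each group
`size()`	Number of rows in each group
`count()`	Number of non-NA values in each group
`std()`	Standard deviation
`var()`	Variance
`sem()`	Standard error of the mean
`describe()`	Descriptive statistics
`first()`	First value in each group
`last()`	Last value in each group
`nth()`	n-th value in each group
`min()`	Minimum
`max()`	Maximum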
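
The difference between `size()` and `count()` only shows up when values are missing: `size()` counts rows, `count()` counts non-NA values. A minimal sketch with a hypothetical `key`/`val` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "b"],
        "val": [1.0, np.nan, 3.0, 4.0, np.nan],
    }
)
g = df.groupby("key")["val"]

sizes = g.size()    # rows per group, NaN included
counts = g.count()  # non-NA values per group
means = g.mean()    # NaN values are skipped

print(sizes.to_dict())
print(counts.to_dict())
print(means.to_dict())
```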

Applying several aggregations at once

Several aggregation functions can be applied in one call by passing a list:

In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])

Out[82]:

          sum      mean       std

A

bar  0.392940  0.130980  0.181231

foo -1.796421 -0.359284  0.912265

The resulting columns can be renamed:

In [84]: (

   ....:     grouped["C"]

   ....:     .agg([np.sum, np.mean, np.std])

   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})

   ....: )

   ....:

Out[84]:

          foo       bar       baz

A

bar  0.392940  0.130980  0.181231

foo -1.796421 -0.359284  0.912265
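
Aggregations can also be named by string, and on a single selected column keyword arguments give the output columns their names in one step. This is a sketch with made-up values; the output names `total` and `average` are arbitrary.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]}
)
grouped = df.groupby("A")

# A list of function names yields one column per function.
stats = grouped["C"].agg(["sum", "mean", "std"])

# Keyword arguments name the output columns directly.
renamed = grouped["C"].agg(total="sum", average="mean")

print(stats)
print(renamed)
```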

NamedAgg

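NamedAgg gives finer control over an aggregation. It has two fields, column and aggfunc, and the keyword it is passed under becomes the name of the output column.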

In [88]: animals = pd.DataFrame(

   ....:     {

   ....:         "kind": ["cat", "dog", "cat", "dog"],

   ....:         "height": [9.1, 6.0, 9.5, 34.0],

   ....:         "weight": [7.9, 7.5, 9.9, 198.0],

   ....:     }

   ....: )

   ....: 

In [89]: animals

Out[89]:

  kind  height  weight

0  cat     9.1     7.9

1  dog     6.0     7.5

2  cat     9.5     9.9

3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(

   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),

   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),

   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),

   ....: )

   ....:

Out[90]:

      min_height  max_height  average_weight

kind

cat          9.1         9.5            8.90

dog          6.0        34.0          102.75

Alternatively, pass plain (column, aggfunc) tuples:

In [91]: animals.groupby("kind").agg(

   ....:     min_height=("height", "min"),

   ....:     max_height=("height", "max"),

   ....:     average_weight=("weight", np.mean),

   ....: )

   ....:

Out[91]:

      min_height  max_height  average_weight

kind

cat          9.1         9.5            8.90

dog          6.0        34.0          102.75
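
The aggfunc slot also accepts a callable, which helps when no built-in name fits. A minimal sketch on the same `animals` data; the output names `height_range` and `total_weight` are chosen here for illustration.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

result = animals.groupby("kind").agg(
    # Custom callable: spread between tallest and shortest animal.
    height_range=("height", lambda s: s.max() - s.min()),
    # Built-in aggregation referenced by name.
    total_weight=("weight", "sum"),
)

print(result)
```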

Different aggregations for different columns

Passing a dictionary to agg applies a different aggregation to each column:

In [95]: grouped.agg({"C": "sum", "D": "std"})

Out[95]:

            C         D

A

bar  0.392940  1.366330

foo -1.796421  0.884785
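
A dictionary value may also be a list of functions, in which case the result gets two-level column labels. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

# "C" gets two aggregations, "D" gets one.
result = df.groupby("A").agg({"C": ["sum", "mean"], "D": "max"})

print(result)
print(result.columns.tolist())
```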

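Transformation

A transformation returns an object of the same size as the one being grouped. Transformations are needed frequently in data analysis.

transform accepts a lambda (here ts is assumed to be a Series with a datetime index):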

In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

Filling NA values with the mean of each group:

In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
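
The two snippets above rely on `ts` and `grouped` objects that are not defined in this article. The sketch below supplies small, made-up stand-ins so that both calls can be run end to end.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series that crosses a year boundary.
index = pd.date_range("2020-12-30", periods=4, freq="D")
ts = pd.Series([1.0, 5.0, 10.0, 4.0], index=index)

# Each value is replaced by the max-min spread of its year,
# so the result has the same length as ts.
spread = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

# Hypothetical frame with gaps to fill.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "val": [1.0, np.nan, np.nan, 4.0]}
)

# Each NaN is replaced by the mean of its own group.
filled = df.groupby("key")["val"].transform(lambda x: x.fillna(x.mean()))

print(spread)
print(filled)
```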

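Filtration

The filter method takes a function, typically a lambda, that returns True or False for each group. Groups for which it returns False are dropped from the result: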

In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)

Out[137]:

3    3

4    3

5    3

dtype: int64
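
filter works the same way on a DataFrame, where the function receives each group as a sub-frame. A minimal sketch that keeps only groups with at least two rows, using made-up data:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {"A": ["x", "x", "y", "z", "z", "z"], "B": [1, 2, 3, 4, 5, 6]}
)

# Keep the rows of every group that has two or more members.
big = df.groupby("A").filter(lambda g: len(g) >= 2)

print(big)
```

The rows keep their original index labels, so the single-row group simply disappears from the result.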

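The apply method

Some operations on grouped data fit neither aggregation nor transformation. For these, pandas provides the apply method, which is more flexible than either.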

In [156]: df

Out[156]:

     A      B         C         D

0  foo    one -0.575247  1.346061

1  bar    one  0.254161  1.511763

2  foo    two -1.143704  1.627081

3  bar  three  0.215897 -0.990582

4  foo    two  1.193555 -0.441652

5  bar    two -0.077118  1.211526

6  foo    one -0.408530  0.268520

7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()

In [158]: grouped["C"].apply(lambda x: x.describe())

Out[158]:

A

bar  count    3.000000

     mean     0.130980

     std      0.181231

     min     -0.077118

     25%      0.069390

                ...

foo  min     -1.143704

     25%     -0.862495

     50%     -0.575247

     75%     -0.408530

     max      1.193555

Name: C, Length: 16, dtype: float64

apply also accepts a named function:

In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):

   .....:     return pd.DataFrame({'original': group,

   .....:                          'demeaned': group - group.mean()})

   .....: 

In [161]: grouped.apply(f)

Out[161]:

   original  demeaned

0 -0.575247 -0.215962

1  0.254161  0.123181

2 -1.143704 -0.784420

3  0.215897  0.084917

4  1.193555  1.552839

5 -0.077118 -0.208098

6 -0.408530 -0.049245

7 -0.862495 -0.503211
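
The shape of the result depends on what the function returns. When it returns a scalar, apply produces one value per group. A small sketch with made-up values; `value_range` is a hypothetical helper defined here.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 5.0, 3.0]}
)


def value_range(group):
    """Return the spread between the largest and smallest value."""
    return group.max() - group.min()


# One scalar per group, indexed by the group key.
ranges = df.groupby("A")["C"].apply(value_range)

print(ranges)
```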

This article is also published at http://www.flydean.com/11-python-pandas-groupby/
