import numpy as np
import pandas as pd

There are a number of basic operations for rearanging tabular data. These are alternatingly referred to as reshape or pivot operations.

多层索引重塑

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

stack - 列拉长index

​ This "rotates" or pivots from the columns in the data to the rows.

unstack

​ This pivots from the rows into the columns.



I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number')) data

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
number one two three
state
Ohio 0 1 2
Colorado 3 4 5

Using the stack method on this data pivots the columns into the rows, producing a Series.

"stack 将每一行, 叠成一个Series, 堆起来"
result = data.stack() result
'stack 将每一行, 叠成一个Series, 堆起来'

state     number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32

From a hierarchically indexed Series, you can rearrage the data back into a DataFrame with unstack

"unstack 将叠起来的Series, 变回DF"

result.unstack()
'unstack 将叠起来的Series, 变回DF'

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
number one two three
state
Ohio 0 1 2
Colorado 3 4 5

By default the innermost level is unstacked(same with stack). You can unstack a different level by passing a level number or name.

result.unstack(level=0)

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
result.unstack(level='state')

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5

Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])

data2
one  a    0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack()  # 外连接哦

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
%time data2.unstack().stack()
Wall time: 5 ms

one  a    0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms

one  a    0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64

When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

df = pd.DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'], name='side'))
df

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
df.unstack("state")

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead tr th {
text-align: left;
} .dataframe thead tr:last-of-type th {
text-align: right;
}
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10

When calling stack, we can indicate the name of the axis to stack:

%time df.unstack('state').stack('side')
Wall time: 118 ms

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7

长转宽形

A common way to store multiple time series in databases and CSV is in so-called long or stacked format. Let's load some example data and do a small amonut of time series wrangling and other data cleaning:

%%time

data = pd.read_csv("../examples/macrodata.csv")

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year 203 non-null float64
quarter 203 non-null float64
realgdp 203 non-null float64
realcons 203 non-null float64
realinv 203 non-null float64
realgovt 203 non-null float64
realdpi 203 non-null float64
cpi 203 non-null float64
m1 203 non-null float64
tbilrate 203 non-null float64
unemp 203 non-null float64
pop 203 non-null float64
infl 203 non-null float64
realint 203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')

columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')

# 修改列索引名
data = data.reindex(columns=columns) data.index = periods.to_timestamp('D', 'end') ldata = data.stack().reset_index().rename(columns={0:'value'}) ldata[:10]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
date item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
5 1959-06-30 unemp 5.100
6 1959-09-30 realgdp 2775.488
7 1959-09-30 infl 2.740
8 1959-09-30 unemp 5.300
9 1959-12-31 realgdp 2785.204

This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.

Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item columns to change as data is added to the table. In the previous example, date and keys offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:

pivoted = ldata.pivot('date', 'item', 'value')

pivoted[:5]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
item infl realgdp unemp
date
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2

The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

ldata['valu2'] = np.random.randn(len(ldata))

ldata[:10]

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
date item value valu2
0 1959-03-31 realgdp 2710.349 -0.143460
1 1959-03-31 infl 0.000 -0.422318
2 1959-03-31 unemp 5.800 0.389872
3 1959-06-30 realgdp 2778.801 -0.208526
4 1959-06-30 infl 2.340 -1.538956
5 1959-06-30 unemp 5.100 -0.143273
6 1959-09-30 realgdp 2775.488 0.385763
7 1959-09-30 infl 2.740 0.564365
8 1959-09-30 unemp 5.300 0.266295
9 1959-12-31 realgdp 2785.204 -1.267871

By omitting the last argument, you obtain a DataFrame with hierarchical columns:

Wide to Long

An inverse operation to pivot for DataFrame is pandas.melt. Rather than transroming one columns into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input, Let's look at an example:

df = pd.DataFrame({
'key': ['foo', 'bar', 'baz'],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]
}) df

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9

The 'key' columns may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which colmuns are group indicators Let's use 'key' as the only group indicator here:

melted = pd.melt(df, ['key'])

melted

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9

Using pivot, we can reshape back to the original layout:(布局)

reshaped = melted.pivot('key', 'variable', 'value')

reshaped

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7

Since the result of pivot creats an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

reshaped.reset_index()

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7

You can also specify a subset of columns to use as value columns:

pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6

pandas.melt can be used without any group identifiers, too:

pd.melt(df, value_vars=['A', 'B', 'C'])

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
pd.melt(df, value_vars=['key', 'A', 'B'])

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
} .dataframe thead th {
text-align: right;
}
variable value
0 key foo
1 key bar
2 key baz
3 A 1
4 A 2
5 A 3
6 B 4
7 B 5
8 B 6

小结

Now that you have some pandas basics for data import, clearning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advance analytics.

pandas 之 索引重塑的更多相关文章

  1. pandas重置索引的几种方法探究

    pandas重置索引的几种方法探究 reset_index() reindex() set_index() 函数名字看起来非常有趣吧! 不仅如此. 需要探究. http://nbviewer.jupy ...

  2. (三)pandas 层次化索引

    pandas层次化索引 1. 创建多层行索引 1) 隐式构造 最常见的方法是给DataFrame构造函数的index参数传递两个或更多的数组 Series也可以创建多层索引 import numpy ...

  3. pandas 数据索引与选取

    我们对 DataFrame 进行选择,大抵从这三个层次考虑:行列.区域.单元格.其对应使用的方法如下:一. 行,列 --> df[]二. 区域   --> df.loc[], df.ilo ...

  4. Pandas之索引

    Pandas的标签处理需要分成多种情况来处理,Series和DataFrame根据标签索引数据的操作方法是不同的,单列索引和双列索引的操作方法也是不同的. 单列索引 In [2]: import pa ...

  5. pandas重新索引

    #重新索引会更改DataFrame的行标签和列标签.重新索引意味着符合数据以匹配特定轴上的一组给定的标签. #可以通过索引来实现多个操作 - #重新排序现有数据以匹配一组新的标签. #在没有标签数据的 ...

  6. pandas DataFrame 索引(iloc 与 loc 的区别)

    Pandas--ix vs loc vs iloc区别 0. DataFrame DataFrame 的构造主要依赖如下三个参数: data:表格数据: index:行索引: columns:列名: ...

  7. Pandas重建索引

    重新索引会更改DataFrame的行标签和列标签.重新索引意味着符合数据以匹配特定轴上的一组给定的标签. 可以通过索引来实现多个操作 - 重新排序现有数据以匹配一组新的标签. 在没有标签数据的标签位置 ...

  8. pandas层级索引1

    层级索引(hierarchical indexing) 下面创建一个Series, 在输入索引Index时,输入了由两个子list组成的list,第一个子list是外层索引,第二个list是内层索引. ...

  9. pandas层级索引

    层级索引(hierarchical indexing) 下面创建一个Series, 在输入索引Index时,输入了由两个子list组成的list,第一个子list是外层索引,第二个list是内层索引. ...

随机推荐

  1. CSP 2019游记 & 退役记

    扶苏让我记录他AK CSP 的事实 ZAY NB!!! "你不配" 两年半的旅行结束了,我背着满满的行囊下了车,望着毫不犹豫远去的列车,我笑着哭了,笑着翻着我的行囊-- 游记 Da ...

  2. 洛谷 P5057 [CQOI2006]简单题 题解

    P5057 [CQOI2006]简单题 题目描述 有一个 n 个元素的数组,每个元素初始均为 0.有 m 条指令,要么让其中一段连续序列数字反转--0 变 1,1 变 0(操作 1),要么询问某个元素 ...

  3. 【题解】CF161B Discounts

    目录 题目 思路 \(Code\) 题目 CF161B Discounts 思路 贪心.很显然对于一个板凳(价格为c)所能使我们最多少花费\(\frac{c}{2}\)的金钱. 原因如下: 如果你将一 ...

  4. php获取客户端公网ip代码

    <?php /*如果是本地服务器获取客户端的ip地址是 127.0.0.1 如果是域名服务器获取客户端的是公网ip地址*/ function get_client_ip() { $ipaddre ...

  5. gamma测试报告

    Gamma阶段测试报告 测试计划及结果 我们针对测试做了比较多的改进. 测试代码分为针对纯java部分的单元测试和需要android运行环境的自动化仪器化测试 单元测试 这一部分基本继承Beta阶段的 ...

  6. 在centos系统的/etc/hosts添加了 当前主机的 ‘ NAT分配的IP controller’,RabbitMQ添加用户报错。

    在centos系统的/etc/hosts添加了 当前主机的 ' NAT分配的IP controller',RabbitMQ添加用户报错. rabbitMq添加用户 报错信息如下 [root@contr ...

  7. 【p6spy学习之一】p6spy使用

    一.介绍 p6spy是一个开源项目,通常使用它来跟踪数据库操作,查看程序运行过程中执行的sql语句.1.原理 p6spy将应用的数据源给劫持了,应用操作数据库其实在调用p6spy的数据源,p6spy劫 ...

  8. 如何快速将磁盘的MBR分区方式改成GPT分区方式

    注:修改分区格式时此硬盘不能是在使用状态(简单说就是不能出现在盘符中),如果在使用中先在计算机的磁盘管理中删除卷. 由于MBR分区表模式的硬盘最大只支持2T的硬盘空间,而现在我们的硬盘越来越大,有时候 ...

  9. [原创]K8Cscan4.0之Base64/HEX密码批量加密解密插件以及源码

    前言 今天抽空更新了Cscan,新增对C#编译的EXE动态调用,新增对PowerShell脚本动态调用(无论是否安装PowerShell) 增加一个字符串列表str.txt,用于存放任意字符串,比如帐 ...

  10. Matlab匿名函数

    Matlab可以通过function去定义一些功能函数,这使得代码变得简洁和高效.但是如果遇到的是一些简单的数学公式组成的函数表达式,继续用function去定义函数,似乎显得有些冗杂和多余.这时候, ...