pandas: Reshaping and Pivoting
import numpy as np
import pandas as pd
There are a number of basic operations for rearranging tabular data, variously referred to as reshape or pivot operations.
Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack — folds columns into a longer row index
This "rotates" or pivots from the columns in the data to the rows.
unstack
This pivots from the rows into the columns.
I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))
data
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
Using the stack method on this data pivots the columns into the rows, producing a Series.
"stack 将每一行, 叠成一个Series, 堆起来"
result = data.stack()
result
'stack 将每一行, 叠成一个Series, 堆起来'
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
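The result is a Series with a two-level MultiIndex (state, number), so the usual Series selection applies. A quick sketch using the result object above:
result['Ohio']             # outer-level selection -> Series indexed by number
result['Colorado', 'two']  # full-key selection -> 4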
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
"unstack 将叠起来的Series, 变回DF"
result.unstack()
'unstack 将叠起来的Series, 变回DF'
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
By default, the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name.
result.unstack(level=0)
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
result.unstack(level='state')
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack()  # behaves like an outer join; missing combinations become NaN
a | b | c | d | e | |
---|---|---|---|---|---|
one | 0.0 | 1.0 | 2.0 | 3.0 | NaN |
two | NaN | NaN | 4.0 | 5.0 | 6.0 |
%time data2.unstack().stack()
Wall time: 5 ms
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
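Stacking filters out missing data by default (as shown above), so a round trip through unstack and stack is lossless. If you would rather avoid the NaN values in the first place, unstack also accepts a fill_value argument; a small sketch on data2 from above:
data2.unstack(fill_value=0)  # missing slots become 0 instead of NaN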
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
side | left | right | |
---|---|---|---|
state | number | ||
Ohio | one | 0 | 5 |
two | 1 | 6 | |
three | 2 | 7 | |
Colorado | one | 3 | 8 |
two | 4 | 9 | |
three | 5 | 10 |
df.unstack("state")
side | left | right | ||
---|---|---|---|---|
state | Ohio | Colorado | Ohio | Colorado |
number | ||||
one | 0 | 3 | 5 | 8 |
two | 1 | 4 | 6 | 9 |
three | 2 | 5 | 7 | 10 |
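The hierarchical columns in this result can be selected like any other MultiIndex columns; for instance, pulling out just the 'left' block (a quick sketch on the frame above):
df.unstack('state')['left']  # -> DataFrame with columns Ohio and Colorado, indexed by number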
When calling stack, we can indicate the name of the axis to stack:
%time df.unstack('state').stack('side')
Wall time: 118 ms
state | Colorado | Ohio | |
---|---|---|---|
number | side | ||
one | left | 3 | 0 |
right | 8 | 5 | |
two | left | 4 | 1 |
right | 9 | 6 | |
three | left | 5 | 2 |
right | 10 | 7 |
Pivoting "Long" to "Wide" Format
A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:
%%time
data = pd.read_csv("../examples/macrodata.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year 203 non-null float64
quarter 203 non-null float64
realgdp 203 non-null float64
realcons 203 non-null float64
realinv 203 non-null float64
realgovt 203 non-null float64
realdpi 203 non-null float64
cpi 203 non-null float64
m1 203 non-null float64
tbilrate 203 non-null float64
unemp 203 non-null float64
pop 203 non-null float64
infl 203 non-null float64
realint 203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()
year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
# build a quarterly PeriodIndex from the year/quarter columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
# keep only these three columns; the Index name 'item' becomes the columns' name
data = data.reindex(columns=columns)
# convert the periods to end-of-quarter timestamps
data.index = periods.to_timestamp('D', 'end')
# stack into long format: one row per (date, item) pair
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata[:10]
date | item | value | |
---|---|---|---|
0 | 1959-03-31 | realgdp | 2710.349 |
1 | 1959-03-31 | infl | 0.000 |
2 | 1959-03-31 | unemp | 5.800 |
3 | 1959-06-30 | realgdp | 2778.801 |
4 | 1959-06-30 | infl | 2.340 |
5 | 1959-06-30 | unemp | 5.100 |
6 | 1959-09-30 | realgdp | 2775.488 |
7 | 1959-09-30 | infl | 2.740 |
8 | 1959-09-30 | unemp | 5.300 |
9 | 1959-12-31 | realgdp | 2785.204 |
This is the so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys, offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value, indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
pivoted = ldata.pivot(index='date', columns='item', values='value')
pivoted[:5]
item | infl | realgdp | unemp |
---|---|---|---|
date | |||
1959-03-31 | 0.00 | 2710.349 | 5.8 |
1959-06-30 | 2.34 | 2778.801 | 5.1 |
1959-09-30 | 2.74 | 2775.488 | 5.3 |
1959-12-31 | 0.27 | 2785.204 | 5.6 |
1960-03-31 | 2.31 | 2847.699 | 5.2 |
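pivot here is essentially a shortcut: the same result can be built by creating a hierarchical index with set_index and then unstacking it. A sketch using ldata from above:
ldata.set_index(['date', 'item'])['value'].unstack('item')[:5]  # same as pivoted[:5]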
The index and columns arguments give the columns to use as the row and column index, respectively, while values names an optional column with which to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
date | item | value | value2 |
---|---|---|---|---|
0 | 1959-03-31 | realgdp | 2710.349 | -0.143460 |
1 | 1959-03-31 | infl | 0.000 | -0.422318 |
2 | 1959-03-31 | unemp | 5.800 | 0.389872 |
3 | 1959-06-30 | realgdp | 2778.801 | -0.208526 |
4 | 1959-06-30 | infl | 2.340 | -1.538956 |
5 | 1959-06-30 | unemp | 5.100 | -0.143273 |
6 | 1959-09-30 | realgdp | 2775.488 | 0.385763 |
7 | 1959-09-30 | infl | 2.740 | 0.564365 |
8 | 1959-09-30 | unemp | 5.300 | 0.266295 |
9 | 1959-12-31 | realgdp | 2785.204 | -1.267871 |
By omitting the values argument, you obtain a DataFrame with hierarchical columns:
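The cell producing that output appears to be missing here; a minimal sketch of what it would look like, using the ldata frame above (now holding both value and value2):
pivoted = ldata.pivot(index='date', columns='item')  # omit values= entirely
pivoted[:5]           # columns are a MultiIndex: (value, item) and (value2, item)
pivoted['value'][:5]  # selecting 'value' recovers the earlier single-value layout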
Pivoting "Wide" to "Long" Format
An inverse operation to pivot for DataFrame is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let's look at an example:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df
key | A | B | C | |
---|---|---|---|---|
0 | foo | 1 | 4 | 7 |
1 | bar | 2 | 5 | 8 |
2 | baz | 3 | 6 | 9 |
The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use 'key' as the only group indicator here:
melted = pd.melt(df, id_vars=['key'])
melted
key | variable | value | |
---|---|---|---|
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
6 | foo | C | 7 |
7 | bar | C | 8 |
8 | baz | C | 9 |
Using pivot, we can reshape back to the original layout:
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped
variable | A | B | C |
---|---|---|---|
key | |||
bar | 2 | 5 | 8 |
baz | 3 | 6 | 9 |
foo | 1 | 4 | 7 |
Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:
reshaped.reset_index()
variable | key | A | B | C |
---|---|---|---|---|
0 | bar | 2 | 5 | 8 |
1 | baz | 3 | 6 | 9 |
2 | foo | 1 | 4 | 7 |
You can also specify a subset of columns to use as value columns:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
key | variable | value | |
---|---|---|---|
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
pandas.melt can be used without any group identifiers, too:
pd.melt(df, value_vars=['A', 'B', 'C'])
variable | value | |
---|---|---|
0 | A | 1 |
1 | A | 2 |
2 | A | 3 |
3 | B | 4 |
4 | B | 5 |
5 | B | 6 |
6 | C | 7 |
7 | C | 8 |
8 | C | 9 |
pd.melt(df, value_vars=['key', 'A', 'B'])  # non-numeric columns can be melted too; the value column becomes object dtype
variable | value | |
---|---|---|
0 | key | foo |
1 | key | bar |
2 | key | baz |
3 | A | 1 |
4 | A | 2 |
5 | A | 3 |
6 | B | 4 |
7 | B | 5 |
8 | B | 6 |
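If the default 'variable' and 'value' column names don't suit you, melt also accepts var_name and value_name arguments; a small sketch (the names 'letter' and 'amount' are just illustrative):
pd.melt(df, id_vars=['key'], var_name='letter', value_name='amount')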
Summary
Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advanced analytics.