Combining Data in pandas

import numpy as np
import pandas as pd

Data contained in pandas objects can be combined together in a number of ways:

pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
pandas.concat concatenates or "stacks" together objects along an axis.
The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

I will address each of these and give a number of examples. They'll be utilized in examples throughout the rest of the book.
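
As a quick orientation before the details, here is a minimal sketch of the three operations side by side. The small frames and Series are hypothetical toy data made up for illustration; they are not taken from the book.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to show the three operations together
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# 1. merge: link rows through the shared 'key' column (inner join by default)
merged = pd.merge(left, right, on='key')

# 2. concat: stack objects along an axis (here, end to end along the index)
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
stacked = pd.concat([s1, s2])

# 3. combine_first: fill the missing values of one object from another
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([0.0, 1.0, 2.0], index=['x', 'y', 'z'])
patched = a.combine_first(b)

print(merged)
print(stacked)
print(patched)
```

Only the keys present in both frames ('a' and 'b') survive the merge, the two Series are simply laid end to end, and the NaN entries in a are filled from b.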

SQL-Style Joins

Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based ones). The merge function in pandas is the main entry point for using these algorithms on your data.

Let's start with a simple example:

df1 = pd.DataFrame({
    'key': 'b, b, a, c, a, a, b'.split(','),
    'data1': range(7)
})
df2 = pd.DataFrame({
    'key': ['a', 'b', 'd'],
    'data2': range(3)
})
df1
df2

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

	key	data2
0	a	0
1	b	1
2	d	2

This is an example of a many-to-one join: the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects, we obtain:

# merge performs an inner join by default; with no key given, it joins on the shared column names
pd.merge(df1, df2)  # columns: key, data1, data2

	key	data1	data2
0	b	0	1

Note that I didn't specify which column to join on. If that information is not specified, merge uses the overlapping column names as the keys. It's good practice to specify them explicitly, though:

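(cj: my output differs from the author's. The key string in df1 is split on ',' alone, so every key except the first carries a leading space, and only the first 'b' row finds a match in df2.)

# Inner join: only rows whose keys appear in both tables are kept
pd.merge(df2, df1, on='key')  # columns: key, data2, data1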

	key	data2	data1
0	b	1	0
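
The one-row result above traces back to whitespace in the keys of df1. The sketch below rebuilds df1 twice, once with the original split(',') and once with split(', '), and compares how many rows the inner join returns. The names raw, df1_spaces, and df1_clean are introduced here for illustration only.

```python
import pandas as pd

raw = 'b, b, a, c, a, a, b'

# Splitting on ',' alone keeps the space that follows each comma
keys_with_spaces = raw.split(',')
# Splitting on ', ' (comma plus space) yields clean one-letter keys
keys_clean = raw.split(', ')

df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df1_spaces = pd.DataFrame({'key': keys_with_spaces, 'data1': range(7)})
df1_clean = pd.DataFrame({'key': keys_clean, 'data1': range(7)})

print(keys_with_spaces)  # note the leading space on all but the first key

n_spaces = len(pd.merge(df1_spaces, df2, on='key'))
n_clean = len(pd.merge(df1_clean, df2, on='key'))
print(n_spaces, n_clean)

# An existing column can also be cleaned after the fact with the .str accessor
df1_fixed = df1_spaces.assign(key=df1_spaces['key'].str.strip())
n_fixed = len(pd.merge(df1_fixed, df2, on='key'))
print(n_fixed)
```

With clean keys, all three 'a' rows and all three 'b' rows of df1 match, which is the many-to-one behavior described earlier. Stripping whitespace from key columns before merging is a cheap safeguard when keys come from parsed text.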

# cj test

pd.merge(df1, df2, on='key', how='left')

	key	data1	data2
0	b	0	1.0
1	b	1	NaN
2	a	2	NaN
3	c	3	NaN
4	a	4	NaN
5	a	5	NaN
6	b	6	NaN

If the column names are different in each object, you can specify them separately with left_on and right_on:

df3 = pd.DataFrame({
    'lkey': 'a b a c a a b'.split(),
    'data1': range(7)
})
df4 = pd.DataFrame({
    'rkey': ['a', 'b', 'd'],
    'data2': range(3)
})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

	lkey	data1	rkey	data2
0	a	0	a	0
1	a	2	a	0
2	a	4	a	0
3	a	5	a	0
4	b	1	b	1
5	b	6	b	1
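
Because the two key columns have different names, the result above keeps both lkey and rkey, and for an inner join they always hold the same value. A minimal sketch of tidying this up afterwards follows; the names merged and tidy, and the choice to rename the surviving column to 'key', are illustrative assumptions rather than anything prescribed by pandas.

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': 'a b a c a a b'.split(), 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')

# In an inner join the two key columns are identical row by row
same = bool((merged['lkey'] == merged['rkey']).all())

# Drop the redundant column and give the remaining one a neutral name
tidy = merged.drop(columns='rkey').rename(columns={'lkey': 'key'})
print(tidy)
```

For left, right, or outer joins the two key columns can differ (one side may be NaN), so check before dropping either of them.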

You may notice that the 'c' and 'd' values and their associated data are missing from the result. By default, merge does an inner join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins.

# Outer join: keep every key that appears in either table
pd.merge(df1, df2, how='outer')

	key	data1	data2
0	b	0.0	1.0
1	b	1.0	NaN
2	b	6.0	NaN
3	a	2.0	NaN
4	a	4.0	NaN
5	a	5.0	NaN
6	c	3.0	NaN
7	a	NaN	0.0
8	d	NaN	2.0

See Table 8-1 for a summary of the options for how.

Option	Behavior
'inner'	Use only the key combinations observed in both tables
'left'	Use all key combinations found in the left table
'right'	Use all key combinations found in the right table
'outer'	Use all key combinations observed in both tables together
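
The four options in the table can be compared directly by running the same merge under each one. This sketch deliberately builds df1 from a plain list of clean keys (no stray whitespace), so its row counts differ from the outputs shown earlier. It also uses merge's indicator=True option, which adds a _merge column recording where each row came from.

```python
import pandas as pd

# Clean keys, so every 'a' and 'b' row in df1 can find its match in df2
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

key_sets = {}
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(df1, df2, on='key', how=how)
    key_sets[how] = sorted(result['key'].unique())
    row_counts[how] = len(result)
    print(how, key_sets[how], row_counts[how])

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']])
```

The inner join keeps only 'a' and 'b', the left join adds 'c', the right join adds 'd', and the outer join keeps all four keys.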

Many-to-many merges have well-defined, though not necessarily intuitive, behavior. Here's an example:

df1 = pd.DataFrame({
    'key': 'b b a c a b'.split(),
    'data1': range(6)
})
df2 = pd.DataFrame({
    'key': 'a b a b d'.split(),
    'data2': range(5)
})
df1
df2

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5

	key	data2
0	a	0
1	b	1
2	a	2
3	b	3
4	d	4

pd.merge(df1, df2, how='inner')

	key	data1	data2
0	b	0	1
1	b	0	3
2	b	1	1
3	b	1	3
4	b	5	1
5	b	5	3
6	a	2	0
7	a	2	2
8	a	4	0
9	a	4	2

pd.merge(df1, df2, on='key', how='left')

	key	data1	data2
0	b	0	1.0
1	b	0	3.0
2	b	1	1.0
3	b	1	3.0
4	a	2	0.0
5	a	2	2.0
6	c	3	NaN
7	a	4	0.0
8	a	4	2.0
9	b	5	1.0
10	b	5	3.0
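
A many-to-many join forms the Cartesian product of the matching rows: a key that occurs m times on the left and n times on the right yields m * n rows. The sketch below checks that against the inner join shown above; the names left_counts, right_counts, expected, and actual are introduced for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': 'b b a c a b'.split(), 'data1': range(6)})
df2 = pd.DataFrame({'key': 'a b a b d'.split(), 'data2': range(5)})

# How often each key occurs on each side
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Multiplication aligns on the key; keys missing from one side become NaN
expected = (left_counts * right_counts).dropna().astype(int)

# Rows per key actually produced by the inner join
actual = pd.merge(df1, df2, on='key', how='inner')['key'].value_counts()

print(expected.to_dict())
print(actual.to_dict())
```

Here 'b' appears three times in df1 and twice in df2, giving six rows, while 'a' appears twice in each, giving four. That accounts for the ten rows of the inner join.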

To merge with multiple keys, pass a list of column names:

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
left
right

	key1	key2	lval
0	foo	one	1
1	foo	two	2
2	bar	one	3

	key1	key2	rval
0	foo	one	4
1	foo	one	5
2	bar	one	6
3	bar	two	7

# Outer join on multiple keys: keep every key combination from either table
pd.merge(left, right, on=['key1', 'key2'], how='outer')

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key.
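
The tuple view can be made concrete with plain Python sets. This is only a sketch of the mental model, not a description of how pandas implements the join internally; the names left_keys and right_keys are introduced for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as one composite key
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))

print(sorted(left_keys & right_keys))  # keys an inner join keeps
print(sorted(left_keys | right_keys))  # keys an outer join keeps

# Check the model against the actual merges
inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')
inner_keys = set(zip(inner['key1'], inner['key2']))
outer_keys = set(zip(outer['key1'], outer['key2']))
```

The intersection and union of the two sets match the key combinations that appear in the inner and outer merge results respectively.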

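When you are joining columns-on-columns, the indexes on the passed DataFrame objects are discarded.

A last issue to consider is overlapping column names. When both tables contain a non-key column with the same name, merge appends suffixes to tell them apart: '_x' and '_y' by default, or the strings passed in the suffixes option: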

pd.merge(left, right, on='key1')

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

	key1	key2_left	lval	key2_right	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

See Table 8-2 for an argument reference on merge. Joining using the DataFrame's row index is the subject of the next section.

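Argument	Description
left	DataFrame to be merged on the left side
right	DataFrame to be merged on the right side
how	Type of join, such as 'inner' (the default), 'left', 'right', or 'outer'
on	Column names to join on; must be found in both DataFrames
left_on	Columns in the left DataFrame to use as join keys
right_on	Columns in the right DataFrame to use as join keys
left_index	Use the row index of the left DataFrame as its join key
right_index	Use the row index of the right DataFrame as its join key
sort	Sort the merged data by the join keys
suffixes	Strings appended to overlapping column names (default ('_x', '_y'))
copy	Whether to copy the data into the result
indicator	Add a _merge column recording whether each row came from the left table, the right table, or both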
Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:

left1 = pd.DataFrame({
    'key': ['a', 'b', 'a', 'a', 'b', 'c'],
    'value': range(6)
})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
right1

	key	value
0	a	0
1	b	1
2	a	2
3	a	3
4	b	4
5	c	5

	group_val
a	3.5
b	7.0

pd.merge(left1, right1, left_on='key', right_index=True)

	key	value	group_val
0	a	0	3.5
2	a	2	3.5
3	a	3	3.5
1	b	1	7.0
4	b	4	7.0
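
The result above drops the 'c' row because the default is still an inner join. The sketch below shows two ways to keep it: passing how='outer' to merge, or using the DataFrame.join method, which does a left join against the other object's index by default. The variable names inner, outer, and joined are introduced for illustration.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Default inner join: the 'c' row has no match in right1's index and is dropped
inner = pd.merge(left1, right1, left_on='key', right_index=True)

# Outer join: the 'c' row is kept, with NaN for group_val
outer = pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

# join matches left1's 'key' column against right1's index (left join by default)
joined = left1.join(right1, on='key')

print(inner)
print(outer)
print(joined)
```

join keeps the rows and index of left1 unchanged, which makes it a convenient choice for VLOOKUP-style lookups against an indexed table.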

Concatenating Along an Axis (Horizontally or Vertically)

Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function can do this with NumPy arrays:

arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

# Concatenate horizontally (side by side, along the columns)
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

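I won't go any further here. In my own work, the operations I use most are still merge and join, for VLOOKUP-style table lookups. Beyond those there is vertical and horizontal concatenation with pd.concat(), np.vstack(), and np.hstack(); combined with SQL, these are very flexible and efficient.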
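
For reference, here is a minimal sketch of the three concatenation functions just mentioned, applied to the same 3x4 array as above. The column labels and the 'r_' prefix are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape((3, 4))

# NumPy: vstack stacks rows (axis=0), hstack stacks columns (axis=1)
v = np.vstack([arr, arr])   # shape (6, 4)
h = np.hstack([arr, arr])   # shape (3, 8)

# pandas: concat does the same for labeled data
df = pd.DataFrame(arr, columns=list('abcd'))

# Vertical: append rows; ignore_index=True renumbers the result 0..5
rows = pd.concat([df, df], ignore_index=True)

# Horizontal: align on the row index; a prefix keeps the column names distinct
cols = pd.concat([df, df.add_prefix('r_')], axis=1)

print(v.shape, h.shape)
print(rows.shape, cols.shape)
```

Unlike the NumPy functions, pd.concat aligns on labels, so when concatenating along axis=1 the rows are matched by index value rather than by position.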