Python Data Analysis: A Comprehensive Guide to Pandas Operations

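Everything here was typed by hand, and every example in the text was run in PyCharm. The biggest benefit of compiling my own notes is that I can build the structure around my own way of thinking, so that when I need this material later I can understand and apply it as quickly as possible =_=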

Note: for convenience, throughout this chapter s denotes an instance of pandas.core.series.Series and df denotes an instance of pandas.core.frame.DataFrame.

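1. Introduction to Pandas

Pandas is a Python data analysis library built on NumPy. It was originally developed as a tool for financial data analysis, which is why it offers strong support for time series analysis. The name Pandas comes from "panel data" and "Python data analysis". Panel data is an economics term for multidimensional data sets, and Pandas used to provide a Panel data type as well (note: this type was deprecated in pandas 0.20 and removed in 0.25).

Official site: https://pandas.pydata.org/

Pandas is built on top of NumPy, so it is customary to import NumPy alongside pandas: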

import numpy as np
import pandas as pd

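2. The three data types in Pandas

Pandas has historically had three data structures: Series, DataFrame, and Panel.

Data structure	Dimensions	Description
Series	1
DataFrame	2	A container of Series
Panel	3	A container of DataFrames (note: the Panel type has been removed from recent versions of Pandas)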

3. Series

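A Series is a one-dimensional data structure with one column of index and one column of values. Every Series is an instance of pandas.core.series.Series.

This section only covers Series with a one-level index; for Series with a multi-level index, see "13. Hierarchical indexing" in this chapter.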

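(1) Creating a Series

Syntax: pd.Series(data=None, index=None, dtype=None, name=None)

Parameters:

data: the data. It can be a one-dimensional list, a dict, a range(), or a one-dimensional numpy.ndarray. The default is None, which creates an empty Series([]).
index: the index. The default is None (when data is not a dict, the default index is 0, 1, 2, ...; when data is a dict, the default index is the dict's keys).

A list, tuple, range(), or numpy.ndarray can be used as a custom index. Notes:

① When data is a list or numpy.ndarray, the length of index must equal the length of data; otherwise an error is raised.

② When data is a dict, index does not relabel the values; it selects entries by key. Labels that match a key keep that key's value, and labels that match no key get NaN. That is why passing entirely new labels (as in the dict example further down) yields a Series of all NaN. To relabel, build the Series first and then assign s.index = [...].

③ The default index (0, 1, 2, ...) is referred to here as position, and a custom index as label. Without a custom index, values are accessed by position. With a custom index, s[] accepts labels and positions remain available through s.iloc[] (plain s[int] on a labeled Series is deprecated in pandas 2.x).

dtype: the data type
name: the name of the Series

Note: unlike DataFrame, Series has no columns parameter!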

# Create a Series from a list and from a numpy.ndarray
import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3])
s2 = pd.Series(np.array([1,2,3]), index=['a','b','c'], name='MySeries')
print(s1); print('===========')
print(type(s1)); print('===========')
print(s2); print('===========')
print(type(s2))
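Output (the int32 dtype of s2 comes from NumPy's default integer type on the author's platform; on many systems it is int64):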
0    1
1    2
2    3
dtype: int64
===========
<class 'pandas.core.series.Series'>
===========
a    1
b    2
c    3
Name: MySeries, dtype: int32
===========
<class 'pandas.core.series.Series'>

# Create a Series from a dict
import numpy as np
import pandas as pd
s1 = pd.Series({'a':1,'b':2,'c':3})
s2 = pd.Series({'a':1,'b':2,'c':3}, index=['A','B','C'])		# pitfall: none of these labels matches a dict key, so every value is NaN
print(s1); print('===========')
print(type(s1)); print('===========')
print(s2); print('===========')
print(type(s2))
Output:
a    1
b    2
c    3
dtype: int64
===========
<class 'pandas.core.series.Series'>
===========
A   NaN
B   NaN
C   NaN
dtype: float64
===========
<class 'pandas.core.series.Series'>
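
The dict case above is easy to misread as the index overwriting the keys. As a minimal sketch (the variable names picked and relabeled are illustrative, not from the original), the following shows that index selects entries by key when data is a dict, and that relabeling is done by assigning s.index afterwards:

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# index picks entries by key: 'c' and 'a' are found, 'x' is not (NaN)
picked = pd.Series(data, index=['c', 'a', 'x'])
print(picked)

# to relabel instead, build the Series first and then replace its index
relabeled = pd.Series(data)
relabeled.index = ['A', 'B', 'C']
print(relabeled)
```

Because of the missing key, picked is upcast to a float dtype, which matches the float64 seen in the all-NaN result above.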

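(2) Vectorization and broadcasting with Series

① Vectorization: operations between a Series and a one-dimensional object

This requires the one-dimensional object (a list, a numpy.ndarray, ...) to have the same length as the Series; otherwise an error is raised. When the lengths match, the operation is applied element-wise: elements at the same position are combined by the given operator, and the result is a Series with the same index and the same length as the original.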

import numpy as np
import pandas as pd
s = pd.Series([10,11,12],index=['a','b','c'])
arr = np.array([9,11,13])
print(s + arr)
print('===========')
print(s > arr)
Output:
a    19
b    22
c    25
dtype: int64
===========
a     True
b    False
c    False
dtype: bool

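② Broadcasting: operations between a Series and a number

For operations between a Series and a number, such as +, -, *, /, **, //, %, >, <, >=, <=, ==, and !=, every value in the Series is combined with that number, and the results form a Series with the same structure as the original.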

import numpy as np
import pandas as pd
s = pd.Series([10,11,12],index=['a','b','c'])
print(s+2)
print('===========')
print(s>11)
Output:
a    12
b    13
c    14
dtype: int64
===========
a    False
b    False
c     True
dtype: bool

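③ Operations between a Series and a multi-dimensional object (not supported)

A Series cannot be combined with a multi-dimensional object; that is, a Series will not broadcast against a multi-dimensional numpy.ndarray.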

import numpy as np
import pandas as pd
arr = np.array([[9,11,13],[8,15,10],[7,6,16]])
s = pd.Series([10,11,12])
print(arr + s)
print(arr > s)
Output: an error is raised
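
If the goal was to combine the Series values with every row of the two-dimensional array, one possible workaround (an assumption about the intent, not something the original states) is to convert the Series to a plain NumPy array first and let NumPy do the broadcasting:

```python
import numpy as np
import pandas as pd

arr = np.array([[9, 11, 13], [8, 15, 10], [7, 6, 16]])
s = pd.Series([10, 11, 12])

# s.to_numpy() is a plain 1-D ndarray, so NumPy broadcasts it across each row of arr
row_sum = arr + s.to_numpy()
row_gt = arr > s.to_numpy()

print(row_sum)
print(row_gt)
```

The results are ordinary ndarrays, so the Series index is not carried along.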

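(3) Indexing and slicing a Series

s[0]: position-based indexing (on a Series with a custom label index this form is deprecated in pandas 2.x; prefer s.iloc[0])

s['a']: label-based indexing

s[1:3]: position-based slicing; the end point is excluded

s['b':'d']: label-based slicing; both end points are included

s[s>5]: broadcasting first produces a boolean Series, which then selects the items whose value is True to build a new Series (similar to boolean indexing)

s[[3,1,2]], s[['e','b','d']]: non-contiguous indexing with a list (similar to fancy indexing); for positions, prefer s.iloc[[3,1,2]]

s.loc[]: used like df.loc[]

s.iloc[]: used like df.iloc[]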

import numpy as np
import pandas as pd
s = pd.Series(range(10,15),index=['a','b','c','d','e'])
print(s); print('===========')
print(s.iloc[2]); print('===========')		# by position; s[2] is deprecated on a labeled Series
print(s['c']); print('===========')
print(s[1:3]); print('===========')			# end point excluded
print(s['b':'d']); print('===========')		# both end points included
print(s[s>12]); print('===========')
print(s.iloc[[3,1,2]]); print('===========')	# by position; s[[3,1,2]] is deprecated on a labeled Series
print(s[['e','b','d']])
Output:
a    10
b    11
c    12
d    13
e    14
dtype: int64
===========
12
===========
12
===========
b    11
c    12
dtype: int64
===========
b    11
c    12
d    13
dtype: int64
===========
d    13
e    14
dtype: int64
===========
d    13
b    11
c    12
dtype: int64
===========
e    14
b    11
d    13
dtype: int64
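
Because s[] has to guess between labels and positions, the explicit accessors are often clearer. A small sketch on the same data as above, using s.loc[] for labels and s.iloc[] for positions (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(range(10, 15), index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']            # label lookup
by_position = s.iloc[2]          # position lookup (0-based)
label_slice = s.loc['b':'d']     # label slice: both end points included
position_slice = s.iloc[1:3]     # position slice: end point excluded
fancy = s.iloc[[3, 1, 2]]        # positions in arbitrary order

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist(), fancy.tolist())
```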

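(4) Common Series attributes

Note: Series has no columns attribute!

s.values: returns all values of the Series as a numpy.ndarray

s.index: returns the index of the Series, of type pandas.core.indexes.base.Index (or RangeIndex when no custom index was set)

s.name: the name of the Series (can be reassigned)

s.index.name: the name of the index (can be reassigned)

(5) Common Series methods

s.__len__() and len(s): return the length of s (an int)

s.apply(func) and s.map(func): pass each element of s to func, call it, and collect the return values into a new Series with the same structure, which is the return value of s.apply() or s.map(). See Example 2 below. For a comparison of apply(), applymap(), and map(), see "II. The Pandas module - 10. Methods of DataFrame objects and of the Pandas module - (5) Other important methods - ② df.applymap()" in this chapter

s1.corr(s2): computes the Pearson correlation coefficient of two Series; returns a float

s1.cov(s2): computes the covariance of two Series; returns a float

s.dropna(): returns s without its NaN entries

s.head(n): returns a Series made of at most the first n index/value pairs of s; n defaults to 5. Intended for quick previews; does not modify s

s.idxmin() and s.idxmax(): return the index label of the minimum (maximum) value of s. Note: s.argmin() and s.argmax() are not the same thing; in pandas 1.0 and later they return the integer position of the minimum (maximum), not its label

s.isin(list): tests whether each element of s is in list; returns a boolean Series with the same structure as s

s.isna() and s.isnull(): analogous to df.isna() and df.isnull()

s.notna(): analogous to df.notna()

s.ptp(): computed the range of s (maximum minus minimum). Note: this method was removed in pandas 1.0; use s.max() - s.min() or np.ptp(s.values) instead (DataFrame has no such method either)

s.replace('old value','new value',inplace=False): replaces values in s. To perform several replacements at once, pass a dict that maps old values to new values, i.e. s.replace({'old1':'new1','old2':'new2'...},inplace=False)

s.sort_values(): sorts by value, like df.sort_values(); since a Series has only one column, no by= is needed

s.str.<string method>(): applies the given string method to every string in s and returns a new Series; see Example 3 below

s.tail(n): returns a Series made of at most the last n index/value pairs of s; n defaults to 5. Intended for quick previews; does not modify s

s.tolist(): converts s to a list (s itself is not modified, so assign the result to a variable) (Note: DataFrame has no such method!)

s.unstack(): reshapes a Series with a hierarchical index (converting between row labels and column labels); see "13. Hierarchical indexing - (3) Reshaping a hierarchically indexed Series with unstack(), stack(), and DataFrame" in this chapter

s.value_counts(): counts how many times each value occurs in s; returns a Series (Note: numpy.ndarray has no such method; DataFrame gained df.value_counts() in pandas 1.1)

s.var(): computes the variance of the Series; returns a float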

# Example 1
import numpy as np
import pandas as pd
s = pd.Series([10,12,11,11,12],index=['a','b','c','d','e'],name='旧名字')
s.name='新名字'
s.index.name = '索引'
print(s); print('===========')
print(s.values,type(s.values)); print('===========')
print(s.index,type(s.index)); print('===========')
print(s.name); print('===========')
print(s.index.name); print('===========')
print(s.head(2)); print('===========')
print(s.tail(2)); print('===========')
print(s.__len__(),len(s),type(len(s))); print('===========')
print(s.tolist(),type(s.tolist())); print('===========')
print(s.value_counts(),type(s.value_counts())); print('===========')
print(s.isin([5,6,7,11,15,16,17]),type(s.isin([5,6,7,11,15,16,17]))); print('===========')
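print(s.max() - s.min())		# range of s; s.ptp() was removed in pandas 1.0
Output (on pandas 2.x, value_counts() names its result 'count' instead of reusing the Series name):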
索引
a    10
b    12
c    11
d    11
e    12
Name: 新名字, dtype: int64
===========
[10 12 11 11 12] <class 'numpy.ndarray'>
===========
Index(['a', 'b', 'c', 'd', 'e'], dtype='object', name='索引') <class 'pandas.core.indexes.base.Index'>
===========
新名字
===========
索引
===========
索引
a    10
b    12
Name: 新名字, dtype: int64
===========
索引
d    11
e    12
Name: 新名字, dtype: int64
===========
5 5 <class 'int'>
===========
[10, 12, 11, 11, 12] <class 'list'>
===========
12    2
11    2
10    1
Name: 新名字, dtype: int64 <class 'pandas.core.series.Series'>
===========
索引
a    False
b    False
c     True
d     True
e    False
Name: 新名字, dtype: bool <class 'pandas.core.series.Series'>
===========
2
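
Two of the methods listed above behave differently on current pandas: s.ptp() is gone, and s.argmax() returns a position rather than a label. A hedged sketch of the replacements, assuming pandas 1.0 or later (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 11, 12], index=['a', 'b', 'c', 'd', 'e'])

value_range = s.max() - s.min()        # replacement for the removed s.ptp()
value_range_np = np.ptp(s.to_numpy())  # the same quantity via NumPy

max_label = s.idxmax()     # label of the first maximum
max_position = s.argmax()  # integer position of the first maximum

print(value_range, value_range_np, max_label, max_position)
```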

# Example 2: Series.apply(func) and Series.map(func)
import numpy as np
import pandas as pd
s1 = pd.Series([10,20,30], index=['t1','t2','t3'])
s2 = s1.apply(lambda x:x+1)
s3 = s1.map(lambda x:x+2)
print(s1); print('===========')
print(s2); print('===========')
print(s3)
Output:
t1    10
t2    20
t3    30
dtype: int64
===========
t1    11
t2    21
t3    31
dtype: int64
===========
t1    12
t2    22
t3    32
dtype: int64

# Example 3: s.str.<string method>()
import numpy as np
import pandas as pd
s1 = pd.Series(['a_b','c_d'],index=['t1','t2'])
s2 = s1.str.replace('_','')
s3 = s1.str.startswith('a')
print(s1); print('===========')
print(s2); print('===========')
print(s3)
Output:
t1    a_b
t2    c_d
dtype: object
===========
t1    ab
t2    cd
dtype: object
===========
t1     True
t2    False
dtype: bool

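4. Creating a DataFrame

A DataFrame is a two-dimensional data structure with one column of index and several columns of values. Every DataFrame is an instance of pandas.core.frame.DataFrame.

This section only covers DataFrames with a one-level index; for DataFrames with a multi-level index, see "13. Hierarchical indexing" in this chapter.

Syntax: pd.DataFrame(data=None, index=None, columns=None, dtype=None)

Parameters:

data: the data. It can be a dict, a one- or two-dimensional list, or a one- or two-dimensional numpy.ndarray. When data is a one-dimensional list or numpy.ndarray of length n, pd.DataFrame() turns it into a column vector with n rows and 1 column (see the example under "(6) df.shape" in "5. DataFrame attributes"). The default is None, which creates an empty DataFrame.
index: the row index. The default is None (the default row index is 0, 1, 2, ...).

A list can be used as a custom row index. Notes:

① The length of index must equal the number of rows in data; otherwise an error is raised.

② When data is a dict, be careful with index inside pd.DataFrame(): every column whose value is a Series is aligned to index by label, so it becomes NaN wherever its own index lacks the requested labels (columns given as lists or arrays are unaffected). There are two safe approaches: assign df.index=[...] on a separate line after df has been created, or make every dict value a Series and give each of them its own index=[...].

③ The default index (0, 1, 2, ...) is referred to here as position, and a custom index as label. Without a custom index, rows are selected by position; with a custom index, rows can be selected both by position (df.iloc[]) and by label (df.loc[]).

columns: the column index. The default is None (when data is not a dict, the default column index is 0, 1, 2, ...; when data is a dict, the default column index is the dict's keys).

A list can be used as a custom column index. Notes:

① When data is not a dict, the length of columns must equal the number of columns in data; otherwise an error is raised.

② When data is a dict, columns does not rename anything; it selects dict keys. Passing labels that match no key therefore produces an empty DataFrame (see the fourth approach in the dict example further down). Since the dict keys already serve as the columns, rename them afterwards with df.columns=[...] if needed.

③ The default columns (0, 1, 2, ...) are referred to here as position, and custom columns as label. Without custom columns, columns are selected by position; with custom columns, they can be selected both by position (df.iloc[]) and by label.
dtype: the data type; 'f' stands for a float type and 'i' for an int type (these are NumPy type codes)

Note: unlike Series, DataFrame has no name parameter!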

# Create a DataFrame from a list and from a numpy.ndarray
import numpy as np
import pandas as pd
li = [[44, 55, 66],[77, 88, 99]]
df1 = pd.DataFrame(li,columns=['c1','c2','c3'],index=['t1','t2'])
arr = np.array([[44, 55, 66],[77, 88, 99]])
df2 = pd.DataFrame(arr,columns=['c1','c2','c3'],index = ['t1','t2'])
print(df1); print('===========')
print(df2); print('==========='); print(type(df2))
Output:
    c1  c2  c3
t1  44  55  66
t2  77  88  99
===========
    c1  c2  c3
t1  44  55  66
t2  77  88  99
===========
<class 'pandas.core.frame.DataFrame'>

# Create a DataFrame from a dict
import numpy as np
import pandas as pd
dic1 = {
    'A': [30,32],
    'B': np.array([42,38]),
    'C': pd.Series([55,56]),
}
dic2 = {
    'A': pd.Series([30,32], index=['t1','t2']),
    'B': pd.Series([42,38], index=['t1','t2']),
    'C': pd.Series([55,56], index=['t1','t2']),
}
# Approach 1 (works): assign the index after construction
df1 = pd.DataFrame(dic1)
df1.index=['t1','t2']
# Approach 2 (works): every value is a Series that already carries the index
df2 = pd.DataFrame(dic2)
# Approach 3 (pitfall): column 'C' is a Series indexed 0, 1, so aligning it to ['t1','t2'] gives NaN
df3 = pd.DataFrame(dic1,index=['t1','t2'])
# Approach 4 (pitfall): 'c1' and 'c2' match no dict key, so the result is empty
df4 = pd.DataFrame(dic1,columns=['c1','c2'])
print('Approach 1\n',df1); print('===========')
print('Approach 2\n',df2); print('===========')
print('Approach 3 (pitfall)\n',df3); print('===========')
print('Approach 4 (pitfall)\n',df4)
Output:
Approach 1
      A   B   C
t1  30  42  55
t2  32  38  56
===========
Approach 2
      A   B   C
t1  30  42  55
t2  32  38  56
===========
Approach 3 (pitfall)
      A   B   C
t1  30  42 NaN
t2  32  38 NaN
===========
Approach 4 (pitfall)
 Empty DataFrame
Columns: [c1, c2]
Index: []
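
The two pitfalls above come from label alignment rather than from index or columns being forbidden with a dict. As a sketch (dic, df_rows, and df_cols are made-up names), index behaves as expected when the dict values are plain lists or arrays, and columns can select and reorder existing keys:

```python
import numpy as np
import pandas as pd

dic = {'A': [30, 32], 'B': np.array([42, 38]), 'C': [55, 56]}

# lists and arrays carry no index of their own, so index simply labels the rows
df_rows = pd.DataFrame(dic, index=['t1', 't2'])

# columns picks keys from the dict (here a reordered subset)
df_cols = pd.DataFrame(dic, columns=['C', 'A'])

print(df_rows)
print(df_cols)
```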

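5. DataFrame attributes

(1) df.columns

Returns the column index of df:

If df.columns has never been customized, the returned type is pandas.core.indexes.range.RangeIndex
If df.columns has been customized, the returned type is pandas.core.indexes.base.Index

df.columns supports indexing and slicing:

When indexing a single element of df.columns (no colon inside the brackets of df.columns[]):
- if df.columns has not been customized, an int position is returned
- if df.columns has been customized, a str label (or a label of some other type) is returned
When slicing df.columns (a colon inside the brackets of df.columns[], whatever the length of the slice, even a single item):
- if df.columns has not been customized, a pandas.core.indexes.range.RangeIndex of positions is returned
- if df.columns has been customized, a pandas.core.indexes.base.Index of labels is returned

df.columns can be reassigned as a whole, but its individual elements cannot be modified (an error is raised)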

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,index=['t1','t2','t3','t4'])		# columns is not specified when df is created
print('df\n',df); print('===========')
print('df.columns\n',df.columns,'\n',type(df.columns)); print('===========')
print('df.columns[2]\n',df.columns[2],'\n',type(df.columns[2])); print('===========')
print('df.columns[2:3]\n',df.columns[2:3],'\n',type(df.columns[2:3])); print('===========')
# df.columns can be reassigned as a whole
df.columns=['c1','c2','c3','c4']
print('df\n',df,'\n',type(df)); print('===========')
print('df.columns\n',df.columns,'\n',type(df.columns)); print('===========')
print('df.columns[2]\n',df.columns[2],'\n',type(df.columns[2])); print('===========')
print('df.columns[2:3]\n',df.columns[2:3],'\n',type(df.columns[2:3])); print('===========')
# individual elements of df.columns cannot be modified (raises TypeError)
df.columns[1]='New'
print(df)
Output:
df
      0   1   2   3
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
df.columns
 RangeIndex(start=0, stop=4, step=1)
 <class 'pandas.core.indexes.range.RangeIndex'>
===========
df.columns[2]
 2
 <class 'int'>
===========
df.columns[2:3]
 RangeIndex(start=2, stop=3, step=1)
 <class 'pandas.core.indexes.range.RangeIndex'>
===========
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
 <class 'pandas.core.frame.DataFrame'>
===========
df.columns
 Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
 <class 'pandas.core.indexes.base.Index'>
===========
df.columns[2]
 c3
 <class 'str'>
===========
df.columns[2:3]
 Index(['c3'], dtype='object')
 <class 'pandas.core.indexes.base.Index'>
===========
Error raised (TypeError: Index does not support mutable operations)
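
Since individual entries of df.columns cannot be assigned, a single label is usually changed with df.rename() instead. A minimal sketch (the new label 'New' mirrors the failing assignment above; the other names are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# rename returns a new DataFrame; labels not mentioned in the mapping are kept
renamed = df.rename(columns={'c2': 'New'})
print(renamed.columns.tolist())

# the same method renames row labels through index=
renamed_rows = df.rename(index={'t1': 'first'})
print(renamed_rows.index.tolist())
```

The original df is left untouched, so assign the result back if the change should stick.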

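(2) df.index

Returns the row index of df:

If df.index has never been customized, the returned type is pandas.core.indexes.range.RangeIndex
If df.index has been customized, the returned type is pandas.core.indexes.base.Index

df.index supports indexing and slicing:

When indexing a single element of df.index (no colon inside the brackets of df.index[]):
- if df.index has not been customized, an int position is returned
- if df.index has been customized, a str label (or a label of some other type) is returned
When slicing df.index (a colon inside the brackets of df.index[], whatever the length of the slice, even a single item):
- if df.index has not been customized, a pandas.core.indexes.range.RangeIndex of positions is returned
- if df.index has been customized, a pandas.core.indexes.base.Index of labels is returned

df.index can be reassigned as a whole, but its individual elements cannot be modified (an error is raised)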

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])		# index is not specified when df is created
print('df\n',df); print('===========')
print('df.index\n',df.index,'\n',type(df.index)); print('===========')
print('df.index[2]\n',df.index[2],'\n',type(df.index[2])); print('===========')
print('df.index[2:3]\n',df.index[2:3],'\n',type(df.index[2:3])); print('===========')
# df.index can be reassigned as a whole
df.index=['t1','t2','t3','t4']
print('df\n',df,'\n',type(df)); print('===========')
print('df.index\n',df.index,'\n',type(df.index)); print('===========')
print('df.index[2]\n',df.index[2],'\n',type(df.index[2])); print('===========')
print('df.index[2:3]\n',df.index[2:3],'\n',type(df.index[2:3])); print('===========')
# individual elements of df.index cannot be modified (raises TypeError)
df.index[2]='New'
print(df)
Output:
df
    c1  c2  c3  c4
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
===========
df.index
 RangeIndex(start=0, stop=4, step=1)
 <class 'pandas.core.indexes.range.RangeIndex'>
===========
df.index[2]
 2
 <class 'int'>
===========
df.index[2:3]
 RangeIndex(start=2, stop=3, step=1)
 <class 'pandas.core.indexes.range.RangeIndex'>
===========
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
 <class 'pandas.core.frame.DataFrame'>
===========
df.index
 Index(['t1', 't2', 't3', 't4'], dtype='object')
 <class 'pandas.core.indexes.base.Index'>
===========
df.index[2]
 t3
 <class 'str'>
===========
df.index[2:3]
 Index(['t3'], dtype='object')
 <class 'pandas.core.indexes.base.Index'>
===========
Error raised (TypeError: Index does not support mutable operations)

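(3) df.index.name and df.index.names

For a df with a single-level index, df.index.name and df.index.names refer to the same thing, and assigning to one overwrites the other. See Examples 1 and 2.

For a df with a hierarchical index, df.index.name is a single value (a name for the hierarchical index as a whole), while df.index.names is list-like (one name per index level). See Example 3. The output shown there comes from the author's pandas version; newer versions may display names=[...] in the MultiIndex repr instead.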

# Example 1: with a single-level index, df.index.names overwrites df.index.name
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3])
df.columns = ['c1']
df.index = ['t1','t2','t3']
df.index.name = 'my_index'
df.index.names = ['t']      # overwrites the df.index.name set on the previous line
print(df); print('===========')
print(df.index); print('===========')
print(df.index.name); print('===========')
print(df.index.names)
Output:
    c1
t
t1   1
t2   2
t3   3
===========
Index(['t1', 't2', 't3'], dtype='object', name='t')
===========
t
===========
['t']

# Example 2: with a single-level index, df.index.name overwrites df.index.names
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3])
df.columns = ['c1']
df.index = ['t1','t2','t3']
df.index.names = ['t']
df.index.name = 'my_index'      # overwrites the df.index.names set on the previous line
print(df); print('===========')
print(df.index); print('===========')
print(df.index.name); print('===========')
print(df.index.names)
Output:
          c1
my_index
t1         1
t2         2
t3         3
===========
Index(['t1', 't2', 't3'], dtype='object', name='my_index')
===========
my_index
===========
['my_index']

# Example 3: with a hierarchical index, df.index.name and df.index.names are independent of each other
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8])
df.columns = ['c1']
df.index = [['A','A','B','B','C','C','D','D'],
            ['e','f','e','g','f','h','g','h']]
df.index.name = 'my_multi_index'
df.index.names = ['i1','i2']
print(df); print('===========')
print(df.index); print('===========')
print(df.index.name); print('===========')
print(df.index.names)
Output:
       c1
i1 i2
A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
===========
MultiIndex([('A', 'e'),
            ('A', 'f'),
            ('B', 'e'),
            ('B', 'g'),
            ('C', 'f'),
            ('C', 'h'),
            ('D', 'g'),
            ('D', 'h')],
           name='my_multi_index')
===========
my_multi_index
===========
['i1', 'i2']
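
Level names can also be supplied when the index is built, which avoids relying on how name and names interact. A sketch using pd.MultiIndex.from_arrays (the level names 'i1' and 'i2' are taken from Example 3; 'outer' and the variable names are made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['e', 'f', 'e', 'g']],
    names=['i1', 'i2'],
)
df = pd.DataFrame({'c1': [1, 2, 3, 4]}, index=idx)
names_before = list(df.index.names)
print(names_before)

# set_names returns a new index with the chosen level renamed
df.index = df.index.set_names('outer', level=0)
names_after = list(df.index.names)
print(names_after)
```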

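(4) df.index.levels

Returns the levels of df's hierarchical index as a list-like object; each level holds the unique labels of that level

Note: when no row index has been set, df.index is an instance of pandas.core.indexes.range.RangeIndex; with a single-level index it is an instance of pandas.core.indexes.base.Index. In both cases df.index.levels raises an AttributeError, because these classes have no such attribute. Only an index with two or more levels (an instance of pandas.core.indexes.multi.MultiIndex) has the levels attribute.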

# One-level index
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3])
df.columns = ['c1']
df.index = ['t1','t2','t3']
print(type(df.index))
print(df.index.levels)
Output:
<class 'pandas.core.indexes.base.Index'>
Error raised (AttributeError)

# Two-level index
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3])
df.columns = ['c1']
df.index = [['A','B','C'],['t1','t2','t3']]
print(type(df.index))
print(df.index.levels)
Output:
<class 'pandas.core.indexes.multi.MultiIndex'>
[['A', 'B', 'C'], ['t1', 't2', 't3']]

# Three-level index
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3])
df.columns = ['c1']
df.index = [[10,20,30],['A','B','C'],['t1','t2','t3']]
print(type(df.index))
print(df.index.levels)
Output:
<class 'pandas.core.indexes.multi.MultiIndex'>
[[10, 20, 30], ['A', 'B', 'C'], ['t1', 't2', 't3']]
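
In the examples above every label is unique, so levels happens to look like the index itself. In general, levels holds the unique labels of each level rather than one entry per row; index.get_level_values() gives the per-row labels. A small sketch with made-up labels:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3]})
df.index = [['A', 'A', 'B'], ['x', 'y', 'x']]

unique_outer = list(df.index.levels[0])             # unique labels of level 0
per_row_outer = list(df.index.get_level_values(0))  # one label per row

print(unique_outer, per_row_outer)
```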

(5) df.dtypes

Returns the data type of each column, i.e. the dtype of each column Series in df

The return value as a whole is a Series

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,'a',3+4j],[2,'b',5+6j]],columns=['c1','c2','c3'],index=['t1','t2'])
print(df.dtypes)
print('==============')
print(type(df.dtypes))
Output:
c1         int64
c2        object
c3    complex128
dtype: object
==============
<class 'pandas.core.series.Series'>

(6) df.shape

Returns a tuple whose two items are the number of rows and the number of columns of df

import numpy as np
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame([10,11])				# pd.DataFrame() turns a 1-D list into a column vector
df3 = pd.DataFrame(np.array([10,11]))	# pd.DataFrame() turns a 1-D numpy.ndarray into a column vector
df4 = pd.DataFrame([[10,11]])
df5 = pd.DataFrame([[10],[11]])
df6 = pd.DataFrame([[10,11],[12,13]])
print(df1.shape,type(df1.shape))
print(df2.shape,type(df2.shape))
print(df3.shape,type(df3.shape))
print(df4.shape,type(df4.shape))
print(df5.shape,type(df5.shape))
print(df6.shape,type(df6.shape))
Output:
(0, 0) <class 'tuple'>
(2, 1) <class 'tuple'>
(2, 1) <class 'tuple'>
(1, 2) <class 'tuple'>
(2, 1) <class 'tuple'>
(2, 2) <class 'tuple'>

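(7) df.values

Returns all values of df, without the row and column indexes, as a numpy.ndarray. For sample code see "6. Selecting data from a DataFrame - (5) Selecting with df attributes and methods"

(8) df.column_label

Returns one column of df as a Series. Note that the column label is written without quotes, so this only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute. For sample code see "6. Selecting data from a DataFrame - (5) Selecting with df attributes and methods"

(9) df.empty

Tells whether df is empty; returns a bool. When df=pd.DataFrame(), df is empty and True is returned. As soon as df contains any entries, even if all of them are missing values, df is not empty (False is returned).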
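
To make the distinction concrete, here is a small sketch of df.empty on three made-up frames. Note that a frame with column labels but no rows also counts as empty:

```python
import numpy as np
import pandas as pd

no_data = pd.DataFrame()
only_columns = pd.DataFrame(columns=['c1', 'c2'])  # column labels but zero rows
all_nan = pd.DataFrame({'c1': [np.nan, np.nan]})   # two rows, all values missing

print(no_data.empty, only_columns.empty, all_nan.empty)

# to test for "nothing but NaN", drop the all-missing rows first
print(all_nan.dropna(how='all').empty)
```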

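6. Selecting data from a DataFrame

(1) Selecting with df[]

df[] supports the following:

df['column_label'] returns one column (a Series)
df[['column_label','column_label']] returns one or more columns, not necessarily contiguous (fancy indexing) (a DataFrame)
df['row_label':'row_label'], df[:'row_label'], df['row_label':] return one or more contiguous rows (a DataFrame)
df[row_position:row_position], df[:row_position], df[row_position:] return one or more contiguous rows (a DataFrame)
When the row index of df is a pandas.core.indexes.datetimes.DatetimeIndex or pandas.core.indexes.period.PeriodIndex, older pandas versions also accepted partial-string lookups such as df['year-month'], df['year.month'], or df['year'] to get the matching rows (a DataFrame). This form of df[] was deprecated in pandas 1.2 and removed in 2.0; use df.loc['year-month'] instead. See "14. Time-related formats and methods in Pandas - (1) Time formats and special indexing and slicing methods in Pandas" in this chapter

Operations that df[] does not support include, among others:

slicing contiguous columns
selecting non-contiguous rows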
# Example 1: the first four ways of indexing
import numpy as np
import pandas as pd
arr=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
print("df\n",df); print('===========')
# df['column_label'] returns one column (a Series)
print("column 'c3'\n",df['c3'],'\n',type(df['c3'])); print('===========')
# df[['column_label','column_label']] returns several, possibly non-contiguous, columns (fancy indexing) (a DataFrame)
print("columns 'c3','c1'\n",df[['c3','c1']],'\n',type(df[['c3','c1']])); print('===========')
# df['row_label':'row_label'], df[:'row_label'], df['row_label':] return one or more contiguous rows (a DataFrame)
print("row 't3'\n",df['t3':'t3'],'\n',type(df['t3':'t3'])); print('===========')
print("first row through row 't3'\n",df[:'t3'],'\n',type(df[:'t3'])); print('===========')
# df[row_position:row_position], df[:row_position], df[row_position:] return one or more contiguous rows (a DataFrame)
print("third row\n",df[2:3],'\n',type(df[2:3])); print('===========')
print("first row through third row\n",df[:3],'\n',type(df[:3]))
Output:
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
column 'c3'
 t1     3
t2     7
t3    11
t4    15
Name: c3, dtype: int32
 <class 'pandas.core.series.Series'>
===========
columns 'c3','c1'
     c3  c1
t1   3   1
t2   7   5
t3  11   9
t4  15  13
 <class 'pandas.core.frame.DataFrame'>
===========
row 't3'
     c1  c2  c3  c4
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
first row through row 't3'
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
third row
     c1  c2  c3  c4
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
first row through third row
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
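
For a DatetimeIndex, partial-string row selection on current pandas goes through df.loc[]. A sketch with made-up dates (the column name 'v' and the variable names are illustrative):

```python
import pandas as pd

dates = pd.date_range('2020-01-30', periods=5, freq='D')  # Jan 30 through Feb 3
df = pd.DataFrame({'v': range(5)}, index=dates)

feb = df.loc['2020-02']                   # every row that falls in February 2020
span = df.loc['2020-01-31':'2020-02-01']  # label slice, both end points included

print(feb)
print(span)
```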

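(2) Label-based selection with df.loc[]

loc is short for location

The kinds of index that df.loc[] accepts are as follows:

	df has column labels	df has no column labels
df has row labels	Supported: row labels, column labels. Not supported: row positions, column positions	Supported: row labels, column positions. Not supported: row positions, column labels
df has no row labels	Supported: row positions, column labels. Not supported: row labels, column positions	Supported: row positions, column positions. Not supported: row labels, column labels

In one sentence: df.loc[] always matches labels. When an axis has no custom labels, its labels are the default integers 0, 1, 2, ..., which is the only reason position-like integers work there.

Note that slicing with a colon : in df.loc[] includes both end points. This also holds for the default integer labels (df.loc[0:2] returns three rows), unlike positional slicing, which excludes the end point. Also remember that the default integers start at 0.

df.loc[] can select any rows and any columns. If one row and one column are selected, the result has the type of the element itself; if only one of the two is a single row or column, the result is a Series; otherwise the result is a DataFrame.

For label indexing with pd.date_range(), see "10. Methods of DataFrame objects and of the Pandas module - (4) Time-related methods - ① pd.date_range()" in this chapter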

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
print("df\n",df); print('===========')
# df.loc['row_label'] returns one row (a Series)
print("row 't2'\n",df.loc['t2'],'\n',type(df.loc['t2'])); print('===========')
# df.loc['row_label':'row_label'] returns contiguous rows (both end points included) (a DataFrame)
print("rows 't2' through 't3' (inclusive)\n",df.loc['t2':'t3'],'\n',type(df.loc['t2':'t3'])); print('===========')
# df.loc[['row_label','row_label']] returns non-contiguous rows (fancy indexing) (a DataFrame)
print("rows 't3' and 't1'\n",df.loc[['t3','t1']],'\n',type(df.loc[['t3','t1']])); print('===========')
# df.loc[:,'column_label'] returns one column (a Series)
print("column 'c2'\n",df.loc[:,'c2'],'\n',type(df.loc[:,'c2'])); print('===========')
# df.loc[:,'column_label':'column_label'] returns contiguous columns (both end points included) (a DataFrame)
print("columns 'c2' through 'c3' (inclusive)\n",df.loc[:,'c2':'c3'],'\n',type(df.loc[:,'c2':'c3'])); print('===========')
# df.loc[:,['column_label','column_label']] returns non-contiguous columns (fancy indexing) (a DataFrame)
print("columns 'c3' and 'c1'\n",df.loc[:,['c3','c1']],'\n',type(df.loc[:,['c3','c1']])); print('===========')
# df.loc['row_label','column_label'] returns a single element (with the type of that element)
print("element at row 't4', column 'c4'\n",df.loc['t4','c4'],'\n',type(df.loc['t4','c4'])); print('===========')
# Any combination of the forms above selects several rows and columns (one row and one column: the element's own type; exactly one of them single: a Series; otherwise: a DataFrame)
print("rows 't2' through 't3', columns 'c2' through 'c3'\n",df.loc['t2':'t3','c2':'c3'],'\n',type(df.loc['t2':'t3','c2':'c3'])); print('===========')
print("row 't4', columns 'c4' and 'c1'\n",df.loc['t4',['c4','c1']],'\n',type(df.loc['t4',['c4','c1']]))
Output:
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
row 't2'
 c1    5
c2    6
c3    7
c4    8
Name: t2, dtype: int32
 <class 'pandas.core.series.Series'>
===========
rows 't2' through 't3' (inclusive)
     c1  c2  c3  c4
t2   5   6   7   8
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
rows 't3' and 't1'
     c1  c2  c3  c4
t3   9  10  11  12
t1   1   2   3   4
 <class 'pandas.core.frame.DataFrame'>
===========
column 'c2'
 t1     2
t2     6
t3    10
t4    14
Name: c2, dtype: int32
 <class 'pandas.core.series.Series'>
===========
columns 'c2' through 'c3' (inclusive)
     c2  c3
t1   2   3
t2   6   7
t3  10  11
t4  14  15
 <class 'pandas.core.frame.DataFrame'>
===========
columns 'c3' and 'c1'
     c3  c1
t1   3   1
t2   7   5
t3  11   9
t4  15  13
 <class 'pandas.core.frame.DataFrame'>
===========
element at row 't4', column 'c4'
 16
 <class 'numpy.int32'>
===========
rows 't2' through 't3', columns 'c2' through 'c3'
     c2  c3
t2   6   7
t3  10  11
 <class 'pandas.core.frame.DataFrame'>
===========
row 't4', columns 'c4' and 'c1'
 c4    16
c1    13
Name: t4, dtype: int32
 <class 'pandas.core.series.Series'>
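
The inclusive-slice rule of df.loc[] also applies when the labels are the default integers, which is an easy place to slip. A sketch contrasting df.loc[] and df.iloc[] on a DataFrame without custom labels (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))  # default labels 0..3 (rows) and 0..2 (columns)

by_label = df.loc[1:2]      # labels 1 and 2: two rows
by_position = df.iloc[1:2]  # position 1 only: one row

print(by_label.shape, by_position.shape)
```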

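(3) Position-based selection with df.iloc[]

iloc is short for integer location

Even when a DataFrame has custom columns and index, df.iloc[] can still select by position

Unlike df.loc[], df.iloc[] only accepts positions, whether or not df has row and column labels. Positional slices with a colon : exclude the end point, and positions start at 0

df.iloc[] can select any rows and any columns. If one row and one column are selected, the result has the type of the element itself; if only one of the two is a single row or column, the result is a Series; otherwise the result is a DataFrame

In particular, df.iloc[] can reverse the order:

df.iloc[::-1,:]: all rows in reverse order
df.iloc[:,::-1]: all columns in reverse order
df.iloc[::-1,::-1]: all rows and all columns in reverse order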

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
print("df\n",df); print('===========')
# df.iloc[row_position] returns one row (a Series)
print("last row\n",df.iloc[-1],'\n',type(df.iloc[-1])); print('===========')
# df.iloc[row_position:row_position] returns contiguous rows (end point excluded) (a DataFrame)
print("second row through second-to-last row\n",df.iloc[1:-1],'\n',type(df.iloc[1:-1])); print('===========')
# df.iloc[[row_position,row_position]] returns non-contiguous rows (fancy indexing) (a DataFrame)
print("second-to-last row and first row\n",df.iloc[[-2,0]],'\n',type(df.iloc[[-2,0]])); print('===========')
# df.iloc[:,column_position] returns one column (a Series)
print("last column\n",df.iloc[:,-1],'\n',type(df.iloc[:,-1])); print('===========')
# df.iloc[:,column_position:column_position] returns contiguous columns (end point excluded) (a DataFrame)
print("second column through second-to-last column\n",df.iloc[:,1:-1],'\n',type(df.iloc[:,1:-1])); print('===========')
# df.iloc[:,[column_position,column_position]] returns non-contiguous columns (fancy indexing) (a DataFrame)
print("second-to-last column and first column\n",df.iloc[:,[-2,0]],'\n',type(df.iloc[:,[-2,0]])); print('===========')
# df.iloc[row_position,column_position] returns a single element (with the type of that element)
print("element at last row, last column\n",df.iloc[-1,-1],'\n',type(df.iloc[-1,-1])); print('===========')
# Any combination of the forms above selects several rows and columns (one row and one column: the element's own type; exactly one of them single: a Series; otherwise: a DataFrame)
print("first row through second-to-last row, third column through last column\n",df.iloc[:-1,2:],'\n',type(df.iloc[:-1,2:])); print('===========')
print("fourth row, last column and third-to-last column\n",df.iloc[3,[-1,-3]],'\n',type(df.iloc[3,[-1,-3]])); print('===========')
# Reversing the order with df.iloc[]
print("all rows reversed\n",df.iloc[::-1,:]); print('===========')
print("all columns reversed\n",df.iloc[:,::-1]); print('===========')
print("all rows and all columns reversed\n",df.iloc[::-1,::-1])
Output:
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
last row
 c1    13
c2    14
c3    15
c4    16
Name: t4, dtype: int32
 <class 'pandas.core.series.Series'>
===========
second row through second-to-last row
     c1  c2  c3  c4
t2   5   6   7   8
t3   9  10  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
second-to-last row and first row
     c1  c2  c3  c4
t3   9  10  11  12
t1   1   2   3   4
 <class 'pandas.core.frame.DataFrame'>
===========
last column
 t1     4
t2     8
t3    12
t4    16
Name: c4, dtype: int32
 <class 'pandas.core.series.Series'>
===========
second column through second-to-last column
     c2  c3
t1   2   3
t2   6   7
t3  10  11
t4  14  15
 <class 'pandas.core.frame.DataFrame'>
===========
second-to-last column and first column
     c3  c1
t1   3   1
t2   7   5
t3  11   9
t4  15  13
 <class 'pandas.core.frame.DataFrame'>
===========
element at last row, last column
 16
 <class 'numpy.int32'>
===========
first row through second-to-last row, third column through last column
     c3  c4
t1   3   4
t2   7   8
t3  11  12
 <class 'pandas.core.frame.DataFrame'>
===========
fourth row, last column and third-to-last column
 c4    16
c2    14
Name: t4, dtype: int32
 <class 'pandas.core.series.Series'>
===========
all rows reversed
     c1  c2  c3  c4
t4  13  14  15  16
t3   9  10  11  12
t2   5   6   7   8
t1   1   2   3   4
===========
all columns reversed
     c4  c3  c2  c1
t1   4   3   2   1
t2   8   7   6   5
t3  12  11  10   9
t4  16  15  14  13
===========
all rows and all columns reversed
     c4  c3  c2  c1
t4  16  15  14  13
t3  12  11  10   9
t2   8   7   6   5
t1   4   3   2   1

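(4) Mixed indexing with df.ix[] (removed from Pandas)

df.ix[] accepted both positions and labels; positional slices with a colon : excluded the end point, and label slices included both end points. It was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions df.ix raises AttributeError; use df.loc[] and df.iloc[] instead.

df.ix[] could select any rows and any columns. If one row and one column were selected, the result had the type of the element itself; if only one of the two was a single row or column, the result was a Series; otherwise the result was a DataFrame

For label indexing with pd.date_range(), see "10. Methods of DataFrame objects and of the Pandas module - (4) Time-related methods - ① pd.date_range()" in this chapter

Notes:

Without a comma inside the brackets, the argument of df.ix[] was treated as a row position or row label
In df.ix[,], rows could use one kind of index and columns another
Positions and labels could not be mixed on the two sides of a colon :
Positions and labels could not be mixed inside a fancy-index list []
Before its removal, using df.ix[] triggered a deprecation warning

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
print("df\n",df); print('===========')
# Each line uses a loc/iloc equivalent that runs on current pandas; the old df.ix[] call is shown in the trailing comment
print("last row\n",df.iloc[-1]); print('===========')						# was df.ix[-1]
print("first row through row 't3'\n",df.loc[:'t3']); print('===========')	# was df.ix[:'t3']
print("first row through third-to-last row, columns 'c4' and 'c3'\n",df.iloc[:-2][['c4','c3']])	# was df.ix[:-2, ['c4','c3']]
# Invalid even when df.ix[] existed:
# df.ix[1:'t1', 2]			# positions and labels mixed across a colon
# df.ix[1:, [1,'c3']]		# positions and labels mixed inside a fancy-index list
Output:
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
last row
 c1    13
c2    14
c3    15
c4    16
Name: t4, dtype: int32
===========
first row through row 't3'
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
===========
first row through third-to-last row, columns 'c4' and 'c3'
     c4  c3
t1   4   3
t2   8   7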
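(5) Selecting with df attributes and methods

The following attributes and methods are available:

df.column_label: returns one column as a Series; note that the column label is written without quotes
df.values: returns all values, without the row and column indexes, as a numpy.ndarray
df.head(n=5): returns the first n rows as a DataFrame; n defaults to 5
df.tail(n=5): returns the last n rows as a DataFrame; n defaults to 5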

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
print("df\n",df); print('===========')
# df.column_label returns one column (a Series)
print("column 'c3'\n",df.c3,'\n',type(df.c3)); print('===========')
# df.column_position cannot be used to get a column
# df.row_label cannot be used to get a row
# df.row_position cannot be used to get a row
# df.values returns all values, without the row and column indexes (a numpy.ndarray)
print("all values of df\n",df.values,'\n',type(df.values))
# df.head(n) returns the first n rows (a DataFrame)
print("first 2 rows\n",df.head(2),'\n',type(df.head(2))); print('===========')
# df.tail(n) returns the last n rows (a DataFrame)
print("last 2 rows\n",df.tail(2),'\n',type(df.tail(2)))
Output:
df
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
t3   9  10  11  12
t4  13  14  15  16
===========
column 'c3'
 t1     3
t2     7
t3    11
t4    15
Name: c3, dtype: int32
 <class 'pandas.core.series.Series'>
===========
all values of df
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]
 <class 'numpy.ndarray'>
first 2 rows
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
 <class 'pandas.core.frame.DataFrame'>
===========
last 2 rows
     c1  c2  c3  c4
t3   9  10  11  12
t4  13  14  15  16
 <class 'pandas.core.frame.DataFrame'>

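(6) Filtering rows with boolean indexing

To pick the entire rows (or selected fields of those rows) for which a column's values satisfy a given condition, use the following forms:

# Select the rows that satisfy a condition
df[list of booleans]
df[condition built from df.column_label]
df[condition built from df['column_label']]
df.loc[condition built from df.column_label, :]
df.loc[condition built from df['column_label'], :]
# Select rows by condition together with specific columns
df.loc[condition built from df.column_label, list of column labels]
df.loc[condition built from df['column_label'], list of column labels]

The logical operators between conditions are |, &, ~, np.logical_or(), np.logical_and(), and np.logical_not(). Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators. For details on each operator see "Chapter 5 Advanced Python programming - I. The NumPy module - 8. Methods of ndarray objects and of the NumPy module - (2) Binary universal functions - ③ Basic logical operations"

Because booleans behave like numbers (True=1, False=0), conditions can be multiplied by 1 and joined with +, which makes it possible to filter on how many conditions are satisfied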

import numpy as np
import pandas as pd
df = pd.DataFrame([[10,8,6],[8,15,13],[13,7,14],[9,9,11]],columns=['c1','c2','c3'],index=['t1','t2','t3','t4'])
print(df); print('===========')
print(df.c1>9); print('===========')                        # df['c1']>9 works too
print(df[df.c1>9]); print('===========')                    # df[df['c1']>9] works too
print(df[(df.c1>9) & (df.c3>9)]); print('===========')      # rows that satisfy both conditions (and)
print(df[(df.c1>9) | (df.c3>9)]); print('===========')      # rows that satisfy either condition (or)
print(df[(df.c1>9)*1 + (df.c2>9)*1 + (df.c3>9)*1 >=2 ])     # rows that satisfy at least two of the three conditions
print('===========')
print(df.loc[df.c1.isin([8,9,22,33]),['c2','c3']])          # columns 'c2' and 'c3' of the rows whose 'c1' value is in the list
Output:
    c1  c2  c3
t1  10   8   6
t2   8  15  13
t3  13   7  14
t4   9   9  11
===========
t1     True
t2    False
t3     True
t4    False
Name: c1, dtype: bool
===========
    c1  c2  c3
t1  10   8   6
t3  13   7  14
===========
    c1  c2  c3
t3  13   7  14
===========
    c1  c2  c3
t1  10   8   6
t2   8  15  13
t3  13   7  14
t4   9   9  11
===========
    c1  c2  c3
t2   8  15  13
t3  13   7  14
===========
    c2  c3
t2  15  13
t4   9  11
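
The "at least two of three conditions" filter can also be written by summing a boolean DataFrame along each row. This is a minimal sketch of the same idea, reusing the data from the example; the names hits and at_least_two are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame([[10, 8, 6], [8, 15, 13], [13, 7, 14], [9, 9, 11]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3', 't4'])

# df > 9 is a boolean DataFrame; summing along axis=1 counts the True
# values in each row, i.e. how many columns exceed 9.
hits = (df > 9).sum(axis=1)
at_least_two = df[hits >= 2]

print(hits)
print(at_least_two)
```

This form scales to any number of columns without writing one term per condition, but it only fits when every column is tested against the same threshold.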

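（7）Filtering rows with df.query()

Syntax: df.query('expression written as a str using column labels', inplace=False)

Selects the rows that satisfy a string expression built from column labels and returns the result as a DataFrame. Note:

A Python variable used inside the expression must be prefixed with @
A column label that contains a space must be enclosed in backticks (`)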

import numpy as np
import pandas as pd
df = pd.DataFrame([[10,8,6],[8,15,13],[13,7,14],[9,9,11]],
                  columns=['c1','c2','c 3'],
                  index=['t1','t2','t3','t4'])
var = 9
print(df); print('===========')
print(df.query('c1 > 9')); print('===========')
print(df.query('c1 > @var')); print('===========')
# print(df.query('c1 > c 3')); print('===========')			# raises an error: the label contains a space
print(df.query('c1 > `c 3`'))
Output:
    c1  c2  c 3
t1  10   8    6
t2   8  15   13
t3  13   7   14
t4   9   9   11
===========
    c1  c2  c 3
t1  10   8    6
t3  13   7   14
===========
    c1  c2  c 3
t1  10   8    6
t3  13   7   14
===========
    c1  c2  c 3
t1  10   8    6
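
Several conditions can be combined inside one query string. The sketch below assumes a hypothetical list named wanted and joins a membership test with a comparison on the backtick-quoted column:

```python
import pandas as pd

df = pd.DataFrame([[10, 8, 6], [8, 15, 13], [13, 7, 14], [9, 9, 11]],
                  columns=['c1', 'c2', 'c 3'], index=['t1', 't2', 't3', 't4'])

wanted = [8, 9, 13]  # hypothetical list of accepted c1 values

# 'in' tests membership in the @-referenced list; 'and' joins the conditions
result = df.query('c1 in @wanted and `c 3` > 11')
print(result)
```

The same filter in boolean-index form would be df[df['c1'].isin(wanted) & (df['c 3'] > 11)].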

（8）Looping over the rows of df

Use the generator returned by df.iterrows(); see "10. DataFrame methods and pandas module functions - （5）Other important methods - ② df.iterrows()" in this chapter
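
As a quick illustration before that section, here is a minimal sketch of the loop; the dictionary name totals is introduced only for this example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

totals = {}
for label, row in df.iterrows():
    # label is the row label; row is a Series indexed by the column labels
    totals[label] = int(row['c1'] + row['c2'])

print(totals)
```

For simple arithmetic like this, the vectorized form df['c1'] + df['c2'] is usually preferable; the loop is useful when each row needs logic that cannot be expressed column-wise.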

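7. Vectorization, alignment, and broadcasting of DataFrames

（1）Vectorization and alignment

Vectorization for a DataFrame is more general than for a numpy.ndarray or a Series:

Arithmetic operations (+, -, *, /, **, //, %)
- When two DataFrames have exactly the same shape, row labels, and column labels, arithmetic is applied element by element and the result is a DataFrame of the same shape
- When the shape, row labels, or column labels are not all identical, arithmetic still works: only positions whose row label and column label both match are computed, and every other position becomes NaN. The result is at least as large as either operand, because its row labels are the union of the operands' row labels and its column labels are the union of their column labels. This rule is called DataFrame alignment.
- A DataFrame can be combined with a two-dimensional numpy.ndarray of the same shape; the result is a DataFrame with the same shape, row labels, and column labels
- In the author's tests, a DataFrame could not be combined with a two-dimensional numpy.ndarray of a different shape, nor with any two-dimensional list, even one of the same shape (an error was raised)
Comparison operations (>, <, >=, <=, ==, !=)
- When two DataFrames have exactly the same shape, row labels, and column labels, they can be compared; the result is a DataFrame of the same shape whose values are True or False
- When the shape, row labels, or column labels are not all identical, the comparison raises an error; in this case the DataFrames are not aligned
- A DataFrame can be compared with a two-dimensional numpy.ndarray of the same shape; the result is a DataFrame with the same shape, row labels, and column labels whose values are True or False
- In the author's tests, a DataFrame could not be compared with a two-dimensional numpy.ndarray of a different shape, nor with any two-dimensional list, even one of the same shape (an error was raised)

Summary on alignment: when two DataFrames differ in shape, row labels, or column labels, arithmetic operations align them automatically, whereas comparison operations do not align them and raise an error instead

Handling the NaN values produced by alignment: see section 9 of this chapter (NaN handling in a DataFrame)

Neither group above covers operations between a DataFrame and a Series. A DataFrame is always two-dimensional and a Series is always one-dimensional, so the two can only be combined by broadcasting, which is not vectorization in the sense used here; the rules are given in the "Broadcasting" part of this chapter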

# DataFrame alignment
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])
df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns=['c0','c3','c1'],index=['t0','t2','t1'])
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df1 + df2\n',df1 + df2); print('===========')
print('df1 // df2\n',df1 // df2); print('===========')
print('df1 > df2\n',df1 > df2)   # raises ValueError: the labels are not identical
Output:
df1
     c1  c2  c3  c4
t1  10  20  30  40
t2  50  60  70  80
===========
df2
     c0  c3  c1
t0   1   2   3
t2   4   5   6
t1   7   8   9
===========
df1 + df2
     c0    c1  c2    c3  c4
t0 NaN   NaN NaN   NaN NaN
t1 NaN  19.0 NaN  38.0 NaN
t2 NaN  56.0 NaN  75.0 NaN
===========
df1 // df2
     c0   c1  c2    c3  c4
t0 NaN  NaN NaN   NaN NaN
t1 NaN  1.0 NaN   3.0 NaN
t2 NaN  8.0 NaN  14.0 NaN
===========
Error (ValueError: Can only compare identically-labeled DataFrame objects)

# Two DataFrames can be compared with >, <, >=, <=, ==, != only when their shape, row labels, and column labels are all identical
import numpy as np
import pandas as pd
df1 = pd.DataFrame([[3,4],[5,6]],columns=['c1','c2'],index=['t1','t2'])
df2 = pd.DataFrame([[1,2],[7,8]],columns=['c1','c2'],index=['t1','t2'])
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df1 < df2\n',df1 < df2)
Output:
df1
     c1  c2
t1   3   4
t2   5   6
===========
df2
     c1  c2
t1   1   2
t2   7   8
===========
df1 < df2
        c1     c2
t1  False  False
t2   True   True
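
When the NaN values produced by alignment are unwanted, the method form of the operator accepts a fill_value. The sketch below uses two small hypothetical frames and compares df1 + df2 with df1.add(df2, fill_value=0):

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]], columns=['c1', 'c2'], index=['t1', 't2'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c3'], index=['t2', 't3'])

# Operator form: every position missing from either operand becomes NaN
plain = df1 + df2

# Method form: a value missing from only one operand is treated as 0;
# positions missing from both operands remain NaN
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

Only the cell ('t2', 'c2') exists in both frames, so it is the only non-NaN cell of plain. In filled, only ('t1', 'c3') and ('t3', 'c1') remain NaN, because neither frame has a value there.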

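（2）Broadcasting

① DataFrame with a number

For +, -, *, /, **, //, %, >, <, >=, <=, ==, != between a DataFrame and a number, every element of the DataFrame is combined with that number, and the results form a DataFrame with the same shape, row labels, and column labels as the original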

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df // 2); print('===========')
print(df < 2)
Output:
    c1  c2
t1   1   2
t2   3   4
===========
    c1  c2
t1   0   1
t2   1   2
===========
       c1     c2
t1   True  False
t2  False  False

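② DataFrame with a one-dimensional object

DataFrame and DataFrame:

There is no broadcasting between two DataFrames. A DataFrame is two-dimensional by nature, so whatever the number of rows and columns, an operation between two DataFrames is a vectorized operation; see "Vectorization and alignment" in this chapter

DataFrame and one-dimensional list:

Broadcasting requires that the number of columns of the DataFrame equals the length of the list. The list is matched against the columns by position and applied to every row.

DataFrame and one-dimensional numpy.ndarray:

Broadcasting requires that the number of columns of the DataFrame equals the length of the array. The array is matched against the columns by position and applied to every row.

DataFrame and Series:

For a useful result, the length of the Series should equal the number of columns of the DataFrame and the index of the Series should correspond one-to-one with the columns of the DataFrame (the order may differ). The Series is matched against the columns by label and applied to every row; labels that do not match produce NaN.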

# Broadcasting between a DataFrame and a one-dimensional list
import numpy as np
import pandas as pd
df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])
li = [1,2,100,100]   # length equals the number of columns, so the result is correct
print('查看df\n',df); print('===========')
print('查看li\n',li); print('===========')
print('df + li\n',df + li); print('===========')
print('df > li\n',df > li)
Output:
查看df
     c1  c2  c3  c4
t1  10  20  30  40
t2  50  60  70  80
===========
查看li
 [1, 2, 100, 100]
===========
df + li
     c1  c2   c3   c4
t1  11  22  130  140
t2  51  62  170  180
===========
df > li
       c1    c2     c3     c4
t1  True  True  False  False
t2  True  True  False  False

# Broadcasting between a DataFrame and a one-dimensional numpy.ndarray
import numpy as np
import pandas as pd
df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])
arr = np.array([1,2,100,100])   # length equals the number of columns, so the result is correct
print('查看df\n',df); print('===========')
print('查看arr\n',arr); print('===========')
print('df + arr\n',df + arr); print('===========')
print('df > arr\n',df > arr)
Output:
查看df
     c1  c2  c3  c4
t1  10  20  30  40
t2  50  60  70  80
===========
查看arr
 [  1   2 100 100]
===========
df + arr
     c1  c2   c3   c4
t1  11  22  130  140
t2  51  62  170  180
===========
df > arr
       c1    c2     c3     c4
t1  True  True  False  False
t2  True  True  False  False

# Broadcasting between a DataFrame and a Series
import numpy as np
import pandas as pd
df = pd.DataFrame([[10,20,30,40],[50,60,70,80]],columns=['c1','c2','c3','c4'],index=['t1','t2'])
s1 = pd.Series([1,2,100,100])                   # default index 0..3 matches no column label, so nothing aligns
s2 = pd.Series([100, 100, 1, 2], index=['c3','c4','c1','c2'])  # the order may differ; the labels still align
print('查看df\n',df); print('===========')
print('查看s1\n',s1); print('===========')
print('df + s1（结果错误）\n',df + s1); print('===========')
print('df > s1（结果错误）\n',df > s1); print('===========')
print('查看df\n',df); print('===========')
print('查看s2\n',s2); print('===========')
print('df + s2（结果正确）\n',df + s2); print('===========')
print('df > s2（结果正确）\n',df > s2)
Output:
查看df
     c1  c2  c3  c4
t1  10  20  30  40
t2  50  60  70  80
===========
查看s1
 0      1
1      2
2    100
3    100
dtype: int64
===========
df + s1（结果错误）
     c1  c2  c3  c4   0   1   2   3
t1 NaN NaN NaN NaN NaN NaN NaN NaN
t2 NaN NaN NaN NaN NaN NaN NaN NaN
===========
df > s1（结果错误）
        c1     c2     c3     c4      0      1      2      3
t1  False  False  False  False  False  False  False  False
t2  False  False  False  False  False  False  False  False
===========
查看df
     c1  c2  c3  c4
t1  10  20  30  40
t2  50  60  70  80
===========
查看s2
 c3    100
c4    100
c1      1
c2      2
dtype: int64
===========
df + s2（结果正确）
     c1  c2   c3   c4
t1  11  22  130  140
t2  51  62  170  180
===========
df > s2（结果正确）
       c1    c2     c3     c4
t1  True  True  False  False
t2  True  True  False  False
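
All of the examples above match the one-dimensional object against the columns. To match a Series against the row labels instead, the method form of the operator takes an axis argument. A minimal sketch, assuming a hypothetical Series row_offsets indexed by the row labels:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [50, 60, 70, 80]],
                  columns=['c1', 'c2', 'c3', 'c4'], index=['t1', 't2'])

# Hypothetical per-row values, indexed by the row labels of df
row_offsets = pd.Series([1, 2], index=['t1', 't2'])

# axis=0 aligns the Series with the row labels, so each row gets its own offset
by_row = df.add(row_offsets, axis=0)
print(by_row)
```

The operator form df + row_offsets would try to align 't1' and 't2' with the column labels and return only NaN, which is why the method form is needed here.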

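8. Modifying, reshaping, and converting a DataFrame

（1）Adding a row or a column

① Adding a row or a column with df.loc[]

Syntax:

df.loc['new_row_label'] = data                           # add a row
df.loc[:,'new_column_label'] = data                      # add a column
df.loc['new_row_label','existing_column_label'] = data   # add a row (only some cells are assigned; the rest are NaN)
df.loc['existing_row_label','new_column_label'] = data   # add a column (only some cells are assigned; the rest are NaN)
df.loc['new_row_label','new_column_label'] = data        # add a row and a column (only some cells are assigned; the rest are NaN)

Example: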

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('==============')
df.loc['t3'] = [5, 6]
print(df); print('==============')
df.loc[:,'c3'] = [7, 8, 9]
print(df); print('==============')
df.loc['t4','c2'] = 10
print(df); print('==============')
df.loc['t2','c4'] = 11
print(df); print('==============')
df.loc['t5','c5'] = 12
print(df); print('==============')
print(df)
Output:
    c1  c2
t1   1   2
t2   3   4
==============
    c1  c2
t1   1   2
t2   3   4
t3   5   6
==============
    c1  c2  c3
t1   1   2   7
t2   3   4   8
t3   5   6   9
==============
     c1    c2   c3
t1  1.0   2.0  7.0
t2  3.0   4.0  8.0
t3  5.0   6.0  9.0
t4  NaN  10.0  NaN
==============
     c1    c2   c3    c4
t1  1.0   2.0  7.0   NaN
t2  3.0   4.0  8.0  11.0
t3  5.0   6.0  9.0   NaN
t4  NaN  10.0  NaN   NaN
==============
     c1    c2   c3    c4    c5
t1  1.0   2.0  7.0   NaN   NaN
t2  3.0   4.0  8.0  11.0   NaN
t3  5.0   6.0  9.0   NaN   NaN
t4  NaN  10.0  NaN   NaN   NaN
t5  NaN   NaN  NaN   NaN  12.0
==============
     c1    c2   c3    c4    c5
t1  1.0   2.0  7.0   NaN   NaN
t2  3.0   4.0  8.0  11.0   NaN
t3  5.0   6.0  9.0   NaN   NaN
t4  NaN  10.0  NaN   NaN   NaN
t5  NaN   NaN  NaN   NaN  12.0

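② Adding a column with df[]

Syntax: df['column_label'] = data

The length of data must equal the number of rows of df. data can be a list, tuple, range(), numpy.ndarray, Series, or DataFrame

Note: when data is a Series or a DataFrame, it must have an index that corresponds to the index of df (the order may differ, since pandas matches by label). Without such an index, automatic alignment makes every value of the new column NaN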

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,3],[2,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('==============')
print('list、tuple、range()的情况略'); print('==============')
df['c3'] = np.array([5,6])
print(df); print('==============')
df['c4'] = pd.Series([7,8],index=['t2','t1'])	# Series with a matching index; the order does not matter
print(df); print('==============')
df['c5'] = pd.Series([9,10])					# Series without a matching index: the new column is all NaN
print(df); print('==============')
df['c6'] = pd.DataFrame([11,12],index=['t2','t1'])	# DataFrame with a matching index; the order does not matter
print(df); print('==============')
df['c7'] = pd.DataFrame([13,14])					# DataFrame without a matching index: the new column is all NaN
print(df)
Output:
    c1  c2
t1   1   3
t2   2   4
==============
list、tuple、range()的情况略
==============
    c1  c2  c3
t1   1   3   5
t2   2   4   6
==============
    c1  c2  c3  c4
t1   1   3   5   8
t2   2   4   6   7
==============
    c1  c2  c3  c4  c5
t1   1   3   5   8 NaN
t2   2   4   6   7 NaN
==============
    c1  c2  c3  c4  c5  c6
t1   1   3   5   8 NaN  12
t2   2   4   6   7 NaN  11
==============
    c1  c2  c3  c4  c5  c6  c7
t1   1   3   5   8 NaN  12 NaN
t2   2   4   6   7 NaN  11 NaN

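③ Adding a row with df.append()

Syntax: df.append(series); see "II. The pandas module - 12. Merging DataFrames - （2）df.append()" in this chapter. Note that df.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat() instead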
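
On pandas versions where df.append() is no longer available, the same row can be added with pd.concat(). A minimal sketch; the names new_row and df2 are introduced only for this example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# The Series name becomes the row label; its index matches the columns of df
new_row = pd.Series({'c1': 5, 'c2': 6}, name='t3')

# to_frame().T turns the Series into a one-row DataFrame,
# which pd.concat then stacks underneath df
df2 = pd.concat([df, new_row.to_frame().T])
print(df2)
```

pd.concat() returns a new DataFrame and leaves df unchanged, just as df.append() did.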

④ Adding a column with df.assign()

Syntax: df = df.assign(new_column_label=expression)

Typically used to compute a new column from an expression over existing columns

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('==============')
df = df.assign(c3 = df['c1']/df['c2'])
print(df)
Output:
    c1  c2
t1   1   2
t2   3   4
==============
    c1  c2    c3
t1   1   2  0.50
t2   3   4  0.75
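
df.assign() also accepts callables, which receive the DataFrame being built. The sketch below (the column names total and share are hypothetical) adds two columns in one call, the second depending on the first:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# Each lambda receives the frame as built so far, so 'share' can use 'total'
out = df.assign(
    total=lambda d: d['c1'] + d['c2'],
    share=lambda d: d['c1'] / d['total'],
)
print(out)
```

Because assign() returns a new DataFrame, df itself keeps only its original two columns.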

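（2）Deleting rows or columns

① Deleting a single column: del

Syntax:

When the column labels are label indexes, only a label can be used: del df['column_label']
When the column labels are position indexes, only a position can be used: del df[column_position]

Note: del removes exactly one column; it does not accept a list or a slice to remove several columns at once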

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,3,5,7,9],[2,4,6,8,10]])
print(df); print('==============')
del df[1]
print(df); print('==============')
df.columns=['c1','c2','c3','c4']
df.index=['t1','t2']
print(df); print('==============')
del df['c2']
print(df)
Output:
   0  1  2  3   4
0  1  3  5  7   9
1  2  4  6  8  10
==============
   0  2  3   4
0  1  5  7   9
1  2  6  8  10
==============
    c1  c2  c3  c4
t1   1   5   7   9
t2   2   6   8  10
==============
    c1  c3  c4
t1   1   7   9
t2   2   8  10

② Deleting rows or columns: df.drop()

Syntax:

When the labels are label indexes, only labels can be used:

df.drop('label' or ['label','label',...], axis=0, inplace=False)

When the labels are position indexes, only positions can be used:

df.drop(position or [position,position,...], axis=0, inplace=False)

Parameters:

axis: default 0, delete rows; axis=1 deletes columns
inplace: default False, df itself is not modified; inplace=True modifies df directly

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(100,149).reshape(7, 7))
print(df); print('==============')
df = df.drop(0)						# delete row 0
df = df.drop([1,2])					# delete rows 1 and 2
print(df); print('==============')
df = df.drop(0,axis=1)				# delete column 0
df = df.drop([1,2],axis=1)			# delete columns 1 and 2
print(df); print('==============')
df.columns=['c1','c2','c3','c4']
df.index=['t1','t2','t3','t4']
print(df); print('==============')
df = df.drop('t1')					# delete row 't1'
df = df.drop(['t2','t3'])			# delete rows 't2' and 't3'
print(df); print('==============')
df = df.drop('c1',axis=1)			# delete column 'c1'
df = df.drop(['c2','c3'],axis=1)	# delete columns 'c2' and 'c3'
print(df)
Output:
     0    1    2    3    4    5    6
0  100  101  102  103  104  105  106
1  107  108  109  110  111  112  113
2  114  115  116  117  118  119  120
3  121  122  123  124  125  126  127
4  128  129  130  131  132  133  134
5  135  136  137  138  139  140  141
6  142  143  144  145  146  147  148
==============
     0    1    2    3    4    5    6
3  121  122  123  124  125  126  127
4  128  129  130  131  132  133  134
5  135  136  137  138  139  140  141
6  142  143  144  145  146  147  148
==============
     3    4    5    6
3  124  125  126  127
4  131  132  133  134
5  138  139  140  141
6  145  146  147  148
==============
     c1   c2   c3   c4
t1  124  125  126  127
t2  131  132  133  134
t3  138  139  140  141
t4  145  146  147  148
==============
     c1   c2   c3   c4
t4  145  146  147  148
==============
     c4
t4  148

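（3）Modifying values in a DataFrame

First select the data with df[], df.loc[], df.iloc[], or a similar indexer (df.ix[] is no longer available in current pandas releases), then assign with =

In particular, the result of a boolean selection can be assigned to, for example:

# turn every negative number into its positive counterpart
df[df<0] = -df
# set every PE value below 0 to 1000
df.loc[df['PE']<0,'PE'] = 1000
# add a column new_column and set it to 0 in the rows where PE is negative (it is NaN in all other rows)
df.loc[df['PE']<0,'new_column'] = 0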
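
The two df.loc assignments can be tried on a small frame. The data below is hypothetical (a single PE column with made-up values), and the column name flag is chosen only for this sketch:

```python
import pandas as pd

# Hypothetical data: 'PE' holds price/earnings ratios, some of them negative
df = pd.DataFrame({'PE': [12.0, -3.0, 8.5, -0.7]}, index=['a', 'b', 'c', 'd'])

# Create the flag first, while the negative PE values are still present;
# rows that are not selected get NaN in the new column
df.loc[df['PE'] < 0, 'flag'] = 0

# Then overwrite the negative PE values
df.loc[df['PE'] < 0, 'PE'] = 1000
print(df)
```

The order matters here: once the negative values have been replaced by 1000, the condition df['PE'] < 0 no longer selects any row.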

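（4）Replacing values in a DataFrame: df.replace()

Syntax:

df.replace(old_value, new_value, inplace=False)				# replace a single value
df.replace({old1: new1, old2: new2, ...}, inplace=False)		# replace several values

To replace several values at once, pass a dictionary whose keys are the old values and whose values are the new ones

Parameter inplace: default False, df itself is not modified; inplace=True modifies df directly

Note: only the values are replaced; label indexes and position indexes are left untouched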

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]])
df.columns=[3,4]
print(df); print('===========')
df.replace(3,33,inplace=True)
print(df); print('===========')
df.replace({2:22,4:44},inplace=True)
print(df)
Output:
   3  4
0  1  2
1  3  4
===========
    3  4
0   1  2
1  33  4
===========
    3   4
0   1  22
1  33  44

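（5）Sorting by the values of given columns or rows: df.sort_values()

Syntax: df.sort_values(by, axis=0, ascending=True, inplace=False)

Parameters:

by: a column label or row label (sort by one key), or a list of column labels or of row labels (sort by several keys). by must agree with axis: with axis=0, by='column_label' or ['column_label1','column_label2',...]; with axis=1, by='row_label' or ['row_label1','row_label2',...]
axis: default 0, reorder the rows according to the values of the given columns; axis=1 reorders the columns according to the values of the given rows
ascending: default True, ascending order; ascending=False sorts in descending order
inplace: default False, df itself is not modified; inplace=True modifies df directly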

import numpy as np
import pandas as pd
np.random.seed(1)
arr = np.random.randint(1,100,(4,4))
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])
df.index = pd.date_range('2019-1-31', periods=4, freq='M')
print('df\n',df); print('===========')
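# sort by the values of column c1, ascending
df = df.sort_values(by='c1')
# df.sort_values(by='c1',inplace=True)   # equivalent
print('按c1列的值的升序排列\n',df); print('===========')
# sort by the values of row 2019-04-30, descending
# (pd.datetime is no longer available in current pandas releases; pd.Timestamp builds the same row label)
df = df.sort_values(by=pd.Timestamp(2019,4,30),axis=1,ascending=False)
# df.sort_values(by=pd.Timestamp(2019,4,30),axis=1,ascending=False,inplace=True) # equivalent
print('按2019-04-30行的值的降序排列\n',df)
Output: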
df
             c1  c2  c3  c4
2019-01-31  38  13  73  10
2019-02-28  76   6  80  65
2019-03-31  17   2  77  72
2019-04-30   7  26  51  21
===========
按c1列的值的升序排列
             c1  c2  c3  c4
2019-04-30   7  26  51  21
2019-03-31  17   2  77  72
2019-01-31  38  13  73  10
2019-02-28  76   6  80  65
===========
按2019-04-30行的值的降序排列
             c3  c2  c4  c1
2019-04-30  51  26  21   7
2019-03-31  77   2  72  17
2019-01-31  73  13  10  38
2019-02-28  80   6  65  76

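（6）Sorting by the labels of a DataFrame: df.sort_index()

Syntax: df.sort_index(axis=0, ascending=True, inplace=False)

Parameters:

axis: default 0, sort by row labels; axis=1 sorts by column labels
ascending: default True, ascending order; ascending=False sorts in descending order
inplace: default False, df itself is not modified; inplace=True modifies df directly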

import numpy as np
import pandas as pd
np.random.seed(1)
arr = np.random.randint(1,100,(4,4))
df = pd.DataFrame(arr,columns=['c1','c2','c3','c4'])
df.index = pd.date_range('2019-1-31', periods=4, freq='M')
print('df\n',df); print('===========')
# sort by row labels, descending
df = df.sort_index(ascending=False)
# df.sort_index(ascending=False,inplace=True)   # equivalent
print('按行标签降序排列\n',df); print('===========')
# sort by column labels, descending
df = df.sort_index(axis=1, ascending=False)
# df.sort_index(axis=1, ascending=False, inplace=True)  # equivalent
print('按列标签降序排列\n',df)
Output:
df
             c1  c2  c3  c4
2019-01-31  38  13  73  10
2019-02-28  76   6  80  65
2019-03-31  17   2  77  72
2019-04-30   7  26  51  21
===========
按行标签降序排列
             c1  c2  c3  c4
2019-04-30   7  26  51  21
2019-03-31  17   2  77  72
2019-02-28  76   6  80  65
2019-01-31  38  13  73  10
===========
按列标签降序排列
             c4  c3  c2  c1
2019-04-30  21  51  26   7
2019-03-31  72  77   2  17
2019-02-28  65  80   6  76
2019-01-31  10  73  13  38

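（7）Setting custom column and row labels: df.columns and df.index

Both attributes can be reassigned as a whole, but their individual elements cannot be modified (an error is raised); see the section on DataFrame attributes in this chapter

（8）Using the values of a column as the row labels: df.set_index()

Syntax: df.set_index('column_label', inplace=False)

Turns the values of the chosen column into the row labels. The column is moved ("cut") to the left to serve as the index rather than copied, so it no longer appears among the columns

Parameter inplace: default False (df is not replaced, and the result must be stored in a new variable); with True, df is modified directly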

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,'a',3+4j],[2,'b',5+6j]],columns=['c1','c2','c3'],index=['t1','t2'])
print(df); print('===========')
df.set_index('c2', inplace = True)
print(df)
Output:
    c1 c2        c3
t1   1  a  3.0+4.0j
t2   2  b  5.0+6.0j
===========
    c1        c3
c2
a    1  3.0+4.0j
b    2  5.0+6.0j

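（9）Resetting the row index: df.reset_index()

Syntax: df.reset_index(inplace=False)

Removes the row labels and stores them as the first data column of the DataFrame (the column label is df.index.name, or 'index' when the index has no name); df.index is reset to the position index 0, 1, 2, ...

Parameter inplace: default False (df is not replaced, and the result must be stored in a new variable); with True, df is modified directly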

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,3,5,7],[2,4,6,8]],columns=['c1','c2','c3','c4'],index=['t1','t2'])
df.index.name = 'myindex'	# give the Index object df.index a name
print(df); print('===========')
df.reset_index(inplace=True)
print(df)
Output:
         c1  c2  c3  c4
myindex
t1        1   3   5   7
t2        2   4   6   8
===========
  myindex  c1  c2  c3  c4
0      t1   1   3   5   7
1      t2   2   4   6   8

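（10）Renaming column labels: df.rename()

Syntax: df.rename(columns={'old_label':'new_label',...}, inplace=False)

Renames column labels. The columns parameter only affects column labels; row labels are renamed through the separate index parameter of the same method. Keys that match no existing label are ignored.

Parameters:

columns: a dictionary of old column label / new column label pairs
inplace: default False (df is not replaced, and the result must be stored in a new variable); with True, df is modified directly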

import numpy as np
import pandas as pd
arr = np.array([[1,2,3,4],[5,6,7,8]])
df1 = pd.DataFrame(arr,columns=['c1','c2','c3','c4'],index=['t1','t2'])
print("查看df1\n",df1); print('===========')
df2 = df1.rename(columns={'c2':'哈','c5':'嘿','t2':'哼'})
print("查看df2\n",df2)
Output:
查看df1
     c1  c2  c3  c4
t1   1   2   3   4
t2   5   6   7   8
===========
查看df2
     c1  哈  c3  c4
t1   1  2   3   4
t2   5  6   7   8
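
In the example above the key 't2' is ignored because it is a row label passed through columns. Row labels can be renamed through the index parameter of the same method. A minimal sketch with hypothetical new labels:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['c1', 'c2'], index=['t1', 't2'])

# index= renames row labels, columns= renames column labels
renamed = df.rename(index={'t2': 'last'}, columns={'c1': 'first'})
print(renamed)
```

As with the columns-only call, the original df keeps its labels unless inplace=True is passed.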

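（11）Changing the data type of a column: astype()

Syntax: df['column_label'] = df['column_label'].astype('new_dtype')

astype() returns a new object and leaves the original unchanged, so the change only takes effect when the result is assigned back as shown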

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2.9],[3,4.9]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('==============')
print(df.dtypes); print('==============')
df['c1'] = df['c1'].astype('float')     	# convert column c1 to float
df['c2'] = df['c2'].astype('int')       	# convert column c2 to int (the fractional part is truncated)
print(df); print('==============')
print(df.dtypes)
Output:
    c1   c2
t1   1  2.9
t2   3  4.9
==============
c1      int64
c2    float64
dtype: object
==============
     c1  c2
t1  1.0   2
t2  3.0   4
==============
c1    float64
c2      int32
dtype: object

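（12）Removing duplicate rows: df.drop_duplicates()

Syntax: df.drop_duplicates(subset=None, keep='first', inplace=False)

Removes rows whose values are duplicated and returns a DataFrame

Parameters:

subset: default None, two rows count as duplicates only when all of their columns are equal; with subset='column_label' or subset=['column_label','column_label',...], two rows count as duplicates as soon as the given columns are equal
keep: which row of a duplicated group to keep. Default 'first' keeps the first row; keep='last' keeps the last row; keep=False keeps none of them
inplace: default False (df is not replaced, and the result must be stored in a new variable); with True, df is modified directly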

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3],[1,2,3],[1,2,3],[1,2,4]],columns=['c1','c2','c3'],index=['t1','t2','t3','t4'])
print(df); print('===========')
print(df.drop_duplicates()); print('===========')
print(df.drop_duplicates(keep='last')); print('===========')
print(df.drop_duplicates(keep=False)); print('===========')
print(df.drop_duplicates(subset=['c1','c2']))
Output:
    c1  c2  c3
t1   1   2   3
t2   1   2   3
t3   1   2   3
t4   1   2   4
===========
    c1  c2  c3
t1   1   2   3
t4   1   2   4
===========
    c1  c2  c3
t3   1   2   3
t4   1   2   4
===========
    c1  c2  c3
t4   1   2   4
===========
    c1  c2  c3
t1   1   2   3

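（13）Shifting the values by a number of rows (or columns): df.shift()

Syntax: df.shift(periods=1,axis=0)

Returns a DataFrame with the same row and column labels as df, whose values are shifted in one direction by a number of rows (or columns); the rows (or columns) vacated by the shift are filled with NaN

Parameters:

periods: number of rows or columns to shift by (int), default 1
axis: default 0, shift vertically; axis=1 shifts horizontally

Notes:

periods and axis together determine the direction of the shift:

	axis=0	axis=1
periods<0	shift up	shift left
periods=0	no shift	no shift
periods>0	shift down	shift right

Only the values of the DataFrame are shifted; the row and column labels stay in place
df/df.shift()-1 is often used to compute the daily returns of asset prices, as in the example below: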

import numpy as np
import pandas as pd
df = pd.DataFrame({'600001': [10,11,12,13,14],'600002': [20,21,22,23,24],'600003':[30,31,32,33,34]},index=['t1','t2','t3','t4','t5'])
print('df\n',df); print('-----------')
print('shift down by 1 row: df.shift()\n',df.shift()); print('-----------')
print('daily return: df/df.shift()-1\n',df/df.shift()-1); print('===========')
print('shift up by 1 row: df.shift(-1)\n',df.shift(-1)); print('-----------')
print('shift right by 1 column: df.shift(1,axis=1)\n',df.shift(1,axis=1)); print('-----------')
print('shift left by 1 column: df.shift(-1,axis=1)\n',df.shift(-1,axis=1))
Output:
df
     600001  600002  600003
t1      10      20      30
t2      11      21      31
t3      12      22      32
t4      13      23      33
t5      14      24      34
-----------
shift down by 1 row: df.shift()
     600001  600002  600003
t1     NaN     NaN     NaN
t2    10.0    20.0    30.0
t3    11.0    21.0    31.0
t4    12.0    22.0    32.0
t5    13.0    23.0    33.0
-----------
daily return: df/df.shift()-1
       600001    600002    600003
t1       NaN       NaN       NaN
t2  0.100000  0.050000  0.033333
t3  0.090909  0.047619  0.032258
t4  0.083333  0.045455  0.031250
t5  0.076923  0.043478  0.030303
===========
shift up by 1 row: df.shift(-1)
     600001  600002  600003
t1    11.0    21.0    31.0
t2    12.0    22.0    32.0
t3    13.0    23.0    33.0
t4    14.0    24.0    34.0
t5     NaN     NaN     NaN
-----------
shift right by 1 column: df.shift(1,axis=1)
     600001  600002  600003
t1     NaN    10.0    20.0
t2     NaN    11.0    21.0
t3     NaN    12.0    22.0
t4     NaN    13.0    23.0
t5     NaN    14.0    24.0
-----------
shift left by 1 column: df.shift(-1,axis=1)
     600001  600002  600003
t1    20.0    30.0     NaN
t2    21.0    31.0     NaN
t3    22.0    32.0     NaN
t4    23.0    33.0     NaN
t5    24.0    34.0     NaN

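（14）Converting a DataFrame to a dictionary: df.to_dict()

Syntax: df.to_dict(orient='dict')

Converts the DataFrame to a dictionary (with orient='records' the result is a list of dictionaries)

Parameter orient: the layout of the resulting dictionary. The default is 'dict'; other options are 'list', 'series', 'split', 'records', and 'index'. The effect of each option is shown in the example: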

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.to_dict(orient='dict')); print('===========')
print(df.to_dict(orient='list')); print('===========')
print(df.to_dict(orient='series')); print('===========')
print(df.to_dict(orient='split')); print('===========')
print(df.to_dict(orient='records')); print('===========')
print(df.to_dict(orient='index'))
Output:
    c1  c2
t1   1   2
t2   3   4
===========
{'c1': {'t1': 1, 't2': 3}, 'c2': {'t1': 2, 't2': 4}}
===========
{'c1': [1, 3], 'c2': [2, 4]}
===========
{'c1': t1    1
t2    3
Name: c1, dtype: int64, 'c2': t1    2
t2    4
Name: c2, dtype: int64}
===========
{'index': ['t1', 't2'], 'columns': ['c1', 'c2'], 'data': [[1, 2], [3, 4]]}
===========
[{'c1': 1, 'c2': 2}, {'c1': 3, 'c2': 4}]
===========
{'t1': {'c1': 1, 'c2': 2}, 't2': {'c1': 3, 'c2': 4}}

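（15）Building a pivot table from columns of a DataFrame: df.pivot()

Syntax: df.pivot(index='column_label', columns='column_label', values='column_label' or ['column_label','column_label'...])

Returns a DataFrame in pivot-table form

Parameters:

index: the column whose values become the row labels (the y axis) of the pivot table (str)
columns: the column whose values become the column labels (the x axis) of the pivot table (str)
values: the column or columns that supply the cell values of the pivot table (str or list)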

import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
print(df); print('===========')
df1 = df.pivot(index='foo', columns='bar', values='baz')
print(df1); print(type(df1)); print('===========')
df2 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
print(df2); print(type(df2))
Output:
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
===========
bar  A  B  C
foo
one  1  2  3
two  4  5  6
<class 'pandas.core.frame.DataFrame'>
===========
    baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t
<class 'pandas.core.frame.DataFrame'>
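
df.pivot() only reshapes, so each index/columns pair may occur at most once. When pairs repeat, pd.DataFrame.pivot_table() can aggregate them instead. The sketch below uses hypothetical data with a repeated ('one', 'A') pair and picks sum as the aggregation purely for illustration:

```python
import pandas as pd

# Hypothetical data: the pair ('one', 'A') occurs twice
df = pd.DataFrame({'foo': ['one', 'one', 'two', 'two'],
                   'bar': ['A', 'A', 'A', 'B'],
                   'baz': [1, 2, 3, 4]})

try:
    df.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError:
    # pivot() refuses to reshape when an index/columns pair is duplicated
    pivot_failed = True

# pivot_table() aggregates the duplicated pair (here: 1 + 2)
table = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')
print(pivot_failed)
print(table)
```

The combination ('one', 'B') does not occur in the data, so that cell of table is NaN.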

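（16）Reshaping a hierarchically indexed DataFrame (converting between row labels and column labels): df.unstack() and df.stack()

See section 13 of this chapter (hierarchical indexing), item （3）, on reshaping hierarchically indexed Series and DataFrames with unstack() and stack()

9. Handling missing values (NaN) in a DataFrame

（1）Entering NaN by hand

Use numpy.nan
Put None in a list as one of its elements and build the DataFrame from that list; the None at that position then becomes NaN automatically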
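
Both ways of entering a missing value can be checked with a small sketch (the frame below is hypothetical):

```python
import numpy as np
import pandas as pd

# One missing value entered as None, the other as np.nan
df = pd.DataFrame([[1, None], [np.nan, 4]], columns=['c1', 'c2'])

print(df)
print(df.isna())  # both positions are reported as missing
```

Either spelling ends up as a missing value that the NaN-handling methods below treat in the same way.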

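（2）df.isnull(), df.isna(), df.notna()

Syntax 1: df.isnull(), df.isna()

Tests whether each element of df is NaN (True where it is NaN) and returns a DataFrame of booleans with the same structure as df

Syntax 2: df.notna()

Tests whether each element of df is not NaN (True where it is not NaN) and returns a DataFrame of booleans with the same structure as df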

import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.isnull()); print('===========')
print(df.isna()); print('===========')
print(df.notna())
Output:
     c1  c2
t1  NaN   2
t2  3.0   4
===========
       c1     c2
t1   True  False
t2  False  False
===========
       c1     c2
t1   True  False
t2  False  False
===========
       c1    c2
t1  False  True
t2   True  True

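（3）df.dropna()

Syntax: df.dropna(axis=0, how='any', inplace=False)

Removes the rows (or columns) of df that contain NaN

Parameters:

axis: default 0, remove rows; axis=1 removes columns
how: default 'any', a row (or column) is removed as soon as one of its elements is NaN; with how='all', a row (or column) is removed only when all of its elements are NaN
inplace: default False, df itself is not modified and the result must be stored in another variable; with inplace=True, df is modified directly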

import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
df.dropna(inplace=True)
print(df)
Output:
     c1  c2
t1  NaN   2
t2  3.0   4
===========
     c1  c2
t2  3.0   4
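
The example above only exercises the defaults. The following sketch, on a hypothetical frame with one all-NaN row and one all-NaN column, shows how how and axis change the result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan],
                   [3, 4, np.nan],
                   [np.nan, np.nan, np.nan]],
                  columns=['c1', 'c2', 'c3'], index=['t1', 't2', 't3'])

rows_all = df.dropna(how='all')          # removes only the all-NaN row t3
cols_all = df.dropna(axis=1, how='all')  # removes only the all-NaN column c3
rows_any = df.dropna()                   # every row contains a NaN, so nothing is left

print(rows_all)
print(cols_all)
print(rows_any)
```

With the default how='any' this frame becomes empty, which is a common surprise when a single column is entirely missing.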

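（4）df.fillna(), df.ffill(), df.bfill()

Replace the NaN values in df according to a rule

Syntax: df.fillna(value=None, method=None, axis=0, inplace=False). Its relation to df.ffill(axis=) and df.bfill(axis=) is shown in the table under "Notes" below. The method parameter has been deprecated since pandas 2.1, so df.ffill() and df.bfill() are the preferred spelling on current versions

Parameters:

value can take three forms:
- a fixed value: every NaN in df is replaced by this value
- a dictionary whose keys are column labels of df and whose values are the replacements for NaN in that column (this allows a different replacement for each column)
- a Series or a DataFrame: each NaN is filled with the element at the corresponding position
method: the fill method, default None; method='ffill' or method='pad' fills forward and is equivalent to df.ffill(axis=); method='bfill' fills backward and is equivalent to df.bfill(axis=)
axis: direction of the fill, default 0 (vertical); axis=1 is horizontal
inplace: default False, df itself is not modified and the result must be stored in another variable; with inplace=True, df is modified directly

Notes:

Exactly one of value and method must be given: they cannot both be None, and they cannot both be set, otherwise an error is raised

When method is used, method and axis together determine where the fill value is taken from:

	method='ffill'	method='bfill'
axis=0	fill NaN with the value of the cell above; equivalent to `df.ffill()`	fill NaN with the value of the cell below; equivalent to `df.bfill()`
axis=1	fill NaN with the value of the cell to the left; equivalent to `df.ffill(axis=1)`	fill NaN with the value of the cell to the right; equivalent to `df.bfill(axis=1)`

When the cell that would supply the value is itself NaN, the search continues past it in the same direction. If every cell up to the edge of the DataFrame is NaN, nothing is filled and those NaN values are kept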

import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,101,102,np.nan],
                   [110,np.nan,112,113],
                   [120,121,np.nan,123],
                   [np.nan,131,132,np.nan]],
                  columns=['c1','c2','c3','c4'],index=['t1','t2','t3','t4'])
df0 = df.fillna(0)
df1 = df.fillna({'c1':-1,'c4':-4})
df2 = df.fillna(method='ffill')
df3 = df.fillna(method='bfill')
df4 = df.fillna(method='ffill', axis=1)
df5 = df.fillna(method='bfill', axis=1)
df6 = df.ffill()
df7 = df.bfill()
df8 = df.ffill(axis=1)
df9 = df.bfill(axis=1)
print('df\n',df); print('===========')
print('df0(替换为0)\n',df0); print('===========')
print('df1(基于字典)\n',df1); print('===========')
print('df2(向上)\n',df2); print('===========')
print('df3(向下)\n',df3); print('===========')
print('df4(向左)\n',df4); print('===========')
print('df5(向右)\n',df5); print('===========')
print('df6(向上)\n',df6); print('===========')
print('df7(向下)\n',df7); print('===========')
print('df8(向左)\n',df8); print('===========')
print('df9(向右)\n',df9)
Output:
df
        c1     c2     c3     c4
t1    NaN  101.0  102.0    NaN
t2  110.0    NaN  112.0  113.0
t3  120.0  121.0    NaN  123.0
t4    NaN  131.0  132.0    NaN
===========
df0(替换为0)
        c1     c2     c3     c4
t1    0.0  101.0  102.0    0.0
t2  110.0    0.0  112.0  113.0
t3  120.0  121.0    0.0  123.0
t4    0.0  131.0  132.0    0.0
===========
df1(基于字典)
        c1     c2     c3     c4
t1   -1.0  101.0  102.0   -4.0
t2  110.0    NaN  112.0  113.0
t3  120.0  121.0    NaN  123.0
t4   -1.0  131.0  132.0   -4.0
===========
df2(向上)
        c1     c2     c3     c4
t1    NaN  101.0  102.0    NaN
t2  110.0  101.0  112.0  113.0
t3  120.0  121.0  112.0  123.0
t4  120.0  131.0  132.0  123.0
===========
df3(向下)
        c1     c2     c3     c4
t1  110.0  101.0  102.0  113.0
t2  110.0  121.0  112.0  113.0
t3  120.0  121.0  132.0  123.0
t4    NaN  131.0  132.0    NaN
===========
df4(向左)
        c1     c2     c3     c4
t1    NaN  101.0  102.0  102.0
t2  110.0  110.0  112.0  113.0
t3  120.0  121.0  121.0  123.0
t4    NaN  131.0  132.0  132.0
===========
df5(向右)
        c1     c2     c3     c4
t1  101.0  101.0  102.0    NaN
t2  110.0  112.0  112.0  113.0
t3  120.0  121.0  123.0  123.0
t4  131.0  131.0  132.0    NaN
===========
df6(向上)
        c1     c2     c3     c4
t1    NaN  101.0  102.0    NaN
t2  110.0  101.0  112.0  113.0
t3  120.0  121.0  112.0  123.0
t4  120.0  131.0  132.0  123.0
===========
df7(向下)
        c1     c2     c3     c4
t1  110.0  101.0  102.0  113.0
t2  110.0  121.0  112.0  113.0
t3  120.0  121.0  132.0  123.0
t4    NaN  131.0  132.0    NaN
===========
df8(向左)
        c1     c2     c3     c4
t1    NaN  101.0  102.0  102.0
t2  110.0  110.0  112.0  113.0
t3  120.0  121.0  121.0  123.0
t4    NaN  131.0  132.0  132.0
===========
df9(向右)
        c1     c2     c3     c4
t1  101.0  101.0  102.0    NaN
t2  110.0  112.0  112.0  113.0
t3  120.0  121.0  123.0  123.0
t4  131.0  131.0  132.0    NaN

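（5）df.interpolate()

Syntax: df.interpolate(method='linear', axis=0, inplace=False)

Replaces the NaN values in df by interpolation

Parameters:

method: the interpolation method, default 'linear' (linear interpolation); other methods are available
axis: the interpolation direction, default 0 (interpolate down each column); axis=1 interpolates across each row
inplace: default False, df itself is not modified and the result must be stored in another variable; with inplace=True, df is modified directly

With linear interpolation, NaN values in the boundary rows (or boundary columns) are handled as follows:

With axis=0, NaN in the top boundary row is left unchanged, and NaN in the bottom boundary row is replaced by the value above it
With axis=1, NaN in the left boundary column is left unchanged, and NaN in the right boundary column is replaced by the value to its left

For interpolation when converting a time series from low to high frequency (upsampling), see "14. Time-related formats and methods in pandas - （9）df.resample() - Upsampling from low to high frequency by linear interpolation" in this chapter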

import numpy as np
import pandas as pd
df = pd.DataFrame([[10,np.nan,30],[np.nan,50,np.nan],[70,np.nan,90]],columns=['c1','c2','c3'],index=['t1','t2','t3'])
df1 = df.interpolate()
df2 = df.interpolate(axis=1)
print('df\n',df); print('===========')
print('df1\n',df1); print('===========')
print('df2\n',df2)
Output:
df
       c1    c2    c3
t1  10.0   NaN  30.0
t2   NaN  50.0   NaN
t3  70.0   NaN  90.0
===========
df1
       c1    c2    c3
t1  10.0   NaN  30.0
t2  40.0  50.0  60.0
t3  70.0  50.0  90.0
===========
df2
       c1    c2    c3
t1  10.0  20.0  30.0
t2   NaN  50.0  50.0
t3  70.0  80.0  90.0
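
The boundary behaviour of linear interpolation is easiest to see on a single Series. The following is a minimal sketch: the sample values and variable names are made up for illustration, and limit_direction is an existing parameter of interpolate() that the notes above do not discuss.

```python
import numpy as np
import pandas as pd

# Illustrative data: NaN at both ends and in the middle
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default linear interpolation only works forward:
# the leading NaN stays, the trailing NaN repeats the last valid value
forward = s.interpolate()

# limit_direction='both' also fills the leading NaN with the first valid value
both = s.interpolate(limit_direction='both')

print(forward.tolist())
print(both.tolist())
```

If the leading NaN must also be filled, limit_direction='both' is one option; whether repeating the nearest value is appropriate depends on the data.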

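10. DataFrame methods and Pandas module functions

(0) Unary universal functions

Because Pandas is built on NumPy, most NumPy universal functions (ufuncs) can be applied to df directly, for example:

Square root: np.sqrt(df)

Rounding to n decimals: df.round(n), np.round(df, n) (exact ties are rounded to the nearest even digit, not always up)

(1) More unary element-wise methods

① Test whether each element of df is in a given list

Syntax: df.isin(values), where values is a list

Returns a DataFrame of booleans with the same shape as df.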

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]],columns=['c1','c2'],index=['t1','t2'])
print(df.isin([2,3,5,7,11,13,17]))
Output:
       c1     c2
t1  False   True
t2   True  False

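② Percentage change along the rows

Syntax: df.pct_change(periods=1)

Returns a DataFrame in which every entry is the relative change, expressed as a fraction, of the corresponding entry of df:

\[\text{result}_{m,n}=\frac{\text{df}_{m,n}}{\text{df}_{m-\text{periods},\,n}}-1
\]

Here result_{m,n} is the value in row m, column n of the returned DataFrame, and df_{m,n} is the value in row m, column n of df.

periods is the number of rows between the numerator and the denominator; it defaults to 1. With periods>0, each cell is divided by the cell periods rows above it; with periods<0, each cell is divided by the cell |periods| rows below it. In both cases 1 is then subtracted from the ratio, and rows with no such counterpart are NaN.

df.pct_change() is commonly used to compute the daily returns of asset prices.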

import numpy as np
import pandas as pd
df = pd.DataFrame({'600001': [10,11,12,13,14],'600002': [20,21,22,23,24],'600003':[30,31,32,33,34]},index=['t1','t2','t3','t4','t5'])
print('df\n',df); print('===========')
print(df.pct_change()); print('===========')	# the usual way to compute returns
print(df.pct_change(-1))
Output:
df
     600001  600002  600003
t1      10      20      30
t2      11      21      31
t3      12      22      32
t4      13      23      33
t5      14      24      34
===========
      600001    600002    600003
t1       NaN       NaN       NaN
t2  0.100000  0.050000  0.033333
t3  0.090909  0.047619  0.032258
t4  0.083333  0.045455  0.031250
t5  0.076923  0.043478  0.030303
===========
      600001    600002    600003
t1 -0.090909 -0.047619 -0.032258
t2 -0.083333 -0.045455 -0.031250
t3 -0.076923 -0.043478 -0.030303
t4 -0.071429 -0.041667 -0.029412
t5       NaN       NaN       NaN
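
To make the definition concrete, the sketch below recomputes the percentage change by hand with shift() and compares it with pct_change(). The price table and the names manual and builtin are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative prices without missing values
prices = pd.DataFrame({'A': [10.0, 11.0, 12.0], 'B': [20.0, 21.0, 22.0]},
                      index=['t1', 't2', 't3'])

periods = 1
# shift(periods) moves every row down by `periods`, so each cell lines up
# with the cell `periods` rows above it
manual = prices / prices.shift(periods) - 1
builtin = prices.pct_change(periods)

# The two tables agree, including the NaN in the first row
print(np.allclose(manual, builtin, equal_nan=True))
```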

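(2) Binary universal functions

Because Pandas is built on NumPy, most NumPy binary universal functions also work on df, for example np.add(df1, df2) or np.maximum(df1, df2).

(3) Statistical methods

① df.min() and df.idxmin()

Syntax: df.min(axis=0)

Computes the minimum of each column (or each row) of the DataFrame and returns a Series.

Parameter axis:

Defaults to 0, which takes the minimum of each column. For an m-by-n DataFrame, the returned Series has length n; its index is the column labels of the DataFrame and its values are the column minimums.
With axis=1, takes the minimum of each row. For an m-by-n DataFrame, the returned Series has length m; its index is the row labels of the DataFrame and its values are the row minimums.

Syntax: df.idxmin(axis=0)

Finds the row label (or column label) at which the minimum of each column (or each row) occurs and returns a Series.

Parameter axis: defaults to 0, which searches each column and returns row labels; with axis=1, searches each row and returns column labels.

Note: use idxmin() rather than argmin() to get labels. DataFrame has no argmin() method; Series.argmin() used to be a deprecated alias of Series.idxmin(), and since pandas 1.0 it returns the integer position of the minimum instead of its label.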

import numpy as np
import pandas as pd
arr = np.array([[1,2],[4,3]])
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.min()); print('-----------')       # minimum of each column
print(df.min(axis=1)); print('===========') # minimum of each row
print(df.idxmin()); print('-----------')    # row label of the minimum of each column
print(df.idxmin(axis=1))                    # column label of the minimum of each row
Output:
    c1  c2
t1   1   2
t2   4   3
===========
c1    1
c2    2
dtype: int32
-----------
t1    1
t2    3
dtype: int32
===========
c1    t1
c2    t1
dtype: object
-----------
t1    c1
t2    c2
dtype: object

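② df.max() and df.idxmax()

Syntax: df.max(axis=0)

Computes the maximum of each column (or each row) of the DataFrame and returns a Series.

Parameter axis:

Defaults to 0, which takes the maximum of each column. For an m-by-n DataFrame, the returned Series has length n; its index is the column labels of the DataFrame and its values are the column maximums.
With axis=1, takes the maximum of each row. For an m-by-n DataFrame, the returned Series has length m; its index is the row labels of the DataFrame and its values are the row maximums.

Syntax: df.idxmax(axis=0)

Finds the row label (or column label) at which the maximum of each column (or each row) occurs and returns a Series.

Parameter axis: defaults to 0, which searches each column and returns row labels; with axis=1, searches each row and returns column labels.

Note: use idxmax() rather than argmax() to get labels. DataFrame has no argmax() method; Series.argmax() used to be a deprecated alias of Series.idxmax(), and since pandas 1.0 it returns the integer position of the maximum instead of its label.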

import numpy as np
import pandas as pd
arr = np.array([[1,2],[4,3]])
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.max()); print('-----------')       # maximum of each column
print(df.max(axis=1)); print('===========') # maximum of each row
print(df.idxmax()); print('-----------')    # row label of the maximum of each column
print(df.idxmax(axis=1))                    # column label of the maximum of each row
Output:
    c1  c2
t1   1   2
t2   4   3
===========
c1    4
c2    3
dtype: int32
-----------
t1    2
t2    4
dtype: int32
===========
c1    t2
c2    t2
dtype: object
-----------
t1    c2
t2    c1
dtype: object

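③ df.sum()

Syntax: df.sum(axis=0)

Sums the data of the DataFrame by column (or by row) and returns a Series.

Parameter axis:

Defaults to 0, which sums each column. For an m-by-n DataFrame, the returned Series has length n; its index is the column labels of the DataFrame and its values are the column sums.
With axis=1, sums each row. For an m-by-n DataFrame, the returned Series has length m; its index is the row labels of the DataFrame and its values are the row sums.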

import numpy as np
import pandas as pd
arr = np.array([[1,2],[3,4]])
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.sum(),type(df.sum())); print('===========')    # sums each column by default
print(df.sum(axis=1))                                   # sums each row
Output:
    c1  c2
t1   1   2
t2   3   4
===========
c1    4
c2    6
dtype: int64 <class 'pandas.core.series.Series'>
===========
t1    3
t2    7
dtype: int64

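④ df.mean()

Syntax: df.mean(axis=0)

Computes the arithmetic mean of the data of the DataFrame by column (or by row) and returns a Series.

Parameter axis:

Defaults to 0, which averages each column. For an m-by-n DataFrame, the returned Series has length n; its index is the column labels of the DataFrame and its values are the column means.
With axis=1, averages each row. For an m-by-n DataFrame, the returned Series has length m; its index is the row labels of the DataFrame and its values are the row means.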

import numpy as np
import pandas as pd
arr = np.array([[1,2],[3,4]])
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.mean(),type(df.mean())); print('===========')    # column means by default
print(df.mean(axis=1),type(df.mean(axis=1)))              # row means
Output:
    c1  c2
t1   1   2
t2   3   4
===========
c1    2.0
c2    3.0
dtype: float64 <class 'pandas.core.series.Series'>
===========
t1    1.5
t2    3.5
dtype: float64 <class 'pandas.core.series.Series'>

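⑤ df.count()

Syntax: df.count(axis=0)

Counts the non-missing values of the DataFrame by column (or by row) and returns a Series.

Parameter axis:

Defaults to 0, which counts each column. For an m-by-n DataFrame, the returned Series has length n; its index is the column labels of the DataFrame and its values are the numbers of non-missing values per column.
With axis=1, counts each row. For an m-by-n DataFrame, the returned Series has length m; its index is the row labels of the DataFrame and its values are the numbers of non-missing values per row.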

import numpy as np
import pandas as pd
arr = np.array([[np.nan,2],[3,4]])                          # np.nan is how to enter a NaN by hand
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2'])
print(df); print('===========')
print(df.count(),type(df.count())); print('===========')    # counts each column by default
print(df.count(axis=1))                                     # counts each row
Output:
     c1   c2
t1  NaN  2.0
t2  3.0  4.0
===========
c1    1
c2    2
dtype: int64 <class 'pandas.core.series.Series'>
===========
t1    1
t2    2
dtype: int64

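⑥ Cumulative calculations: df.cumsum(), df.cumprod(), df.cummax(), df.cummin()

Syntax: df.cumXXX(axis=0)

Computes the cumulative sum, cumulative product, cumulative maximum or cumulative minimum of the DataFrame along each column (or each row) and returns a DataFrame of the same shape.

Parameter axis: defaults to 0, which accumulates down each column; with axis=1, accumulates across each row.

Notes:

The cumulative maximum is the largest value from the first entry up to the current one; the cumulative minimum is defined the same way.
numpy.ndarray has no cummax() or cummin() method; the NumPy equivalents are np.maximum.accumulate() and np.minimum.accumulate().
The cumulative maximum is often used to compute the maximum drawdown.
Unlike np.cumsum() called without an axis argument, the DataFrame methods never flatten multi-dimensional data to one dimension before accumulating.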

import numpy as np
import pandas as pd
df = pd.DataFrame({'c1':[1,-2,3,4,-5,6],'c2':[10,20,-30,-40,50,-60]},index=['t1','t2','t3','t4','t5','t6'])
df1 = df.cumsum()
df2 = df.cumsum(axis=1)
df3 = df.cumprod()
df4 = df.cummax()
df5 = df.cummin()
print('df\n', df); print('===========')
print('df1\n', df1); print('===========')
print('df2\n', df2); print('===========')
print('df3\n', df3); print('===========')
print('df4\n', df4); print('===========')
print('df5\n', df5)
Output:
df
     c1  c2
t1   1  10
t2  -2  20
t3   3 -30
t4   4 -40
t5  -5  50
t6   6 -60
===========
df1
     c1  c2
t1   1  10
t2  -1  30
t3   2   0
t4   6 -40
t5   1  10
t6   7 -50
===========
df2
     c1  c2
t1   1  11
t2  -2  18
t3   3 -27
t4   4 -36
t5  -5  45
t6   6 -54
===========
df3
      c1         c2
t1    1         10
t2   -2        200
t3   -6      -6000
t4  -24     240000
t5  120   12000000
t6  720 -720000000
===========
df4
     c1  c2
t1   1  10
t2   1  20
t3   3  20
t4   4  20
t5   4  50
t6   6  50
===========
df5
     c1  c2
t1   1  10
t2  -2  10
t3  -2 -30
t4  -2 -40
t5  -5 -40
t6  -5 -60
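
Since the cumulative maximum is mentioned in connection with maximum drawdown, the sketch below shows one common way to put the two together. The price series is hypothetical, and drawdown is expressed here as a negative fraction of the running peak, which is only one of several conventions.

```python
import pandas as pd

# Hypothetical price series, for illustration only
prices = pd.Series([100.0, 120.0, 90.0, 110.0, 80.0, 130.0])

running_max = prices.cummax()          # highest price seen so far
drawdown = prices / running_max - 1    # fall from that peak, as a fraction
max_drawdown = drawdown.min()          # most negative value

print(drawdown.round(4).tolist())
print(round(max_drawdown, 4))
```

In this sample the largest fall is from the peak of 120 to 80, that is one third of the peak value.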

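⑦ df.corr(): correlation matrix

Syntax: df.corr()

Returns the matrix of pairwise correlation coefficients between the columns of df, as a DataFrame. The default method is the Pearson correlation.

Note: the correlation between two Series can also be computed directly with the corr() method of Series, as in s1.corr(s2).

# Correlation matrix of the SSE Composite, SZSE Component and CSI 300 indices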
import numpy as np
import pandas as pd
import tushare as ts
df_close = pd.DataFrame({
    'sh':ts.get_k_data('sh', start='2019-01-01',end='2019-06-30')['close'],
    'sz':ts.get_k_data('sz', start='2019-01-01',end='2019-06-30')['close'],
    'hs300':ts.get_k_data('hs300', start='2019-01-01',end='2019-06-30')['close'],
})
df_return = df_close.pct_change().fillna(0)
df_corr = df_return.corr()
print(df_return.head()); print('===========')
print(df_corr); print(type(df_corr))
Output:
         sh        sz     hs300
0  0.000000  0.000000  0.000000
1 -0.000377 -0.008369 -0.001583
2  0.020496  0.027562  0.023957
3  0.007245  0.015836  0.006071
4 -0.002617 -0.001155 -0.002161
===========
             sh        sz     hs300
sh     1.000000  0.953529  0.976994
sz     0.953529  1.000000  0.941665
hs300  0.976994  0.941665  1.000000
<class 'pandas.core.frame.DataFrame'>
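
The example above needs tushare and a network connection. As an offline alternative, here is a small sketch on synthetic data whose correlations are known in advance; the column names x, up and down are invented for the illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
df = pd.DataFrame({
    'x': x,
    'up': 2 * x + 1,     # perfectly positively related to x
    'down': -x,          # perfectly negatively related to x
})

corr = df.corr()                   # pairwise Pearson correlation of the columns
pair = df['x'].corr(df['down'])    # the same number via Series.corr()

print(corr)
print(pair)
```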

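⑧ df.describe()

Syntax: df.describe(percentiles=None)

Computes descriptive statistics for each column of the DataFrame (count, mean, std, min, the requested percentiles, and max) and returns a DataFrame.

Parameter percentiles: the percentiles to report, given as a list of floats between 0 and 1. The default, None, reports 25%, 50% and 75%. As the second output below shows, the median (50%) is reported even when it is not in the list.

Note: this method has no axis parameter, so it cannot summarize by row directly; to summarize rows, transpose first with df.T.describe().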

import numpy as np
import pandas as pd
arr = np.array([[np.nan,2],[3,4],[5,6]])
df = pd.DataFrame(arr,columns=['c1','c2'],index=['t1','t2','t3'])
print(df); print('===========')
print(df.describe(),'\n',type(df.describe())); print('===========')
print(df.describe(percentiles=[0.05,0.95]),'\n',type(df.describe(percentiles=[0.05,0.95])))
Output:
     c1   c2
t1  NaN  2.0
t2  3.0  4.0
t3  5.0  6.0
===========
             c1   c2
count  2.000000  3.0
mean   4.000000  4.0
std    1.414214  2.0
min    3.000000  2.0
25%    3.500000  3.0
50%    4.000000  4.0
75%    4.500000  5.0
max    5.000000  6.0
 <class 'pandas.core.frame.DataFrame'>
===========
             c1   c2
count  2.000000  3.0
mean   4.000000  4.0
std    1.414214  2.0
min    3.000000  2.0
5%     3.100000  2.2
50%    4.000000  4.0
95%    4.900000  5.8
max    5.000000  6.0
 <class 'pandas.core.frame.DataFrame'>

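⑨ df.resample(): resampling

See "14. Time-related formats and methods in Pandas - (9) df.resample() - Converting a low frequency to a high frequency (upsampling): via linear interpolation" in this chapter.

For df.interpolate(), the interpolation method that fills missing values in a DataFrame, see "9. Handling missing values (NaN) in a DataFrame - (5) df.interpolate()" in this chapter.

⑩ df.rolling(): rolling time window

See "14. Time-related formats and methods in Pandas - (10) df.rolling(): rolling time window" in this chapter.

(4) Saving a DataFrame to a local file

There are two families of methods:

The df.to_xxx() family of methods
HDF5-based storage

For details, see "AQF Notes - Part 2 - Chapter 7 - Processing financial data sources - II. Storing financial data".

(5) Other important methods

① df.apply()

Syntax: df.apply(func, axis=0)

Passes df to func one column (or one row) at a time, each as a Series, and calls func() on it. When func() returns a scalar, the return values are collected into a Series, which becomes the return value of df.apply(); when func() returns a Series, the result is a DataFrame instead.

Parameters:

func: the name of a defined function, or an anonymous function (lambda)
axis: defaults to 0, which passes one column at a time; axis=1 passes one row at a time

The group object group_obj created when grouping and aggregating a DataFrame has a similar apply() method (one difference is that group_obj.apply(func) takes no axis parameter). For how it works and how to use it, see "11. Grouping and aggregating a DataFrame - (3) Using the group object group_obj - ⑥ Implementing custom logic with group_obj.apply()" in this chapter.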

import numpy as np
import pandas as pd
# Read the data from a CSV file, process it, and keep the 'code', 'name' and 'roe' columns of the first 5 rows
data = pd.read_csv('2019Q1.csv')
data = data.sort_values('code').reset_index()
data = data[['code','name','roe']].head(5)
print(data); print('===========')
# Function that classifies stocks by ROE ('高成长' = high growth, '低成长' = low growth, '亏损' = loss)
def map_func(x):
    print('----map_func内部开始----')
    print(x)
    print(type(x))
    print('----map_func内部结束----')
    if x['roe'] > 4:
        return '高成长'
    elif x['roe'] >= 0:
        return '低成长'
    elif x['roe'] < 0:
        return '亏损'
# Run data.apply(); axis=1 passes each row to map_func() as a Series
result = data.apply(map_func, axis=1)
print('===========')
print(result)
print(type(result))
print('===========')
# Derive the growth category ('成长性') from the ROE data and add it to data as a new column
data['成长性'] = result    # same as data['成长性'] = data.apply(map_func, axis=1)
print(data)
Output:
   code  name   roe
0     1  平安银行  2.96
1     2   万科A  0.71
2     4  国农科技  4.66
3     5  世纪星源 -1.20
4     6  深振业A  1.75
===========
----map_func内部开始----
code       1
name    平安银行
roe     2.96
Name: 0, dtype: object
<class 'pandas.core.series.Series'>
----map_func内部结束----
----map_func内部开始----
code       2
name     万科A
roe     0.71
Name: 1, dtype: object
<class 'pandas.core.series.Series'>
----map_func内部结束----
----map_func内部开始----
code       4
name    国农科技
roe     4.66
Name: 2, dtype: object
<class 'pandas.core.series.Series'>
----map_func内部结束----
----map_func内部开始----
code       5
name    世纪星源
roe     -1.2
Name: 3, dtype: object
<class 'pandas.core.series.Series'>
----map_func内部结束----
----map_func内部开始----
code       6
name    深振业A
roe     1.75
Name: 4, dtype: object
<class 'pandas.core.series.Series'>
----map_func内部结束----
===========
0    低成长
1    低成长
2    高成长
3     亏损
4    低成长
dtype: object
<class 'pandas.core.series.Series'>
===========
   code  name   roe  成长性
0     1  平安银行  2.96  低成长
1     2   万科A  0.71  低成长
2     4  国农科技  4.66  高成长
3     5  世纪星源 -1.20   亏损
4     6  深振业A  1.75  低成长
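
Row-wise apply() calls a Python function once per row, which can be slow on large tables. As a hedged alternative (not the method used in these notes), the same three-way classification can be written with np.select. The ROE values below are typed in by hand from the output above, and the English labels and the 'unknown' default are choices made for this sketch.

```python
import numpy as np
import pandas as pd

# Same ROE values as in the example above, typed in by hand
data = pd.DataFrame({'roe': [2.96, 0.71, 4.66, -1.20, 1.75]})

conditions = [data['roe'] > 4, data['roe'] >= 0, data['roe'] < 0]
labels = ['high growth', 'low growth', 'loss']

# np.select picks the label of the first condition that is True in each row;
# rows matching none of them (for example NaN) get the default
data['growth'] = np.select(conditions, labels, default='unknown')

print(data)
```

The explicit default also covers a missing ROE, a case in which the row-wise function above would return None.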

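② df.applymap()

Syntax: df.applymap(func)

Passes each element of df to func individually and calls func() on it, then assembles the return values into a new DataFrame with the same shape, which is the return value of df.applymap().

Parameter: func: the name of a defined function, or an anonymous function (lambda)

Note: since pandas 2.1, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.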

import numpy as np
import pandas as pd
df1 = pd.DataFrame([[10,20],[30,40]],columns=['c1','c2'],index=['t1','t2'])
df2 = df1.applymap(lambda x:x+1)
print(df1); print('===========')
print(df2)
Output:
    c1  c2
t1  10  20
t2  30  40
===========
    c1  c2
t1  11  21
t2  31  41
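
Because pandas 2.1 introduced DataFrame.map() as the replacement for applymap(), code that has to run on several pandas versions can pick whichever method is available. This is a small sketch under that assumption; the helper name elementwise is made up.

```python
import pandas as pd

df = pd.DataFrame([[10, 20], [30, 40]], columns=['c1', 'c2'], index=['t1', 't2'])

# Use DataFrame.map where it exists (pandas 2.1 or later), otherwise applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
result = elementwise(lambda x: x + 1)

print(result)
```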

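Summary of apply(), applymap() and map():

	apply()	applymap()	map()
Python built-in function	NA	NA	iterates over every element
Series method	iterates over every element	NA	iterates over every element
DataFrame method	iterates over rows or columns	iterates over every element	NA before pandas 2.1; since then DataFrame.map() iterates over every element

③ df.iterrows()

Syntax: df.iterrows()

Returns a generator that yields one tuple per row of df. Item 0 of the tuple is the row's index label, and item 1 is a Series holding the row's data (its values are the values of that row, and its index is the column index of df). Each index uses labels where labels are defined and positions otherwise. The two items are usually unpacked into two variables.

df.iterrows() is typically used to loop over the rows of df.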

import numpy as np
import pandas as pd
df = pd.DataFrame([[10,20,30],[40,50,60],[70,80,90]],columns=['c1','c2','c3'],index=['t1','t2','t3'])
print(df); print('===========')
print(df.iterrows()); print('===========')
for i,j in df.iterrows():
    print(i); print('-----------')
    print(type(i)); print('-----------')
    print(j); print('-----------')
    print(type(j)); print('===========')
Output:
    c1  c2  c3
t1  10  20  30
t2  40  50  60
t3  70  80  90
===========
<generator object DataFrame.iterrows at 0x000000001431ECA8>
===========
t1
-----------
<class 'str'>
-----------
c1    10
c2    20
c3    30
Name: t1, dtype: int64
-----------
<class 'pandas.core.series.Series'>
===========
t2
-----------
<class 'str'>
-----------
c1    40
c2    50
c3    60
Name: t2, dtype: int64
-----------
<class 'pandas.core.series.Series'>
===========
t3
-----------
<class 'str'>
-----------
c1    70
c2    80
c3    90
Name: t3, dtype: int64
-----------
<class 'pandas.core.series.Series'>
===========
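
One thing to keep in mind with iterrows() is that each row is packed into a single Series, so columns of different types are converted to a common dtype. The sketch below illustrates this on made-up data and shows itertuples(), another DataFrame method, as a possible alternative.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]}, index=['t1', 't2'])

# iterrows() packs each row into one Series, so the int column is upcast to float
for label, row in df.iterrows():
    print(label, row['n'], type(row['n']))

# itertuples() yields one namedtuple per row and keeps each column's own type
for tup in df.itertuples():
    print(tup.Index, tup.n, type(tup.n))
```

Which of the two fits better depends on whether the row is needed as a Series or only its individual values.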

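④ df.all()

Syntax: df.all(axis=0)

Returns a Series of booleans.

Parameter axis:

Defaults to 0, which evaluates each column: the index of the Series is the column index of df, and an entry is True when every value in that column is truthy (for numbers, non-zero), otherwise False.
With axis=1, evaluates each row: the index of the Series is the row index of df, and an entry is True when every value in that row is truthy, otherwise False.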

import numpy as np
import pandas as pd
a = pd.DataFrame([[0,0],[1,1]],columns=['c1','c2'],index=['t1','t2'])
b = pd.DataFrame([[1,1],[1,1]],columns=['c1','c2'],index=['t1','t2'])
print(a); print('-----------')
print(a.all()); print('-----------')
print(a.all(axis=1)); print('===========')
print(b); print('-----------')
print(b.all()); print('-----------')
print(b.all(axis=1))
Output:
    c1  c2
t1   0   0
t2   1   1
-----------
c1    False
c2    False
dtype: bool
-----------
t1    False
t2     True
dtype: bool
===========
    c1  c2
t1   1   1
t2   1   1
-----------
c1    True
c2    True
dtype: bool
-----------
t1    True
t2    True
dtype: bool

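⑤ df.any()

Syntax: df.any(axis=0)

Returns a Series of booleans.

Parameter axis:

Defaults to 0, which evaluates each column: the index of the Series is the column index of df, and an entry is True when at least one value in that column is truthy (for numbers, non-zero), otherwise False.
With axis=1, evaluates each row: the index of the Series is the row index of df, and an entry is True when at least one value in that row is truthy, otherwise False.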

import numpy as np
import pandas as pd
a = pd.DataFrame([[0,0],[1,1]],columns=['c1','c2'],index=['t1','t2'])
b = pd.DataFrame([[0,0],[0,0]],columns=['c1','c2'],index=['t1','t2'])
print(a); print('-----------')
print(a.any()); print('-----------')
print(a.any(axis=1)); print('===========')
print(b); print('-----------')
print(b.any()); print('-----------')
print(b.any(axis=1))
Output:
    c1  c2
t1   0   0
t2   1   1
-----------
c1    True
c2    True
dtype: bool
-----------
t1    False
t2     True
dtype: bool
===========
    c1  c2
t1   0   0
t2   0   0
-----------
c1    False
c2    False
dtype: bool
-----------
t1    False
t2    False
dtype: bool
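
A frequent practical use of any() and all() is on the boolean tables returned by isna() and notna(). The following sketch, on made-up data, checks which columns contain a NaN and which rows are complete.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1.0, np.nan], 'c2': [3.0, 4.0]}, index=['t1', 't2'])

has_nan = df.isna().any()            # per column: is there at least one NaN?
complete = df.notna().all(axis=1)    # per row: are all values present?

print(has_nan)
print(complete)
```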

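⑥ len(df)

Returns the length of df as an int. It equals the number of rows, that is, item 0 of the tuple df.shape (shape is an attribute, not a method). Note that DataFrame has no len() method, so df.len() raises AttributeError.

⑦ df.head()

Syntax: df.head(n=5)

Returns the first n rows of df as a DataFrame; n defaults to 5. For sample code, see "6. Selecting data from a DataFrame - (5) Selecting data with the attributes and methods of df".

This method is mainly used to preview a DataFrame that has many rows.

⑧ df.tail()

Syntax: df.tail(n=5)

Returns the last n rows of df as a DataFrame; n defaults to 5. For sample code, see "6. Selecting data from a DataFrame - (5) Selecting data with the attributes and methods of df".

This method is mainly used to preview a DataFrame that has many rows.

⑨ df.info()

Prints a summary of df to the screen: the index range, the non-null count and dtype of each column, and the memory usage.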

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2],[3,4]], columns=['c1','c2'], index=['t1','t2'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, t1 to t2
Data columns (total 2 columns):
c1    2 non-null int64
c2    2 non-null int64
dtypes: int64(2)
memory usage: 48.0+ bytes

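⑩ df.duplicated()

Syntax: df.duplicated(subset=None, keep='first')

Tests whether each row of df duplicates another row and returns a Series of booleans.

Parameters:

subset: the columns to compare. The default, None, treats two rows as duplicates only when all of their columns are equal; with subset='column label' or subset=['column label','column label',...], two rows count as duplicates as soon as the listed columns are equal.
keep: which occurrences to mark, 'first' by default
- 'first': within a group of duplicate rows, the first is marked False and the rest True
- 'last': within a group of duplicate rows, the last is marked False and the rest True
- False: within a group of duplicate rows, every row is marked True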

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,3],[2,3],[2,3],[1,4],[2,4]])
df.columns = ['c1','c2']
df.index = ['t1','t2','t3','t4','t5']
print(df)
print('---\n',df.duplicated())                      # duplicates: all columns equal
print('---\n',df.duplicated(subset=['c1']))         # duplicates: column 'c1' equal
print('---\n',df.duplicated(subset=['c2']))         # duplicates: column 'c2' equal
print('---\n',df.duplicated(subset=['c1','c2']))    # duplicates: columns 'c1' and 'c2' both equal
Output:
    c1  c2
t1   1   3
t2   2   3
t3   2   3
t4   1   4
t5   2   4
---
 t1    False
t2    False
t3     True
t4    False
t5    False
dtype: bool
---
 t1    False
t2    False
t3     True
t4     True
t5     True
dtype: bool
---
 t1    False
t2     True
t3     True
t4    False
t5     True
dtype: bool
---
 t1    False
t2    False
t3     True
t4    False
t5    False
dtype: bool
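
The keep parameter and the link to drop_duplicates() (an existing DataFrame method not covered above) can be seen in a short sketch that reuses the same table as the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 3], [2, 3], [1, 4], [2, 4]],
                  columns=['c1', 'c2'], index=['t1', 't2', 't3', 't4', 't5'])

# keep=False flags every member of a group of duplicate rows
print(df.duplicated(keep=False).tolist())

# Boolean filtering with ~duplicated() gives the same rows as drop_duplicates()
unique_rows = df[~df.duplicated()]
print(unique_rows.equals(df.drop_duplicates()))
```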

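⑾ df.rank()

Syntax:

df.rank(axis=0,method='average',numeric_only=None,na_option='keep',ascending=True,pct=False)

Returns a DataFrame with the same shape as df, in which each value is the rank of the original value among all values of its column (or row).

Parameters:

axis: defaults to 0, which ranks within each column; axis=1 ranks within each row
method: how ties are ranked, 'average' by default; one of 'average', 'first', 'min', 'max', 'dense'. Suppose the values 100, 150, 150, 200 are ranked in ascending order:
- average: average rank; tied values all receive the mean of their ordinal ranks (1, 2.5, 2.5, 4)
- first: ordinal rank; among tied values, the one that appears earlier in the DataFrame ranks earlier (1, 2, 3, 4). Note that, at least in older pandas versions, method='first' is not supported for non-numeric data
- min: minimum rank; tied values all receive the lowest of their ordinal ranks (1, 2, 2, 4)
- max: maximum rank; tied values all receive the highest of their ordinal ranks (1, 3, 3, 4)
- dense: dense rank; each rank equals the previous rank or exceeds it by one, with no gaps (1, 2, 2, 3)
numeric_only: bool, whether to rank only the numeric columns
na_option: how NaN values are ranked, 'keep' by default; one of 'keep', 'top', 'bottom':
- 'keep': the rank of a NaN is NaN
- 'top': NaN values are placed first in the ranking
- 'bottom': NaN values are placed last in the ranking
ascending: bool, whether to rank in ascending order, True by default
pct: bool, whether to return ranks as a fraction of the number of ranked values (between 0 and 1), False by default

Note: df.rank() ranks each field separately and cannot rank by several fields jointly; for that, use df.groupby(['sort field 1','sort field 2',...]).ngroup()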

import numpy as np
import pandas as pd
df = pd.DataFrame({'animal': ['cat', 'penguin', 'dog','spider', 'snake'],
                   'legs': [4, 2, 4, 8, np.nan]})
print("df\n",df,'\n----')
print("df.rank(method='average')\n",df.rank(method='average'),'\n----')
print("df.legs.rank(method='first')\n",df.legs.rank(method='first'),'\n----')
print("df.rank(method='min')\n",df.rank(method='min'),'\n----')
print("df.rank(method='max')\n",df.rank(method='max'),'\n----')
print("df.rank(method='dense')\n",df.rank(method='dense'),'\n====')
print("df.rank(method='min',na_option='top')\n",df.rank(method='min',na_option='top'),'\n----')
print("df.rank(method='min',na_option='bottom')\n",df.rank(method='min',na_option='bottom'),'\n====')
print("df.rank(method='min',pct=True)\n",df.rank(method='min',pct=True))
Output:
df
     animal  legs
0      cat   4.0
1  penguin   2.0
2      dog   4.0
3   spider   8.0
4    snake   NaN
----
df.rank(method='average')
    animal  legs
0     1.0   2.5
1     3.0   1.0
2     2.0   2.5
3     5.0   4.0
4     4.0   NaN
----
df.legs.rank(method='first')
 0    2.0
1    1.0
2    3.0
3    4.0
4    NaN
Name: legs, dtype: float64
----
df.rank(method='min')
    animal  legs
0     1.0   2.0
1     3.0   1.0
2     2.0   2.0
3     5.0   4.0
4     4.0   NaN
----
df.rank(method='max')
    animal  legs
0     1.0   3.0
1     3.0   1.0
2     2.0   3.0
3     5.0   4.0
4     4.0   NaN
----
df.rank(method='dense')
    animal  legs
0     1.0   2.0
1     3.0   1.0
2     2.0   2.0
3     5.0   3.0
4     4.0   NaN
====
df.rank(method='min',na_option='top')
    animal  legs
0     1.0   3.0
1     3.0   2.0
2     2.0   3.0
3     5.0   5.0
4     4.0   1.0
----
df.rank(method='min',na_option='bottom')
    animal  legs
0     1.0   2.0
1     3.0   1.0
2     2.0   2.0
3     5.0   4.0
4     4.0   5.0
====
df.rank(method='min',pct=True)
    animal  legs
0     0.2  0.50
1     0.6  0.25
2     0.4  0.50
3     1.0  1.00
4     0.8   NaN

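11. Grouping and aggregating a DataFrame

(1) Creating the group object group_obj

Grouping and aggregation of a DataFrame are both built on a group object, group_obj (an instance of the class pandas.core.groupby.generic.DataFrameGroupBy), so the first step is to create it:

Grouping by one column: group_obj = df.groupby('column label')
Grouping by several columns jointly: group_obj = df.groupby(['column label','column label',...])

(2) What the group object group_obj contains

Suppose the grouping rule given when creating group_obj produces n groups. Iterating over group_obj then yields n tuples. Item 0 of each tuple is the group key, that is, the value of the grouping column for that group (or a tuple of values, one per grouping column, when grouping by several columns); item 1 is the DataFrame holding the rows of that group.

(3) Using the group object group_obj

① Viewing one statistic of df with a specific method of group_obj

group_obj.size(): number of records in each group, including records with missing values (Series)
group_obj.max(): maximum of each group (DataFrame)
group_obj.min(): minimum of each group (DataFrame)
group_obj.sum(): sum of each group (DataFrame)
group_obj.mean(): mean of each group (DataFrame)
group_obj.std(): standard deviation of each group (DataFrame)
group_obj.count(): number of non-missing records in each group (DataFrame)
group_obj.cumsum(): running sum within each group, row by row (a DataFrame with as many rows as the original DataFrame)
group_obj.cumprod(): running product within each group, row by row (a DataFrame with as many rows as the original DataFrame)
group_obj.cumcount(): running count within each group, row by row, starting from 0 (a Series with as many rows as the original DataFrame)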
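
The cumulative group methods differ from the aggregating ones in that they return one value per original row. A minimal sketch on made-up data (the column names g and v are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'a', 'b'], 'v': [1, 2, 3, 4, 5]})
group_obj = df.groupby('g')

# Both results keep the original row order and restart inside every group
df['cumsum'] = group_obj['v'].cumsum()
df['cumcount'] = group_obj.cumcount()

print(df)
```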

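② Viewing all statistics of df with group_obj.describe()

group_obj.describe(): wide layout; not recommended because the table is too wide (DataFrame)
group_obj.describe().T: long layout; recommended (DataFrame)

③ Viewing selected statistics with group_obj.agg()

group_obj.agg([np.mean, np.std]): mean and standard deviation of every field (DataFrame)
group_obj.agg({'c1':np.mean, 'c2':np.std}): mean of field 'c1' and standard deviation of field 'c2' (DataFrame)

Note: the functions can also be given by name as strings, for example group_obj.agg(['mean', 'std']); recent pandas versions emit a FutureWarning when NumPy functions such as np.mean are passed here.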
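
As a sketch of agg() with function names given as strings, and of named aggregation (an agg() calling style not covered above), consider the following. The data and the output column names c1_mean and c2_std are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'g1': ['M', 'M', 'N', 'N'],
                   'c1': [1.0, 3.0, 2.0, 6.0],
                   'c2': [10.0, 14.0, 5.0, 5.0]})
group_obj = df.groupby('g1')

# Function names given as strings
both = group_obj.agg(['mean', 'std'])

# Named aggregation: output column name = (input column, function)
named = group_obj.agg(c1_mean=('c1', 'mean'), c2_std=('c2', 'std'))

print(both)
print(named)
```

Named aggregation yields flat column names, which can be easier to work with than the two-level columns of the list form.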

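④ Getting the data of one group with group_obj.get_group()

When grouped by one column: group_obj.get_group('a value of the grouping field')
When grouped by several columns jointly: group_obj.get_group(('a value of grouping field 1','a value of grouping field 2',...))

⑤ Ranking by several fields jointly with group_obj.ngroup()

Syntax: group_obj.ngroup(ascending=True)

Returns a Series that assigns a number to every row. Groups are numbered in the sorted order of their keys: the first grouping field is compared first, then the second when the first is equal, then the third, and so on. Records whose grouping fields are all equal belong to the same group and get the same number (note: numbering starts from 0).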

import numpy as np
import pandas as pd
# Create the DataFrame
data = [('a', 70, 5),
        ('b', 80, 4),
        ('c', 70, 4),
        ('d', 70, 5),
        ('e', 80, 5),
        ('f', 75, 4)]
df = pd.DataFrame(data, columns=['name', 'score', 'homework'])
df = df.sort_values(['score', 'homework'], ascending=False)
print(df, '\n----------')
s = df.groupby(['score', 'homework']).ngroup(ascending=False)
print(s, '\n----------\n', type(s), '\n----------')
df['ranking'] = s + 1
print(df)
Output:
  name  score  homework
4    e     80         5
1    b     80         4
5    f     75         4
0    a     70         5
3    d     70         5
2    c     70         4
----------
4    0
1    1
5    2
0    3
3    3
2    4
dtype: int64
----------
 <class 'pandas.core.series.Series'>
----------
  name  score  homework  ranking
4    e     80         5        1
1    b     80         4        2
5    f     75         4        3
0    a     70         5        4
3    d     70         5        4
2    c     70         4        5

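⑥ Implementing custom logic with group_obj.apply()

Syntax: group_obj.apply(func)

Parameter func: the name of a defined function, or an anonymous function (lambda)

Passes the DataFrame of each group in group_obj to func and calls func() on it, then concatenates the return values vertically into a DataFrame or Series. By default the group keys are added as the outermost level of the row index of the result (with several grouping columns, as several outer index levels). This combined object is the return value of group_obj.apply(). For an application of group_obj.apply(), see Example 3 below.

DataFrame objects have a similar apply() method (one difference is that df.apply(func,axis=0) takes an axis parameter). For how it works and how to use it, see "10. DataFrame methods and Pandas module functions - (5) Other important methods - ① df.apply()" in this chapter.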
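
Before the longer examples, a minimal sketch of group_obj.apply() may help. The data and the helper value_range are made up; the column selection [['v']] is used so that only that column, not the grouping column, is passed to the function.

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [1, 4, 10, 12]})

def value_range(group):
    # `group` is the DataFrame holding the rows of one group
    return group['v'].max() - group['v'].min()

# Selecting [['v']] passes only that column to func, not the grouping column
result = df.groupby('g')[['v']].apply(value_range)

print(result)
```

Since value_range() returns a scalar, the combined result is a Series indexed by the group keys.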

# Example 1: this example only examines what the group object group_obj contains (for using group_obj, see Example 2)
import numpy as np
import pandas as pd
# Create the dataset
np.random.seed(0)
period = pd.date_range('2019-9-22', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)
df['g1'] = np.random.choice(['M', 'N'], 1000)
df['g2'] = np.random.choice(['X', 'Y'], 1000)
for i in period:    # insert NaN values at random
    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan
    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan
print('df.head()\n',df.head()); print('===========')
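# Create the group object (this example examines what group_obj contains, so only the result of grouping by two columns is shown)
# group_obj = df.groupby('g1')				# group by one column
group_obj = df.groupby(['g1','g2'])		# group by two columns jointly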
for i in group_obj:
    print('开始一次group_obj的迭代\n',type(i)); print('-----------')
    for j in i:
        print(j); print('-----------')
        print(type(j)); print('-----------')
Output:
df.head()
                   c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
2019-09-23  0.978738  2.240893  N  Y
2019-09-24  1.867558 -0.977278  M  X
2019-09-25  0.950088 -0.151357  N  X
2019-09-26 -0.103219  0.410599  M  Y
===========
开始一次group_obj的迭代
 <class 'tuple'>
-----------
('M', 'X')
-----------
<class 'tuple'>
-----------
                  c1        c2 g1 g2
2019-09-24  1.867558 -0.977278  M  X
...              ...       ... .. ..
2022-06-17 -1.141901 -1.310970  M  X
[260 rows x 4 columns]
-----------
<class 'pandas.core.frame.DataFrame'>
-----------
开始一次group_obj的迭代
 <class 'tuple'>
-----------
('M', 'Y')
-----------
<class 'tuple'>
-----------
                  c1        c2 g1 g2
2019-09-26 -0.103219  0.410599  M  Y
...              ...       ... .. ..
2022-06-15  0.197828  0.097751  M  Y
[245 rows x 4 columns]
-----------
<class 'pandas.core.frame.DataFrame'>
-----------
开始一次group_obj的迭代
 <class 'tuple'>
-----------
('N', 'X')
-----------
<class 'tuple'>
-----------
                  c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
...              ...       ... .. ..
2022-06-14  1.315138 -0.323457  N  X
[256 rows x 4 columns]
-----------
<class 'pandas.core.frame.DataFrame'>
-----------
开始一次group_obj的迭代
 <class 'tuple'>
-----------
('N', 'Y')
-----------
<class 'tuple'>
-----------
                  c1        c2 g1 g2
2019-09-23  0.978738  2.240893  N  Y
...              ...       ... .. ..
2022-06-16  1.401523       NaN  N  Y
[239 rows x 4 columns]
-----------
<class 'pandas.core.frame.DataFrame'>

# Example 2: using group_obj (for group_obj.apply(), see Example 3)
import numpy as np
import pandas as pd
# Create the dataset
np.random.seed(0)
period = pd.date_range('2019-9-22', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)
df['g1'] = np.random.choice(['M', 'N'], 1000)
df['g2'] = np.random.choice(['X', 'Y'], 1000)
for i in period:    # insert NaN values at random
    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan
    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan
print('df.head()\n',df.head()); print('===========')
# Create the group object
group_obj = df.groupby('g1')				# group by one column
# group_obj = df.groupby(['g1','g2'])		# group by two columns jointly
print('group_obj\n',group_obj); print('===========')
print('type(group_obj)\n',type(group_obj)); print('===========')
# View one statistic with a specific method of the group object
print('group_obj.size()\n',group_obj.size(),'\n',type(group_obj.size())); print('===========')
print('group_obj.max()\n',group_obj.max(),'\n',type(group_obj.max())); print('===========')
print('group_obj.min()\n',group_obj.min(),'\n',type(group_obj.min())); print('===========')
print('group_obj.sum()\n',group_obj.sum(),'\n',type(group_obj.sum())); print('===========')
print('group_obj.mean()\n',group_obj.mean(),'\n',type(group_obj.mean())); print('===========')
print('group_obj.std()\n',group_obj.std(),'\n',type(group_obj.std())); print('===========')
print('group_obj.count()\n',group_obj.count(),'\n',type(group_obj.count())); print('===========')
# View all statistics with the describe() method of the group object
print('group_obj.describe()\n',group_obj.describe(),'\n',type(group_obj.describe())); print('===========')
print('group_obj.describe().T\n',group_obj.describe().T,'\n',type(group_obj.describe().T)); print('===========')
# View selected statistics with the agg() method of the group object
print('group_obj.agg([np.mean, np.std])\n',group_obj.agg([np.mean, np.std]),'\n',type(group_obj.agg([np.mean, np.std]))); print('===========')
print("group_obj.agg({'c1':np.mean, 'c2':np.std})\n",group_obj.agg({'c1':np.mean, 'c2':np.std}),'\n',type(group_obj.agg({'c1':np.mean, 'c2':np.std}))); print('===========')
# Get the data of one group with the get_group() method of the group object
# When grouped by one column
print("group_obj.get_group('M').head()\n",group_obj.get_group('M').head())
# When grouped by two columns jointly
# print("group_obj.get_group(('M','X')).head()\n",group_obj.get_group(('M','X')).head())

# Output of Example 2 (grouped by one column):
df.head()
                   c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
2019-09-23  0.978738  2.240893  N  Y
2019-09-24  1.867558 -0.977278  M  X
2019-09-25  0.950088 -0.151357  N  X
2019-09-26 -0.103219  0.410599  M  Y
===========
group_obj
 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000003DEB518>
===========
type(group_obj)
 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
===========
group_obj.size()
 g1
M    505
N    495
dtype: int64
 <class 'pandas.core.series.Series'>
===========
group_obj.max()
           c1        c2 g2
g1
M   2.680571  2.642936  Y
N   3.170975  2.759355  Y
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.min()
           c1        c2 g2
g1
M  -2.802203 -2.772593  X
N  -2.994613 -3.046143  X
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.sum()
            c1         c2
g1
M  -21.519519 -23.540760
N    3.243500   1.208014
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.mean()
           c1        c2
g1
M  -0.046080 -0.048338
N   0.006901  0.002549
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.std()
           c1        c2
g1
M   0.989397  0.982953
N   0.963094  0.980751
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.count()
      c1   c2   g2
g1
M   467  487  505
N   470  474  495
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.describe()
        c1                                ...        c2
    count      mean       std       min  ...       25%       50%       75%       max
g1                                       ...
M   467.0 -0.046080  0.989397 -2.802203  ... -0.735622 -0.065488  0.549966  2.642936
N   470.0  0.006901  0.963094 -2.994613  ... -0.600193 -0.003481  0.656109  2.759355
[2 rows x 16 columns]
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.describe().T
 g1                 M           N
c1 count  467.000000  470.000000
   mean    -0.046080    0.006901
   std      0.989397    0.963094
   min     -2.802203   -2.994613
   25%     -0.726487   -0.669359
   50%     -0.056133    0.038123
   75%      0.603422    0.669524
   max      2.680571    3.170975
c2 count  487.000000  474.000000
   mean    -0.048338    0.002549
   std      0.982953    0.980751
   min     -2.772593   -3.046143
   25%     -0.735622   -0.600193
   50%     -0.065488   -0.003481
   75%      0.549966    0.656109
   max      2.642936    2.759355
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.agg([np.mean, np.std])
           c1                  c2
        mean       std      mean       std
g1
M  -0.046080  0.989397 -0.048338  0.982953
N   0.006901  0.963094  0.002549  0.980751
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.agg({'c1':np.mean, 'c2':np.std})
           c1        c2
g1
M  -0.046080  0.982953
N   0.006901  0.980751
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.get_group('M').head()
                   c1        c2 g1 g2
2019-09-24  1.867558 -0.977278  M  X
2019-09-26 -0.103219  0.410599  M  Y
2019-09-29  0.443863  0.333674  M  Y
2019-09-30  1.494079 -0.205158  M  X
2019-10-05  0.045759 -0.187184  M  X

# Output of Example 2 (grouped by two columns jointly):
df.head()
                   c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
2019-09-23  0.978738  2.240893  N  Y
2019-09-24  1.867558 -0.977278  M  X
2019-09-25  0.950088 -0.151357  N  X
2019-09-26 -0.103219  0.410599  M  Y
===========
group_obj
 <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000003F5B518>
===========
type(group_obj)
 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
===========
group_obj.size()
 g1  g2
M   X     260
    Y     245
N   X     256
    Y     239
dtype: int64
 <class 'pandas.core.series.Series'>
===========
group_obj.max()
              c1        c2
g1 g2
M  X   2.320800  2.642936
   Y   2.680571  2.380745
N  X   3.170975  2.412454
   Y   2.497200  2.759355
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.min()
              c1        c2
g1 g2
M  X  -2.802203 -2.534554
   Y  -2.437564 -2.772593
N  X  -2.582797 -2.739677
   Y  -2.994613 -3.046143
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.sum()
               c1         c2
g1 g2
M  X  -18.444141 -21.132778
   Y   -3.075378  -2.407982
N  X   20.090225  -1.593149
   Y  -16.846726   2.801164
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.mean()
              c1        c2
g1 g2
M  X  -0.077496 -0.083860
   Y  -0.013430 -0.010247
N  X   0.084413 -0.006611
   Y  -0.072615  0.012022
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.std()
              c1        c2
g1 g2
M  X   1.002794  0.976511
   Y   0.976399  0.990480
N  X   0.985893  0.900264
   Y   0.934578  1.059461
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.count()
         c1   c2
g1 g2
M  X   238  252
   Y   229  235
N  X   238  241
   Y   232  233
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.describe()
           c1                      ...        c2
       count      mean       std  ...       50%       75%       max
g1 g2                             ...
M  X   238.0 -0.077496  1.002794  ... -0.122370  0.553665  2.642936
   Y   229.0 -0.013430  0.976399  ...  0.024612  0.548398  2.380745
N  X   238.0  0.084413  0.985893  ...  0.001248  0.547481  2.412454
   Y   232.0 -0.072615  0.934578  ... -0.008210  0.823504  2.759355
[4 rows x 16 columns]
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.describe().T
 g1                 M                       N
g2                 X           Y           X           Y
c1 count  238.000000  229.000000  238.000000  232.000000
   mean    -0.077496   -0.013430    0.084413   -0.072615
   std      1.002794    0.976399    0.985893    0.934578
   min     -2.802203   -2.437564   -2.582797   -2.994613
   25%     -0.799725   -0.652409   -0.531745   -0.708426
   50%     -0.031588   -0.061743    0.039490    0.018710
   75%      0.601873    0.604137    0.721432    0.591636
   max      2.320800    2.680571    3.170975    2.497200
c2 count  252.000000  235.000000  241.000000  233.000000
   mean    -0.083860   -0.010247   -0.006611    0.012022
   std      0.976511    0.990480    0.900264    1.059461
   min     -2.534554   -2.772593   -2.739677   -3.046143
   25%     -0.698973   -0.740036   -0.575788   -0.680178
   50%     -0.122370    0.024612    0.001248   -0.008210
   75%      0.553665    0.548398    0.547481    0.823504
   max      2.642936    2.380745    2.412454    2.759355
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.agg([np.mean, np.std])
              c1                  c2
           mean       std      mean       std
g1 g2
M  X  -0.077496  1.002794 -0.083860  0.976511
   Y  -0.013430  0.976399 -0.010247  0.990480
N  X   0.084413  0.985893 -0.006611  0.900264
   Y  -0.072615  0.934578  0.012022  1.059461
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.agg({'c1':np.mean, 'c2':np.std})
              c1        c2
g1 g2
M  X  -0.077496  0.976511
   Y  -0.013430  0.990480
N  X   0.084413  0.900264
   Y  -0.072615  1.059461
 <class 'pandas.core.frame.DataFrame'>
===========
group_obj.get_group(('M','X')).head()
                   c1        c2 g1 g2
2019-09-24  1.867558 -0.977278  M  X
2019-09-30  1.494079 -0.205158  M  X
2019-10-05  0.045759 -0.187184  M  X
2019-10-12 -1.048553 -1.420018  M  X
2019-10-13       NaN  1.950775  M  X

# Example 3: how group_obj.apply() works and how to use it
import numpy as np
import pandas as pd
# Build the dataset
np.random.seed(0)
period = pd.date_range('2019-9-22', periods=1000, freq='D')
df = pd.DataFrame(np.random.randn(1000, 2), columns=['c1','c2'], index = period)
df['g1'] = np.random.choice(['M', 'N'], 1000)
df['g2'] = np.random.choice(['X', 'Y'], 1000)
for i in period:    # randomly insert missing values
    if np.random.random() < 0.05: df.loc[i,'c1'] = np.nan
    if np.random.random() < 0.05: df.loc[i,'c2'] = np.nan
print('df.head()\n',df.head()); print('===========')
# Only grouping by two columns jointly is demonstrated here
# group_obj = df.groupby('g1')			# group by one column
group_obj = df.groupby(['g1','g2'])		# group by two columns jointly
def group_func(df):
    print('---group_func内部开始---')    # marker: entering group_func
    print(df)
    print(type(df))
    print('---group_func内部结束---')    # marker: leaving group_func
    # Sort df by column 'c1' in ascending order and return the first two rows
    return df.sort_values(['c1'], ascending=True)[:2]
result = group_obj.apply(group_func)
print('===========')
print(result); print('-----------')
print(type(result)); print('-----------')
print(result.index)
Output:
df.head()
                   c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
2019-09-23  0.978738  2.240893  N  Y
2019-09-24  1.867558 -0.977278  M  X
2019-09-25  0.950088 -0.151357  N  X
2019-09-26 -0.103219  0.410599  M  Y
===========
---group_func内部开始---
                  c1        c2 g1 g2
2019-09-24  1.867558 -0.977278  M  X
...              ...       ... .. ..
2022-06-17 -1.141901 -1.310970  M  X
[260 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
---group_func内部结束---
---group_func内部开始---
                  c1        c2 g1 g2
2019-09-26 -0.103219  0.410599  M  Y
...              ...       ... .. ..
2022-06-15  0.197828  0.097751  M  Y
[245 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
---group_func内部结束---
---group_func内部开始---
                  c1        c2 g1 g2
2019-09-22       NaN  0.400157  N  X
...              ...       ... .. ..
2022-06-14  1.315138 -0.323457  N  X
[256 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
---group_func内部结束---
---group_func内部开始---
                  c1        c2 g1 g2
2019-09-23  0.978738  2.240893  N  Y
...              ...       ... .. ..
2022-06-16  1.401523       NaN  N  Y
[239 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
---group_func内部结束---
===========
                        c1        c2 g1 g2
g1 g2
M  X  2021-08-31 -2.802203 -1.188424  M  X
      2020-03-07 -2.659172  0.606320  M  X
   Y  2022-03-03 -2.437564  1.114925  M  Y
      2020-05-13 -2.288620  0.251484  M  Y
N  X  2020-11-20 -2.582797 -1.153950  N  X
      2020-06-12 -2.369587  0.864052  N  X
   Y  2021-09-14 -2.994613  0.880938  N  Y
      2021-06-11 -2.777359  1.151734  N  Y
-----------
<class 'pandas.core.frame.DataFrame'>
-----------
MultiIndex([('M', 'X', '2021-08-31'),
            ('M', 'X', '2020-03-07'),
            ('M', 'Y', '2022-03-03'),
            ('M', 'Y', '2020-05-13'),
            ('N', 'X', '2020-11-20'),
            ('N', 'X', '2020-06-12'),
            ('N', 'Y', '2021-09-14'),
            ('N', 'Y', '2021-06-11')],
           names=['g1', 'g2', None])
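
The output above shows that apply() calls the function once per group, hands it that group's rows as a DataFrame, and then stacks the returned pieces under the group keys. A minimal sketch of the same idea on hand-checkable data follows; the toy values and the names two_smallest, via_apply and via_head are made up for illustration, and the sort-then-head() variant is only one possible alternative, not something the notes above use.

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
df = pd.DataFrame({
    'g': ['M', 'M', 'M', 'N', 'N', 'N'],
    'c1': [3.0, 1.0, 2.0, 6.0, 4.0, 5.0],
})

def two_smallest(group):
    # Called once per group; `group` holds only that group's rows
    return group.sort_values('c1').head(2)

# Select the value column explicitly so the function receives just 'c1'
via_apply = df.groupby('g')[['c1']].apply(two_smallest)
print(via_apply)

# Same rows without apply(): sort once, then keep the first two rows of each group
via_head = df.sort_values('c1').groupby('g').head(2)
print(via_head)
```

The apply() result carries the group key as an extra outer index level, while the head() result keeps the original flat index.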

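12. Merging DataFrames

DataFrames can be combined through the DataFrame constructor, concat(), join(), or merge(). Of these, merge() is the most powerful for key-based joins, and it can also reproduce the horizontal joins that the other methods perform.

(1) Merging through the DataFrame constructor

The most primitive approach: build a dict by hand, with one key (column label) and one value (column data) per column. Row labels can only be combined as an outer join (union).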

import numpy as np
import pandas as pd
# Define the source data
df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])
# Merge
df = pd.DataFrame({'C2': df1['c2'], 'C3': df2['c3']})
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df\n',df)
Output:
df1
     c2  c1
t2   1   2
t1   3   4
===========
df2
     c3  c1
t3   5   6
t1   7   8
===========
df
      C2   C3
t1  3.0  7.0
t2  1.0  NaN
t3  NaN  5.0

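(2) df.append()

Syntax: df.append(obj, sort=???, ignore_index=False)

Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so this subsection applies only to older versions; on current versions use pd.concat() instead.

The overall effect is similar to pd.concat(): a vertical concatenation in which row labels are not merged and keep their original order, while column labels are merged.

Parameters:

obj: the object to append, either a DataFrame or a Series. A Series must satisfy at least one of the two conditions below (that is, either the appended row has a name of its own to serve as its row label, or all row labels are ignored); otherwise an error is raised:
- the Series has a name attribute
- ignore_index=True is passed to df.append()
sort: bool, whether the columns of the result are sorted by label. Note: in the pandas version used here the default is None (which resolves to True or False depending on the situation), and a future version will drop that default. To be safe, always pass sort explicitly, whether True or False; omitting it triggers a warning.
ignore_index: whether the result discards all row labels (resetting them to the positions 0, 1, 2, ...). Default: False.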

import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore') # suppress warnings that may appear (a warning is not an error and can be ignored); they can come from, e.g., df.ix[] and pd.concat()
# Define the source data
df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])
s1 = pd.Series([9,10,11],index=['c2','c4','c3'], name='s1')
s2 = pd.Series([12,13,14],index=['c2','c4','c3'])
# Concatenate
df3 = df1.append(df2, sort=False)
df4 = df1.append(df2, sort=True)
df5 = df1.append(df2, sort=False, ignore_index=True)
df6 = df1.append(s1, sort=False)
df7 = df1.append(s2, sort=False, ignore_index=True)
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7)
Output:
df1
     c2  c1
t2   1   2
t1   3   4
===========
df2
     c3  c1
t3   5   6
t1   7   8
===========
df3
      c2  c1   c3
t2  1.0   2  NaN
t1  3.0   4  NaN
t3  NaN   6  5.0
t1  NaN   8  7.0
===========
df4
     c1   c2   c3
t2   2  1.0  NaN
t1   4  3.0  NaN
t3   6  NaN  5.0
t1   8  NaN  7.0
===========
df5
     c2  c1   c3
0  1.0   2  NaN
1  3.0   4  NaN
2  NaN   6  5.0
3  NaN   8  7.0
===========
df6
      c2   c1    c3    c4
t2  1.0  2.0   NaN   NaN
t1  3.0  4.0   NaN   NaN
s1  9.0  NaN  11.0  10.0
===========
df7
      c2   c1    c3    c4
0   1.0  2.0   NaN   NaN
1   3.0  4.0   NaN   NaN
2  12.0  NaN  14.0  13.0
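
Because df.append() no longer exists in pandas 2.0 and later, the calls above can be rewritten with pd.concat(). The sketch below is one way to do it, assuming the same df1, df2 and s1 as in the example; the variable names rows_added, rows_renumbered and series_added are made up here, and the column order of the Series case may differ from the old append() output.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['c3', 'c1'], index=['t3', 't1'])
s1 = pd.Series([9, 10, 11], index=['c2', 'c4', 'c3'], name='s1')

# Replacement for df1.append(df2, sort=False)
rows_added = pd.concat([df1, df2], sort=False)

# Replacement for df1.append(df2, sort=False, ignore_index=True)
rows_renumbered = pd.concat([df1, df2], sort=False, ignore_index=True)

# Replacement for df1.append(s1, sort=False):
# turn the Series into a one-row DataFrame whose row label is the Series name
series_added = pd.concat([df1, s1.to_frame().T], sort=False)

print(rows_added)
print(rows_renumbered)
print(series_added)
```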

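(3) df.join()

Syntax: df1.join(df2, how='left', lsuffix='', rsuffix='')

Horizontal joins only. Column labels must not collide and are never merged (on a collision a suffix must be specified). Row labels can be combined with a left, right, inner, or outer join (left by default).

Parameters:

how: how the row labels are combined
- default 'left': left join, keeps every row label of the left DataFrame
- how='right': right join, keeps every row label of the right DataFrame
- how='inner': inner join, keeps the intersection of the row labels of df1 and df2
- how='outer': outer join, keeps the union of the row labels of df1 and df2
lsuffix: suffix added to the left DataFrame's label for a colliding column; default is the empty string ''
rsuffix: suffix added to the right DataFrame's label for a colliding column; default is the empty string ''

Note:

When column names collide, at least one of lsuffix and rsuffix must be non-empty; otherwise an error is raised.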

import numpy as np
import pandas as pd
# Define the source data
df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])
# Join
df3 = df1.join(df2, lsuffix='_l', rsuffix='_r')
df4 = df1.join(df2, how='left', lsuffix='_l', rsuffix='_r')
df5 = df1.join(df2, how='right', lsuffix='_l', rsuffix='_r')
df6 = df1.join(df2, how='inner', lsuffix='_l', rsuffix='_r')
df7 = df1.join(df2, how='outer', lsuffix='_l', rsuffix='_r')
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7)
Output:
df1
     c2  c1
t2   1   2
t1   3   4
===========
df2
     c3  c1
t3   5   6
t1   7   8
===========
df3
     c2  c1_l   c3  c1_r
t2   1     2  NaN   NaN
t1   3     4  7.0   8.0
===========
df4
     c2  c1_l   c3  c1_r
t2   1     2  NaN   NaN
t1   3     4  7.0   8.0
===========
df5
      c2  c1_l  c3  c1_r
t3  NaN   NaN   5     6
t1  3.0   4.0   7     8
===========
df6
     c2  c1_l  c3  c1_r
t1   3     4   7     8
===========
df7
      c2  c1_l   c3  c1_r
t1  3.0   4.0  7.0   8.0
t2  1.0   2.0  NaN   NaN
t3  NaN   NaN  5.0   6.0

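(4) pd.concat()

Syntax: pd.concat(objs, axis=0, join='outer', ignore_index=False, sort=???, keys=None, names=None)

Concatenates vertically (along the y axis) or horizontally (along the x axis). Labels on the concatenation axis are not merged and keep their original order; labels on the other axis are merged.

Parameters:

objs: an iterable of DataFrames, such as (df1,df2) or [df1,df2,df3]
axis: default 0, vertical concatenation: row labels are not merged and keep their order, column labels are merged. With axis=1, horizontal concatenation: column labels are not merged and keep their order, row labels are merged.
join: default 'outer', an outer join (union) on the non-concatenation axis; join='inner' gives an inner join (intersection) on the non-concatenation axis

Note: join accepts only 'outer' or 'inner', nothing else
ignore_index: default False, keeps the labels on the concatenation axis; ignore_index=True discards them and renumbers that axis 0, 1, 2, ...
sort: when join='outer', whether the non-concatenation axis is sorted by label; when join='inner', sort has no effect. Note: in the pandas version used here the default is True and a future version will change it to False (pandas 1.0 and later do default to False). To be safe, whenever join='outer', pass sort explicitly whether it is True or False; on the version used here, omitting it triggers a warning
keys: a list. With axis=0, it adds an outermost level to the hierarchical row index, and each element of the list is one label of that level, so the list must be as long as objs (each DataFrame in objs receives one outermost label). With axis=1 the same is done to the column index. For sample code see "AQF Notes - Part 2 - Chapter 7 - Financial data source processing - III. Processing financial data - 2. Fetching several stock prices at once"
names: a list. With axis=0, each element is the name of one level of the hierarchical row index (each level is an Index, so this sets the name of every level in one go). With axis=1 the same is done to the column index. For sample code see "AQF Notes - Part 2 - Chapter 7 - Financial data source processing - III. Processing financial data - 2. Fetching several stock prices at once"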

import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore') # suppress warnings that may appear (a warning is not an error and can be ignored); they can come from, e.g., df.ix[] and pd.concat()
# Define the source data
df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])
# Concatenate
df3 = pd.concat((df1,df2), sort=True)
df4 = pd.concat((df1,df2), sort=False)
df5 = pd.concat((df1,df2), sort=True, join='inner')
df6 = pd.concat((df1,df2), axis=1, sort=True)
df7 = pd.concat((df1,df2), ignore_index=True, sort=True)
df8 = pd.concat((df1,df2), axis=1, ignore_index=True, sort=True)
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7); print('===========')
print('df8\n',df8)
Output:
df1
     c2  c1
t2   1   2
t1   3   4
===========
df2
     c3  c1
t3   5   6
t1   7   8
===========
df3
     c1   c2   c3
t2   2  1.0  NaN
t1   4  3.0  NaN
t3   6  NaN  5.0
t1   8  NaN  7.0
===========
df4
      c2  c1   c3
t2  1.0   2  NaN
t1  3.0   4  NaN
t3  NaN   6  5.0
t1  NaN   8  7.0
===========
df5
     c1
t2   2
t1   4
t3   6
t1   8
===========
df6
      c2   c1   c3   c1
t1  3.0  4.0  7.0  8.0
t2  1.0  2.0  NaN  NaN
t3  NaN  NaN  5.0  6.0
===========
df7
    c1   c2   c3
0   2  1.0  NaN
1   4  3.0  NaN
2   6  NaN  5.0
3   8  NaN  7.0
===========
df8
       0    1    2    3
t1  3.0  4.0  7.0  8.0
t2  1.0  2.0  NaN  NaN
t3  NaN  NaN  5.0  6.0
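
The example above does not exercise keys and names. The sketch below shows what they do for a vertical concatenation, reusing df1 and df2; the labels 'first' and 'second' and the level names 'source' and 'row' are arbitrary choices made for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['c3', 'c1'], index=['t3', 't1'])

# One key per input DataFrame; the keys become the outermost row-index level
stacked = pd.concat(
    [df1, df2],
    keys=['first', 'second'],
    names=['source', 'row'],   # names for the two levels of the resulting row index
    sort=False,
)
print(stacked)

# The outer level makes it easy to recover the rows that came from one input
print(stacked.loc['second'])
```

With keys in place, the duplicate row label 't1' is no longer ambiguous, since ('first', 't1') and ('second', 't1') are distinct index entries.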

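(5) pd.merge()

Horizontal joins only. Column labels never collide and are never merged (a suffix is added automatically on a collision). Rows can be combined with an inner, outer, left, or right join (inner by default). Depending on how the key is chosen, there are three call patterns (matching Examples 1, 2, and 3 in the sample code below):

Use the row labels as the key (rows with the same label are merged into one row); the row labels are preserved:
```
pd.merge(df1, df2, how='inner', left_index=True, right_index=True, sort=False, suffixes=('_x', '_y'))
```
Note: with the same how, this gives the same result as join(). Also, left_index and right_index both default to False, so they must be passed explicitly as keyword arguments to use this pattern
Use a column that has the same name in df1 and df2 as the key (rows with equal values in that column are merged into one row); the row labels are lost:
```
pd.merge(df1, df2, how='inner', on='key column label', sort=False, suffixes=('_x', '_y'))
```
Note: when on is omitted, pandas uses all columns that df1 and df2 have in common as a combined key, which may not be the key you intended, so it is best to pass on and state explicitly which column is the key

Use one column of df1 and a differently named column of df2 as the keys (rows with equal key values are merged into one row); the row labels are lost:

pd.merge(df1, df2, how='inner', left_on='key column label in df1', right_on='key column label in df2', sort=False, suffixes=('_x','_y'))

Parameters common to all three patterns:

how: how the keys are combined
- default 'inner': inner join, the key column holds the intersection of the keys of df1 and df2
- how='outer': outer join, the key column holds the union of the keys of df1 and df2
- how='left': left join, the key column is the key column of the left DataFrame df1
- how='right': right join, the key column is the key column of the right DataFrame df2
sort: whether the result is sorted by the key column; default False
suffixes: suffixes added to non-key columns whose names collide, given as a tuple whose first item is used for df1 and whose second item is used for df2; default ('_x', '_y'). The two items may be equal, but they must not both be empty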

# Example 1: row labels as the key (rows with the same label are merged into one row)
import numpy as np
import pandas as pd
# Define the source data
df1 = pd.DataFrame([[1,2],[3,4]], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[5,6],[7,8]], columns=['c3','c1'], index=['t3','t1'])
# Merge
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
df4 = pd.merge(df1,df2,how='outer',left_index=True,right_index=True)
df5 = pd.merge(df1,df2,how='outer',left_index=True,right_index=True,sort=True,suffixes=('_df1', '_df2'))
df6 = pd.merge(df1,df2,how='left',left_index=True,right_index=True)
df7 = pd.merge(df1,df2,how='right',left_index=True,right_index=True)
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7)
Output:
df1
     c2  c1
t2   1   2
t1   3   4
===========
df2
     c3  c1
t3   5   6
t1   7   8
===========
df3
     c2  c1_x  c3  c1_y
t1   3     4   7     8
===========
df4
      c2  c1_x   c3  c1_y
t1  3.0   4.0  7.0   8.0
t2  1.0   2.0  NaN   NaN
t3  NaN   NaN  5.0   6.0
===========
df5
      c2  c1_df1   c3  c1_df2
t1  3.0     4.0  7.0     8.0
t2  1.0     2.0  NaN     NaN
t3  NaN     NaN  5.0     6.0
===========
df6
     c2  c1_x   c3  c1_y
t2   1     2  NaN   NaN
t1   3     4  7.0   8.0
===========
df7
      c2  c1_x  c3  c1_y
t3  NaN   NaN   5     6
t1  3.0   4.0   7     8

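# Example 2: the same-named column 'c1' of df1 and df2 as the key (rows with equal 'c1' values are merged into one row); the row labels are lost
import numpy as np
import pandas as pd
# Define the source data
df1 = pd.DataFrame([[1,'C'],[2,'B']], columns=['c2','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[3,'C'],[4,'A']], columns=['c2','c1'], index=['t2','t1'])
# Merge
df3 = pd.merge(df1,df2)  # on is omitted, so every column common to df1 and df2 is used: the key here is 'c2' and 'c1' together, not 'c1' alone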
df4 = pd.merge(df1,df2,how='inner',on='c1')
df5 = pd.merge(df1,df2,how='outer',on='c1')
df6 = pd.merge(df1,df2,how='outer',on='c1',sort=True,suffixes=('_df1', '_df2'))
df7 = pd.merge(df1,df2,how='left',on='c1')
df8 = pd.merge(df1,df2,how='right',on='c1')
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7); print('===========')
print('df8\n',df8)
Output:
df1
     c2 c1
t2   1  C
t1   2  B
===========
df2
     c2 c1
t2   3  C
t1   4  A
===========
df3
 Empty DataFrame
Columns: [c2, c1]
Index: []
===========
df4
    c2_x c1  c2_y
0     1  C     3
===========
df5
    c2_x c1  c2_y
0   1.0  C   3.0
1   2.0  B   NaN
2   NaN  A   4.0
===========
df6
    c2_df1 c1  c2_df2
0     NaN  A     4.0
1     2.0  B     NaN
2     1.0  C     3.0
===========
df7
    c2_x c1  c2_y
0     1  C   3.0
1     2  B   NaN
===========
df8
    c2_x c1  c2_y
0   1.0  C     3
1   NaN  A     4
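
The empty df3 above follows from how pandas picks the key when on is omitted: it joins on every column the two frames share. A small check of that behaviour, reusing the frames from Example 2 (the names implicit, explicit and single_key are made up for this sketch):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'C'], [2, 'B']], columns=['c2', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[3, 'C'], [4, 'A']], columns=['c2', 'c1'], index=['t2', 't1'])

# No `on`: all shared columns ('c2' and 'c1') form the key
implicit = pd.merge(df1, df2)

# Spelling out the same composite key gives the same (empty) result
explicit = pd.merge(df1, df2, on=['c2', 'c1'])

# Joining on 'c1' alone finds the one matching row and suffixes the other shared column
single_key = pd.merge(df1, df2, on='c1')

print(implicit)
print(explicit)
print(single_key)
```

No suffixes appear in the first two results because both shared columns are key columns; only non-key columns with clashing names receive '_x' and '_y'.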

# Example 3: column 'c1' of df1 and column 'c2' of df2 as the keys (rows with equal key values are merged into one row); the row labels are lost
import numpy as np
import pandas as pd
# Define the source data
df1 = pd.DataFrame([[1,'C'],[2,'B']], columns=['c3','c1'], index=['t2','t1'])
df2 = pd.DataFrame([[3,'C'],[4,'A']], columns=['c3','c2'], index=['t2','t1'])
# Merge
df3 = pd.merge(df1,df2,left_on='c1',right_on='c2')
df4 = pd.merge(df1,df2,how='outer',left_on='c1',right_on='c2')
df5 = pd.merge(df1,df2,how='outer',left_on='c1',right_on='c2',sort=True,suffixes=('_df1','_df2'))
df6 = pd.merge(df1,df2,how='left',left_on='c1',right_on='c2')
df7 = pd.merge(df1,df2,how='right',left_on='c1',right_on='c2')
print('df1\n',df1); print('===========')
print('df2\n',df2); print('===========')
print('df3\n',df3); print('===========')
print('df4\n',df4); print('===========')
print('df5\n',df5); print('===========')
print('df6\n',df6); print('===========')
print('df7\n',df7)
Output:
df1
     c3 c1
t2   1  C
t1   2  B
===========
df2
     c3 c2
t2   3  C
t1   4  A
===========
df3
    c3_x c1  c3_y c2
0     1  C     3  C
===========
df4
    c3_x   c1  c3_y   c2
0   1.0    C   3.0    C
1   2.0    B   NaN  NaN
2   NaN  NaN   4.0    A
===========
df5
    c3_df1   c1  c3_df2   c2
0     NaN  NaN     4.0    A
1     2.0    B     NaN  NaN
2     1.0    C     3.0    C
===========
df6
    c3_x c1  c3_y   c2
0     1  C   3.0    C
1     2  B   NaN  NaN
===========
df7
    c3_x   c1  c3_y c2
0   1.0    C     3  C
1   NaN  NaN     4  A
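
Examples 2 and 3 lose the row labels because a column-keyed merge returns a fresh 0, 1, 2, ... index. If the labels matter, one possible workaround (not part of the original notes) is to move them into ordinary columns before merging. The sketch below assumes the frames from Example 3; the column names 'row_df1' and 'row_df2' are invented here, and indicator=True is added only to show where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'C'], [2, 'B']], columns=['c3', 'c1'], index=['t2', 't1'])
df2 = pd.DataFrame([[3, 'C'], [4, 'A']], columns=['c3', 'c2'], index=['t2', 't1'])

# Name each index, then turn it into a regular column so it survives the merge
left = df1.rename_axis('row_df1').reset_index()
right = df2.rename_axis('row_df2').reset_index()

merged = pd.merge(
    left, right,
    how='outer',
    left_on='c1', right_on='c2',
    indicator=True,          # adds a '_merge' column: 'both', 'left_only' or 'right_only'
)
print(merged)
```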

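13. Hierarchical indexes on Series and DataFrames

Since the Panel data type has been removed from recent versions of Pandas, a hierarchical index can be used to get the effect of a Panel indirectly

(1) Hierarchical indexes on a Series

① Creating a Series with a hierarchical index

The syntax is the same as for an ordinary Series; just give index a multi-dimensional structure. Once a Series has a hierarchical index, the type of s.index becomes pandas.core.indexes.multi.MultiIndex

The index listed first (the uppercase letters in the example below) is the outer index, with level 0; the index listed after it (the lowercase letters below) is the inner index, and levels count upward as integers (here it has level=1).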

import numpy as np
import pandas as pd
s = pd.Series([1,2,3,4,5,6,7,8],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
print('s','\n',s); print('===========')
print('s.index','\n',s.index,'\n',type(s.index))
Output:
s
 A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
dtype: int64
===========
s.index
 MultiIndex([('A', 'e'),
            ('A', 'f'),
            ('B', 'e'),
            ('B', 'g'),
            ('C', 'f'),
            ('C', 'h'),
            ('D', 'g'),
            ('D', 'h')],
           )
 <class 'pandas.core.indexes.multi.MultiIndex'>

② Indexing and slicing a Series with a hierarchical index

import numpy as np
import pandas as pd
s = pd.Series([1,2,3,4,5,6,7,8],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
print('s','\n',s); print('===========')
print("s['A']",'\n',s['A'],'\n',type(s['A'])); print('===========')
print("s['A':'C']",'\n',s['A':'C'],'\n',type(s['A':'C'])); print('===========')
print("s[['A','C']]",'\n',s[['A','C']],'\n',type(s[['A','C']])); print('===========')
print("s[:,'f']",'\n',s[:,'f'],'\n',type(s[:,'f'])); print('===========')
print("s['A','e']",'\n',s['A','e'],'\n',type(s['A','e'])); print('===========')
# The following forms raise an error:
# print(s['A':'C','f'])
# print(s[['A','C'],'f'])
# print(s[:,'e':'f'])
# print(s[:,['e','f']])
Output:
s
 A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
dtype: int64
===========
s['A']
 e    1
f    2
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s['A':'C']
 A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s[['A','C']]
 A  e    1
   f    2
C  f    5
   h    6
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s[:,'f']
 A    2
C    5
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s['A','e']
 1
 <class 'numpy.int64'>
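
The selections that the commented-out lines attempt can still be expressed in other ways. A minimal sketch, assuming the same s as above: xs() takes a cross-section on one level, and boolean masks built from get_level_values() handle combinations of levels. The variable names are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
              index=[
                  ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                  ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']
              ])

outer = s.index.get_level_values(0)   # labels of the outer level, one per row
inner = s.index.get_level_values(1)   # labels of the inner level, one per row

# Every row whose inner label is 'f' (the inner level is dropped from the result)
only_f = s.xs('f', level=1)

# Outer label 'A' or 'C' combined with inner label 'f'
a_or_c_f = s[outer.isin(['A', 'C']) & (inner == 'f')]

# Any outer label combined with inner label 'e' or 'f'
e_or_f = s[inner.isin(['e', 'f'])]

print(only_f)
print(a_or_c_f)
print(e_or_f)
```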

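③ Group aggregation on a Series with a hierarchical index

s.sum(level=0) is equivalent to s.groupby(level=0).sum()

s.sum(level=1) is equivalent to s.groupby(level=1).sum()

Note: the level parameter of s.sum() was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions only the groupby form works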

import numpy as np
import pandas as pd
s = pd.Series([1,2,3,4,5,6,7,8],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
print('s','\n',s); print('===========')
s1 = s.sum(level=0)
s2 = s.sum(level=1)
s3 = s.groupby(level=0).sum()
s4 = s.groupby(level=1).sum()
print('s1','\n',s1,'\n',type(s1)); print('===========')
print('s2','\n',s2,'\n',type(s2)); print('===========')
print('s3','\n',s3,'\n',type(s3)); print('===========')
print('s4','\n',s4,'\n',type(s4))
Output:
s
 A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
dtype: int64
===========
s1
 A     3
B     7
C    11
D    15
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s2
 e     4
f     7
g    11
h    14
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s3
 A     3
B     7
C    11
D    15
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s4
 e     4
f     7
g    11
h    14
dtype: int64
 <class 'pandas.core.series.Series'>

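(2) Hierarchical indexes on a DataFrame

① Creating a DataFrame with a hierarchical index

The syntax is the same as for an ordinary DataFrame; just give index a multi-dimensional structure. Once a DataFrame has a hierarchical index, the type of df.index becomes pandas.core.indexes.multi.MultiIndex

For the difference between df.index.name and df.index.names on a hierarchical index, see "II. The Pandas module - 5. DataFrame attributes - (3) df.index.name and df.index.names" in this chapter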

import numpy as np
import pandas as pd
# Construction method 1
df = pd.DataFrame([1,2,3,4,5,6,7,8])
df.columns = ['c1']
df.index = [['A','A','B','B','C','C','D','D'],
            ['e','f','e','g','f','h','g','h']]
df.index.name='my_index_name'
df.index.names = ['i1','i2']
# Construction method 2 (the two methods are equivalent)
"""
df = pd.DataFrame([['A','e',1],
                  ['A','f',2],
                  ['B','e',3],
                  ['B','g',4],
                  ['C','f',5],
                  ['C','h',6],
                  ['D','g',7],
                  ['D','h',8]])
df.columns=['i1','i2','c1']
df = df.set_index(['i1','i2'])
df.index.name='my_index_name'
"""
print('df\n',df,'\n',type(df)); print('===========')
print('df.index\n',df.index,'\n',type(df.index)); print('-----------')
print(df.index.name); print('-----------')
print(df.index.names); print('===========')
print('df.columns\n',df.columns,'\n',type(df.columns))
Output:
df
        c1
i1 i2
A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
 <class 'pandas.core.frame.DataFrame'>
===========
df.index
 MultiIndex([('A', 'e'),
            ('A', 'f'),
            ('B', 'e'),
            ('B', 'g'),
            ('C', 'f'),
            ('C', 'h'),
            ('D', 'g'),
            ('D', 'h')],
           name='my_index_name')
 <class 'pandas.core.indexes.multi.MultiIndex'>
-----------
my_index_name
-----------
['i1', 'i2']
===========
df.columns
 Index(['c1'], dtype='object')
 <class 'pandas.core.indexes.base.Index'>

② Indexing and slicing a DataFrame with a hierarchical index

import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['c1'],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
print('df\n',df,'\n',type(df)); print('===========')
print("df.loc['A']",'\n',df.loc['A'],'\n',type(df.loc['A'])); print('===========')
print("df.loc['A':'C']",'\n',df.loc['A':'C'],'\n',type(df.loc['A':'C'])); print('===========')
print("df.loc[['A','C']]",'\n',df.loc[['A','C']],'\n',type(df.loc[['A','C']])); print('===========')
print("df.loc[('A','f')]",'\n',df.loc[('A','f')],'\n',type(df.loc[('A','f')])); print('===========')
print("df.loc[('A','f'),'c1']",'\n',df.loc[('A','f'),'c1'],'\n',type(df.loc[('A','f'),'c1'])); print('===========')
# The following forms raise an error:
# print("df.loc[:,'f']",'\n',df.loc[:,'f'],'\n',type(df.loc[:,'f'])); print('===========')
# print("df.loc[(:,'f')]",'\n',df.loc[(:,'f')],'\n',type(df.loc[(:,'f')])); print('===========')
# print("df.loc[('A','f':'h')]",'\n',df.loc[('A','f':'h')],'\n',type(df.loc[('A','f':'h')])); print('===========')
# print("df.loc[(['A','C'],'f')]",'\n',df.loc[(['A','C'],'f')],'\n',type(df.loc[(['A','C'],'f')])); print('===========')
# print("df.loc[('A',['f','h'])]",'\n',df.loc[('A',['f','h'])],'\n',type(df.loc[('A',['f','h'])])); print('===========')
Output:
df
      c1
A e   1
  f   2
B e   3
  g   4
C f   5
  h   6
D g   7
  h   8
 <class 'pandas.core.frame.DataFrame'>
===========
df.loc['A']
    c1
e   1
f   2
 <class 'pandas.core.frame.DataFrame'>
===========
df.loc['A':'C']
      c1
A e   1
  f   2
B e   3
  g   4
C f   5
  h   6
 <class 'pandas.core.frame.DataFrame'>
===========
df.loc[['A','C']]
      c1
A e   1
  f   2
C f   5
  h   6
 <class 'pandas.core.frame.DataFrame'>
===========
df.loc[('A','f')]
 c1    2
Name: (A, f), dtype: int64
 <class 'pandas.core.series.Series'>
===========
df.loc[('A','f'),'c1']
 2
 <class 'numpy.int64'>
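
The forms listed as failing above mostly try to slice or list-select on the inner level. pandas supports this through per-level slicers, provided a column selector is also given and the index is sorted (as it is here). The sketch below uses pd.IndexSlice and xs() on the same df; it is one way to write these selections, and the variable names are made up.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4, 5, 6, 7, 8]},
                  index=[
                      ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                      ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']
                  ])
idx = pd.IndexSlice   # helper that lets ':' be used inside a per-level selector

# Any outer label, inner label 'f'; the trailing ':' selects all columns
all_f = df.loc[idx[:, 'f'], :]

# Outer label 'A', inner labels from 'f' through 'h'
a_f_to_h = df.loc[idx['A', 'f':'h'], :]

# Outer label 'A' or 'C', inner label 'f'
a_or_c_f = df.loc[idx[['A', 'C'], 'f'], :]

# Outer label 'A', inner label 'e' or 'f'
a_e_or_f = df.loc[idx['A', ['e', 'f']], :]

# Cross-section on the inner level; unlike the slicers, xs() drops that level
f_xs = df.xs('f', level=1)

print(all_f, a_f_to_h, a_or_c_f, a_e_or_f, f_xs, sep='\n===========\n')
```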

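③ Group aggregation on a DataFrame with a hierarchical index

df.sum(level=0) is equivalent to df.groupby(level=0).sum()

df.sum(level=1) is equivalent to df.groupby(level=1).sum()

Note: the level parameter of df.sum() was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions only the groupby form works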

import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['c1'],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
print('df\n',df,'\n',type(df)); print('===========')
df1 = df.sum(level=0)
df2 = df.groupby(level=0).sum()
df3 = df.sum(level=1)
df4 = df.groupby(level=1).sum()
print('df1','\n',df1,'\n',type(df1)); print('===========')
print('df2','\n',df2,'\n',type(df2)); print('===========')
print('df3','\n',df3,'\n',type(df3)); print('===========')
print('df4','\n',df4,'\n',type(df4))
Output:
df
      c1
A e   1
  f   2
B e   3
  g   4
C f   5
  h   6
D g   7
  h   8
 <class 'pandas.core.frame.DataFrame'>
===========
df1
    c1
A   3
B   7
C  11
D  15
 <class 'pandas.core.frame.DataFrame'>
===========
df2
    c1
A   3
B   7
C  11
D  15
 <class 'pandas.core.frame.DataFrame'>
===========
df3
    c1
e   4
f   7
g  11
h  14
 <class 'pandas.core.frame.DataFrame'>
===========
df4
    c1
e   4
f   7
g  11
h  14
 <class 'pandas.core.frame.DataFrame'>

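④ Resetting a hierarchical index

df.reset_index(): resets every level of the hierarchical index, moving each level into an ordinary column

df.reset_index(level=0): resets only the level=0 index, moving it into an ordinary column

df.reset_index(level=1): resets only the level=1 index, moving it into an ordinary column

The levels can also be reassigned through df.index.levels; see "5. DataFrame attributes - (4) df.index.levels" in this chapter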
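
A short sketch of the three reset_index() calls, assuming a small frame whose index levels are named 'i1' and 'i2' (the data and the variable names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {'c1': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_arrays(
        [['A', 'A', 'B', 'B'], ['e', 'f', 'e', 'g']],
        names=['i1', 'i2'],
    ),
)

flat = df.reset_index()              # both levels become columns; index is 0, 1, 2, ...
outer_out = df.reset_index(level=0)  # 'i1' becomes a column; 'i2' stays as the index
inner_out = df.reset_index(level=1)  # 'i2' becomes a column; 'i1' stays as the index

print(flat)
print(outer_out)
print(inner_out)
```

The level that is reset is placed in front of the existing columns, and df.set_index(['i1','i2']) on the flat result would rebuild the original index.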

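(3) Reshaping hierarchically indexed Series and DataFrames with unstack() and stack() (converting between row labels and column labels)

stack: to pile up; unstack: to take down from a pile

s.unstack(): pivots the innermost row-label level of s (vertical) into column labels (horizontal). For a Series with a hierarchical index the result is a DataFrame

s.stack(): raises an error. The Series type has no stack() method, because a Series has no column labels to pivot

df.unstack(): pivots the innermost row-label level of df (vertical) into the innermost column-label level (horizontal). The result is a DataFrame as long as at least one row-label level remains; if df has only a single-level row index, every label moves into the columns and the result is a Series

df.stack(): pivots the innermost column-label level of df (horizontal) into the innermost row-label level (vertical). If no column-label level remains afterwards (df had single-level columns), the result is a Series; otherwise it is a DataFrame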

# Example: reshaping a Series
import numpy as np
import pandas as pd
s = pd.Series([1,2,3,4,5,6,7,8],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
# s_s = s.stack()   				# raises an error
s_u = s.unstack()
s_us = s.unstack().stack()			# back to the shape of s (the values are now float64)
s_uu = s.unstack().unstack()		# swaps the inner and outer index levels
print('s\n',s,'\n',type(s)); print('===========')
# print('s_s\n',s_s,'\n',type(s_s)); print('===========')	# raises an error
print('s_u\n',s_u,'\n',type(s_u)); print('===========')
print('s_us\n',s_us,'\n',type(s_us)); print('===========')
print('s_uu\n',s_uu,'\n',type(s_uu))
Output:
s
 A  e    1
   f    2
B  e    3
   g    4
C  f    5
   h    6
D  g    7
   h    8
dtype: int64
 <class 'pandas.core.series.Series'>
===========
s_u
      e    f    g    h
A  1.0  2.0  NaN  NaN
B  3.0  NaN  4.0  NaN
C  NaN  5.0  NaN  6.0
D  NaN  NaN  7.0  8.0
 <class 'pandas.core.frame.DataFrame'>
===========
s_us
 A  e    1.0
   f    2.0
B  e    3.0
   g    4.0
C  f    5.0
   h    6.0
D  g    7.0
   h    8.0
dtype: float64
 <class 'pandas.core.series.Series'>
===========
s_uu
 e  A    1.0
   B    3.0
   C    NaN
   D    NaN
f  A    2.0
   B    NaN
   C    5.0
   D    NaN
g  A    NaN
   B    4.0
   C    NaN
   D    7.0
h  A    NaN
   B    NaN
   C    6.0
   D    8.0
dtype: float64
 <class 'pandas.core.series.Series'>
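
In the output above the values turn into float64 because unstack() has to fill the missing combinations with NaN. Two options that the example does not use are sketched below on the same s: fill_value keeps the integer dtype, and level chooses which index level is pivoted. The explicit dropna() in the round trip is a precaution, since whether stack() drops the NaN cells by itself depends on the pandas version.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8],
              index=[
                  ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
                  ['e', 'f', 'e', 'g', 'f', 'h', 'g', 'h']
              ])

# Missing (outer, inner) combinations become 0 instead of NaN, so values stay integers
wide_filled = s.unstack(fill_value=0)

# Pivot the outer level (level=0) into columns instead of the innermost one
outer_to_columns = s.unstack(level=0)

# Round trip back to the original values
round_trip = s.unstack().stack().dropna().astype('int64')

print(wide_filled)
print(outer_to_columns)
print(round_trip)
```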

# Example: reshaping a DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8],columns=['t1'],
              index=[
                  ['A','A','B','B','C','C','D','D'],
                  ['e','f','e','g','f','h','g','h']
              ])
df_s = df.stack()
df_u = df.unstack()
df_us = df.unstack().stack()		# back to the shape of df (the values are now float64)
df_uu = df.unstack().unstack()
print('df\n',df,'\n',type(df)); print('===========')
print('df_s\n',df_s,'\n',type(df_s)); print('===========')
print('df_u\n',df_u,'\n',type(df_u)); print('===========')
print('df_us\n',df_us,'\n',type(df_us)); print('===========')
print('df_uu\n',df_uu,'\n',type(df_uu))
Output:
df
      t1
A e   1
  f   2
B e   3
  g   4
C f   5
  h   6
D g   7
  h   8
 <class 'pandas.core.frame.DataFrame'>
===========
df_s
 A  e  t1    1
   f  t1    2
B  e  t1    3
   g  t1    4
C  f  t1    5
   h  t1    6
D  g  t1    7
   h  t1    8
dtype: int64
 <class 'pandas.core.series.Series'>
===========
df_u
     t1
     e    f    g    h
A  1.0  2.0  NaN  NaN
B  3.0  NaN  4.0  NaN
C  NaN  5.0  NaN  6.0
D  NaN  NaN  7.0  8.0
 <class 'pandas.core.frame.DataFrame'>
===========
df_us
       t1
A e  1.0
  f  2.0
B e  3.0
  g  4.0
C f  5.0
  h  6.0
D g  7.0
  h  8.0
 <class 'pandas.core.frame.DataFrame'>
===========
df_uu
 t1  e  A    1.0
       B    3.0
       C    NaN
       D    NaN
    f  A    2.0
       B    NaN
       C    5.0
       D    NaN
    g  A    NaN
       B    4.0
       C    NaN
       D    7.0
    h  A    NaN
       B    NaN
       C    6.0
       D    8.0
dtype: float64
 <class 'pandas.core.series.Series'>

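14. Time-related types and methods in Pandas

(1) Time types in Pandas and their special indexing and slicing forms

① pandas._libs.tslibs.timestamps.Timestamp: a timestamp (a point in time)

② pandas.core.indexes.datetimes.DatetimeIndex: an index made of timestamps

③ pandas._libs.tslibs.period.Period: a period (a span of time)

④ pandas.core.indexes.period.PeriodIndex: an index made of periods

Here ② is made up of ① values, and ④ is made up of ③ values

When a DataFrame's row index is of type ② or ④, it supports the flexible indexing and slicing forms below (note: for a type-④ row index with monthly frequency, indexing any day of a month selects that month; for a type-④ row index with yearly frequency, indexing any day of a year selects that year):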

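# Exact indexing: only df.loc[] and df.ix[] can be used
# (df.ix[] was removed in pandas 1.0 and pd.datetime in pandas 2.0; on current versions use df.loc[] with datetime.datetime or pd.Timestamp)
# df['2019-12-31']	# exact indexing cannot use the df[] form; it raises an error
# df['2019.12.31']	# exact indexing cannot use the df[] form; it raises an error
df.loc['20191231']
df.loc['2019-12-31']
df.loc[pd.datetime(2019,12,31)]
df.ix['20191231']
df.ix['2019-12-31']
df.ix[pd.datetime(2019,12,31)]
...
# Fuzzy (partial-string) indexing: df[], df.loc[] and df.ix[] can all be used
# (selecting rows with a single string such as df['2019-12'] was deprecated in pandas 1.2; prefer df.loc['2019-12'])
df['2019-12']
df['2019.12']
df['2019']
df.loc['2019-12']
df.loc['2019.12']
df.loc['2019']
df.ix['2019-12']
df.ix['2019.12']
df.ix['2019']
...
# Slicing with a mix of exact and fuzzy indexing
df['2019-08':'2019-09-22']
df.loc['2019-08':'2019-09-22']
df.ix['2019-08':'2019-09-22']
...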
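
The listing above is schematic and has no data behind it. A runnable sketch of the df.loc[] forms follows, assuming a made-up daily index covering the year 2019 and a single hypothetical column 'value' that simply counts the days.

```python
import pandas as pd

# Hypothetical data: one row per day of 2019, numbered 0..364
dates = pd.date_range('2019-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'value': range(len(dates))}, index=dates)

one_day = df.loc['2019-12-31']                # exact match: a single row, returned as a Series
by_timestamp = df.loc[pd.Timestamp('2019-12-31')]
one_month = df.loc['2019-12']                 # fuzzy match: every row in December 2019
one_year = df.loc['2019']                     # fuzzy match: every row in 2019
mixed = df.loc['2019-08':'2019-09-22']        # from 1 August through 22 September, inclusive

print(one_day)
print(len(one_month), len(one_year), len(mixed))
```

Both ends of a label slice are inclusive, and a fuzzy start such as '2019-08' expands to the first instant of that month.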

(2) pd.Timestamp()

Syntax: pd.Timestamp(n)

Returns a pandas._libs.tslibs.timestamps.Timestamp object

Parameter n: the number of nanoseconds (10^-9 seconds) elapsed since the epoch (1970-01-01 00:00:00)

import numpy as np
import pandas as pd
t1 = pd.Timestamp(0)
t2 = pd.Timestamp(1)
print(t1)
print(t2)
print(type(t2))
Output:
1970-01-01 00:00:00
1970-01-01 00:00:00.000000001
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
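
Besides a nanosecond count, pd.Timestamp() has a few other common call forms. The sketch below is illustrative only; the variable names are made up.

```python
import pandas as pd

ts_ns = pd.Timestamp(1_000_000_000)                   # 10^9 nanoseconds after the origin, i.e. 1 second
ts_s = pd.Timestamp(1, unit='s')                      # the same instant, with the unit stated explicitly
ts_str = pd.Timestamp('2019-09-22 08:30')             # parsed from a string
ts_parts = pd.Timestamp(year=2019, month=9, day=22)   # built from components

print(ts_ns, ts_s, ts_str, ts_parts, sep='\n')
```

Stating the unit explicitly avoids having to count zeros when the input is in seconds or milliseconds.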

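（3）pd.datetime()

Syntax: pd.datetime(year, month, day)

Returns a datetime.datetime object. pd.datetime was only an alias for datetime.datetime from the standard library; it was deprecated in pandas 1.0 and removed in pandas 2.0, so the example uses datetime.datetime directly, which gives the same result.

import datetime
import pandas as pd
t = datetime.datetime(2019,9,22)	# identical to pd.datetime(2019,9,22) on pandas < 2.0
print(t)
print(type(t))
Output: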
2019-09-22 00:00:00
<class 'datetime.datetime'>

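（4）pd.to_datetime()

Syntax: pd.to_datetime(a date-like value, or a list, ndarray, or Series of such values)

Converts a date-like value to pandas._libs.tslibs.timestamps.Timestamp. A list or ndarray of such values becomes a pandas.core.indexes.datetimes.DatetimeIndex, and a Series becomes a Series of dtype datetime64[ns].

A bare None is returned as None, but a None inside a list is converted to NaT (an instance of pandas._libs.tslibs.nattype.NaTType).

Use df['column label'] = pd.to_datetime(df['column label']) to convert a column of df from str to a time type.

Use df.index = pd.to_datetime(df.index) to convert the row index of df from str to a time type.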

import numpy as np
import pandas as pd
# print(pd.to_datetime('2019922'))  # raises an error
print(pd.to_datetime('20190922'),type(pd.to_datetime('20190922'))); print('-----------')
# Note: pandas 2.0 and later infer one format from the first element, so lists that mix formats (as below) may need format='mixed' there
print(pd.to_datetime(['2019-09-22','2019.09.23']))
print(pd.to_datetime(['2019-9-22','2019.9.23']))
print(pd.to_datetime(['Sept 22 2019','September 23rd, 2019'])); print('===========')
print(pd.to_datetime(None),type(pd.to_datetime(None))); print('-----------')
print(pd.to_datetime([None])); print(type(pd.to_datetime([None])));
print(pd.to_datetime([None])[0]); print(type(pd.to_datetime([None])[0])); print('===========')
print(pd.to_datetime(
    np.array(['20190922','20190923'])
)); print('-----------')
print(pd.to_datetime(
    pd.Series(['20190922','20190923'])
))
Output:
2019-09-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
-----------
DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)
===========
None <class 'NoneType'>
-----------
DatetimeIndex(['NaT'], dtype='datetime64[ns]', freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
NaT
<class 'pandas._libs.tslibs.nattype.NaTType'>
===========
DatetimeIndex(['2019-09-22', '2019-09-23'], dtype='datetime64[ns]', freq=None)
-----------
0   2019-09-22
1   2019-09-23
dtype: datetime64[ns]
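
When the input format is known, passing it explicitly is usually more predictable than letting pandas guess. The sketch below uses made-up sample data and the `format` and `errors` parameters of pd.to_datetime(); it is an illustration, not part of the original notes.

```python
import pandas as pd

# Hypothetical raw column containing two valid dates, one bad string, and one missing value
raw = pd.Series(['2019-09-22', '2019-09-23', 'not a date', None])

# errors='coerce' turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# An explicit format also removes ambiguity for compact strings
single = pd.to_datetime('20190922', format='%Y%m%d')

print(parsed)
print(single)
```

After coercion, `parsed.isna()` marks the rows that failed to parse, which makes them easy to inspect or drop.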

（5）pd.DatetimeIndex()

Takes a one-dimensional list of date-like values, converts each element to pandas._libs.tslibs.timestamps.Timestamp, and returns the whole as a pandas.core.indexes.datetimes.DatetimeIndex.

A None in the list is converted to NaT (an instance of pandas._libs.tslibs.nattype.NaTType).

Use df['column label'] = pd.DatetimeIndex(df['column label']) to convert a column of df from str to a time type.

import datetime
import numpy as np
import pandas as pd
dti1 = pd.DatetimeIndex(['20190101','20190102',None])
dti2 = pd.DatetimeIndex(['2019-01-01','2019-01-02'])
dti3 = pd.DatetimeIndex(['Jan 1,2019','January 2nd, 2019'])
dti4 = pd.DatetimeIndex([datetime.datetime(2019,1,1),datetime.datetime(2019,1,2)])
dti5 = pd.DatetimeIndex([pd.Timestamp(2019,1,1),pd.Timestamp(2019,1,2)])	# originally pd.datetime(...), which was removed in pandas 2.0
dti6 = pd.DatetimeIndex([pd.Timestamp(0),pd.Timestamp(1e18)])
print(dti1,'\n',type(dti1[0]))
print(dti2,'\n',type(dti2[0]))
print(dti3,'\n',type(dti3[0]))
print(dti4,'\n',type(dti4[0]))
print(dti5,'\n',type(dti5[0]))
print(dti6,'\n',type(dti6[0]))
Output:
DatetimeIndex(['2019-01-01', '2019-01-02', 'NaT'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['2019-01-01', '2019-01-02'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
DatetimeIndex(['1970-01-01 00:00:00', '2001-09-09 01:46:40'], dtype='datetime64[ns]', freq=None)
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>

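（6）pd.date_range()

Syntax: pd.date_range(start=None, end=None, periods=None, freq='D')

Generates a pandas.core.indexes.datetimes.DatetimeIndex made up of pandas._libs.tslibs.timestamps.Timestamp objects.

Parameters:

start: start date
end: end date
periods: length (number of values)

freq: frequency (the interval between adjacent values), default 'D' (1 day). Other forms include '30S' (30 seconds), '5T' (5 minutes), '2H' (2 hours), '3D' (3 days), '2W' (2 weeks), 'M' (last day of each month), 'MS' (first day of each month), and 'Y' (last day of each year). 'B' stands for business days, which here simply means Monday to Friday; public holidays are not taken into account. Note that pandas 2.2 renamed several of these aliases ('S' to 's', 'T' to 'min', 'H' to 'h', 'M' to 'ME', 'Y' to 'YE'), and the old spellings emit a deprecation warning there. More complex values for this parameter: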

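Alias	Meaning
W-MON	weekly, on Mondays
WOM-1MON	monthly, on the first Monday of the month
Q-JAN	quarterly, on quarter ends, with the year ending on the last day of January (JAN can be replaced by FEB, MAR, ...)
QS-JAN	quarterly, on quarter starts, with the year starting on the first day of January (JAN can be replaced by FEB, MAR, ...)
A-JAN	yearly, on the last calendar day of January (JAN can be replaced by FEB, ..., DEC)
AS-JAN	yearly, on the first calendar day of January (JAN can be replaced by FEB, ..., DEC)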
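
A few of the anchored aliases can be checked quickly. This sketch only uses aliases whose spelling has stayed the same across pandas versions ('W-MON', 'WOM-1MON', 'QS-JAN', 'B'); the variable names are illustrative.

```python
import pandas as pd

mondays = pd.date_range('2019-09-01', periods=3, freq='W-MON')            # every Monday
first_mondays = pd.date_range('2019-09-01', periods=3, freq='WOM-1MON')   # first Monday of each month
quarter_starts = pd.date_range('2019-01-01', periods=4, freq='QS-JAN')    # first day of each quarter
business_days = pd.date_range('2019-09-20', periods=3, freq='B')          # Monday to Friday only

for name, rng in [('W-MON', mondays), ('WOM-1MON', first_mondays),
                  ('QS-JAN', quarter_starts), ('B', business_days)]:
    print(name, list(rng.strftime('%Y-%m-%d')))
```

Because 2019-09-20 is a Friday, the business-day range skips the weekend and continues on Monday 2019-09-23.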

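Notes:

freq defaults to 'D'; of the three parameters start, end, and periods, at least two must be given, otherwise an error is raised.
pd.date_range() is often assigned to df.index to generate row labels, e.g.:
```
...
df.index = pd.date_range('2019-9-22', periods=5, freq='M')
```
The DatetimeIndex generated by pd.date_range() can be indexed by position, e.g.:
```
...
t = pd.date_range('2019-9-22', periods=5, freq='M')
print(t[0])
```
An element of the result cannot be tested for equality directly against a string; compare it with a datetime.datetime or pd.Timestamp object instead (pd.datetime(), the alias used in older pandas, was removed in pandas 2.0). For a DataFrame whose row labels come from pd.date_range(), however, label indexing works with either a datetime object or a string.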

Code example:

import datetime
import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore') # silence warnings; a warning is not an error and can be ignored; situations that may warn include df.ix[] and pd.concat() on old pandas, and the 'M' and 'Y' aliases on pandas 2.2 and later
t1 = pd.date_range('2019-9-22', periods=2, freq='3D')
t2 = pd.date_range('2019-9-22', periods=2, freq='2W')
t3 = pd.date_range('2019-9-22', periods=3, freq='M')
t4 = pd.date_range('2019-9-22', periods=3, freq='Y')
print('t1','\n',t1,'\n',type(t1),'\n',t1[0],'\n',type(t1[0])); print('-----------')
print('t2','\n',t2); print('-----------')
print('t3','\n',t3); print('-----------')
print('t4','\n',t4); print('===========')
print(t1[0] == '2019-9-22')                     		        # wrong way to compare
print(t1[0] == '2019-09-22')                       		        # wrong way to compare
print(t1[0] == datetime.datetime(2019,9,22))					# correct way to compare (originally pd.datetime, removed in pandas 2.0)
Output:
t1
 DatetimeIndex(['2019-09-22', '2019-09-25'], dtype='datetime64[ns]', freq='3D')
 <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
 2019-09-22 00:00:00
 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
-----------
t2
 DatetimeIndex(['2019-09-22', '2019-10-06'], dtype='datetime64[ns]', freq='2W-SUN')
-----------
t3
 DatetimeIndex(['2019-09-30', '2019-10-31', '2019-11-30'], dtype='datetime64[ns]', freq='M')
-----------
t4
 DatetimeIndex(['2019-12-31', '2020-12-31', '2021-12-31'], dtype='datetime64[ns]', freq='A-DEC')
===========
False
False
True
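
For the comparison above, a pd.Timestamp built from the string is a convenient alternative to a datetime object. The following is a small sketch with illustrative variable names.

```python
import pandas as pd

t = pd.date_range('2019-09-22', periods=2, freq='3D')

# Convert the string to a Timestamp first, then compare
is_equal = t[0] == pd.Timestamp('2019-09-22')

# Membership test and position lookup against the whole index
contained = pd.Timestamp('2019-09-25') in t
position = t.get_loc(pd.Timestamp('2019-09-25'))

print(is_equal, contained, position)
```

Parsing the string once and comparing Timestamp to Timestamp keeps the comparison independent of how the date happens to be written.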

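（7）pd.period_range()

Syntax: pd.period_range(start=None, end=None, periods=None, freq='D')

Generates a pandas.core.indexes.period.PeriodIndex made up of pandas._libs.tslibs.period.Period objects.

Parameters:

start: start date
end: end date
periods: length (number of values)
freq: frequency (the interval between adjacent values), default 'D' (1 day). Other forms include '30S' (30 seconds), '5T' (5 minutes), '2H' (2 hours), '3D' (3 days), 'W-MON' (weekly, anchored on Monday), '2W' (2 weeks), 'M' (1 month), and 'Y' (1 year). 'B' stands for business days, which here simply means Monday to Friday; public holidays are not taken into account. With a monthly frequency the generated values contain only year and month; with a yearly frequency they contain only the year.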

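Notes:

freq defaults to 'D'; of the three parameters start, end, and periods, at least two must be given, otherwise an error is raised.
pd.period_range() is often assigned to df.index to generate row labels, e.g.:
```
...
df.index = pd.period_range('2019-9-22', periods=5, freq='W')
```
The PeriodIndex generated by pd.period_range() can be indexed by position, e.g.:
```
...
p = pd.period_range('2019-9-22', periods=5, freq='M')
print(p[0])
```
An element of the result cannot be tested for equality directly against a string, nor against a datetime.datetime object (which is what pd.datetime() returned before it was removed in pandas 2.0). For a DataFrame whose row labels are a pandas.core.indexes.period.PeriodIndex, however, label indexing works with either a datetime object or a string. (With monthly frequency, indexing any day of a month selects that month; with yearly frequency, indexing any day of a year selects that year.)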

Code example:

# An element cannot be compared for equality with a string, nor with a datetime object
import datetime
import numpy as np
import pandas as pd
p1 = pd.period_range('2019-9-22', periods=2, freq='3D')
p2 = pd.period_range('2019-9-22', periods=2, freq='2W')
p3 = pd.period_range('2019-9-22', periods=3, freq='M')
p4 = pd.period_range('2019-9-22', periods=3, freq='Y')
print('p1','\n',p1,'\n',type(p1),'\n',p1[0],'\n',type(p1[0])); print('-----------')
print('p2','\n',p2); print('-----------')
print('p3','\n',p3); print('-----------')
print('p4','\n',p4); print('===========')
print(p1[0] == '2019-9-22')					# wrong way to compare
print(p1[0] == '2019-09-22')				# wrong way to compare
print(p1[0] == datetime.datetime(2019,9,22))		# wrong way to compare (originally pd.datetime, removed in pandas 2.0)
Output:
p1
 PeriodIndex(['2019-09-22', '2019-09-25'], dtype='period[3D]', freq='3D')
 <class 'pandas.core.indexes.period.PeriodIndex'>
 2019-09-22
 <class 'pandas._libs.tslibs.period.Period'>
-----------
p2
 PeriodIndex(['2019-09-16/2019-09-22', '2019-09-30/2019-10-06'], dtype='period[2W-SUN]', freq='2W-SUN')
-----------
p3
 PeriodIndex(['2019-09', '2019-10', '2019-11'], dtype='period[M]', freq='M')
-----------
p4
 PeriodIndex(['2019', '2020', '2021'], dtype='period[A-DEC]', freq='A-DEC')
===========
False
False
False
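
All three comparisons above return False. One way that does work is to compare against another Period with the same frequency. The sketch below uses a plain daily frequency and illustrative variable names.

```python
import pandas as pd

p = pd.period_range('2019-09-22', periods=3, freq='D')
first = p[0]

# A Period compares equal to another Period that covers the same span at the same frequency
is_equal = first == pd.Period('2019-09-22', freq='D')

# A Period is a span of time, so it has a start and an end
month = pd.Period('2019-09', freq='M')
print(is_equal)
print(first.start_time, first.end_time)
print(month.start_time, month.end_time)
```

The start_time and end_time attributes show why a Period is not the same thing as a single timestamp: the monthly period runs from the first instant of 1 September to the last instant of 30 September.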

# Indexing example with daily frequency
import datetime
import numpy as np
import pandas as pd
np.random.seed(0)
arr = np.random.randn(5,2)
p = pd.period_range('2019-9-22', periods=5, freq='D')		# daily frequency
df = pd.DataFrame(arr, columns=['c1','c2'], index=p)
print(df); print('===========')
print(df.loc['2019-9-22']); print('===========')
print(df.loc[datetime.datetime(2019,9,22)])		# originally pd.datetime, removed in pandas 2.0
Output:
                  c1        c2
2019-09-22  1.764052  0.400157
2019-09-23  0.978738  2.240893
2019-09-24  1.867558 -0.977278
2019-09-25  0.950088 -0.151357
2019-09-26 -0.103219  0.410599
===========
c1    1.764052
c2    0.400157
Name: 2019-09-22, dtype: float64
===========
c1    1.764052
c2    0.400157
Name: 2019-09-22, dtype: float64

# Indexing example with monthly frequency
import datetime
import numpy as np
import pandas as pd
np.random.seed(0)
arr = np.random.randn(5,2)
p = pd.period_range('2019-9-22', periods=5, freq='M')      # monthly frequency
df = pd.DataFrame(arr, columns=['c1','c2'], index=p)
print(df); print('-----------')
print(df.loc['20190922']); print('-----------')
print(df.loc['2019-9-23']); print('-----------')
print(df.loc[datetime.datetime(2019,9,22)]); print('-----------')		# originally pd.datetime, removed in pandas 2.0
print(df.loc[datetime.datetime(2019,9,23)])
Output:
               c1        c2
2019-09  1.764052  0.400157
2019-10  0.978738  2.240893
2019-11  1.867558 -0.977278
2019-12  0.950088 -0.151357
2020-01 -0.103219  0.410599
-----------
c1    1.764052
c2    0.400157
Name: 2019-09, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019-09, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019-09, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019-09, dtype: float64

# Indexing example with yearly frequency
import datetime
import numpy as np
import pandas as pd
np.random.seed(0)
arr = np.random.randn(5,2)
p = pd.period_range('2019-9-22', periods=5, freq='Y')      # yearly frequency
df = pd.DataFrame(arr, columns=['c1','c2'], index=p)
print(df); print('-----------')
print(df.loc['20190922']); print('-----------')
print(df.loc['2019-9-23']); print('-----------')
print(df.loc[datetime.datetime(2019,9,22)]); print('-----------')		# originally pd.datetime, removed in pandas 2.0
print(df.loc[datetime.datetime(2019,9,23)])
Output:
            c1        c2
2019  1.764052  0.400157
2020  0.978738  2.240893
2021  1.867558 -0.977278
2022  0.950088 -0.151357
2023 -0.103219  0.410599
-----------
c1    1.764052
c2    0.400157
Name: 2019, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019, dtype: float64
-----------
c1    1.764052
c2    0.400157
Name: 2019, dtype: float64

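（8）pd.date_range() compared with pd.period_range()

What they have in common:

The return value can be used as the row label index of a DataFrame and supports the special date-based indexing described in （1）
The return value can be used as a column of data in a DataFrame

How they differ:

The values shown for freq='M' and freq='Y' differ (month-end and year-end dates versus a bare month or year)
The data types differ (DatetimeIndex versus PeriodIndex)
Because the data types differ, a few attributes and methods differ as well (not covered here)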
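
The two index types can also be converted into each other with to_period() and to_timestamp(), which makes the difference between them concrete. This is a minimal sketch with illustrative variable names.

```python
import pandas as pd

stamps = pd.date_range('2019-09-22', periods=3, freq='D')   # DatetimeIndex: three points in time
periods = stamps.to_period('M')                             # PeriodIndex: the month each point falls in
back = periods.to_timestamp()                               # DatetimeIndex again, at the start of each period

print(stamps)
print(periods)
print(back)
```

Converting to monthly periods discards the day, so converting back yields the first day of the month rather than the original dates.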

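（9）df.resample(): resampling

resample: v. to sample again at a different frequency

Resamples (changes the frequency of) a DataFrame whose row index is a pandas.core.indexes.datetimes.DatetimeIndex or a pandas.core.indexes.period.PeriodIndex. The steps are:

First, obtain the data as a DataFrame df
Next, call resample_obj = df.resample(rule,axis=0,closed=None) to get a resample object (of type pandas.core.resample.DatetimeIndexResampler)

Parameters:
- rule: the target frequency, e.g. 'S' (second), 'T' or 'min' (minute), 'H' (hour), 'D' (day), 'W' (week), 'M' (month), 'Q' (quarter), 'A' or 'Y' (year); a number can be put in front of the letter, e.g. '3D' (3 days). On pandas 2.2 and later several of these spellings are deprecated in favour of 's', 'min', 'h', 'ME', 'QE', and 'YE'
- axis: default 0, process column by column; axis=1 processes row by row (normally there is no need to set this)
- closed: which side of each time interval is closed, 'left' or 'right' (normally there is no need to set this)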
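Finally, apply the appropriate method of the resample object, for example:
- Methods for going from high frequency to low frequency (downsampling):
  - resample_obj.mean(): aggregate with the mean of all data in each time bin
  - resample_obj.max(): aggregate with the maximum of all data in each time bin (typical for the high price)
  - resample_obj.min(): aggregate with the minimum of all data in each time bin (typical for the low price)
  - resample_obj.median(): aggregate with the median of all data in each time bin
  - resample_obj.sum(): aggregate with the sum of all data in each time bin (typical for trading volume)
  - resample_obj.prod(): aggregate with the product of all data in each time bin
  - resample_obj.std(): aggregate with the standard deviation of all data in each time bin
  - resample_obj.var(): aggregate with the variance of all data in each time bin
  - resample_obj.count(): aggregate with the count of non-null data in each time bin
  - resample_obj.first(): aggregate with the first value in each time bin (typical for the opening price)
  - resample_obj.last(): aggregate with the last value in each time bin (typical for the closing price)
  - resample_obj.nunique(): aggregate with the number of distinct values in each time bin
  - resample_obj.asfreq(): take the value found at the displayed date (for example, when daily data is downsampled to monthly data the displayed date is the last day of each month and that day's value is used; if the last day of the month is not a trading day and has no data, the result is NaN)
  - resample_obj.ohlc(): aggregate with the open, high, low, and close of all data in each time bin
  - resample_obj.apply(<func>): use a custom aggregation function; for a detailed explanation of apply() see "10. Methods of DataFrame objects and of the Pandas module -（5）Other methods" in this chapter; example code appears later in this section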
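- Methods for going from low frequency to high frequency (upsampling):
  - resample_obj.ffill(): fill empty values forward
  - resample_obj.pad(): fill empty values forward (an older alias of ffill(); removed in pandas 2.0, so prefer ffill())
  - resample_obj.bfill(): fill empty values backward
  - resample_obj.fillna(): fill empty values with the fillna() method (deprecated in recent pandas releases in favour of ffill() and bfill())
  - Linear interpolation:
    - resample_obj.interpolate(): fill the empty values between two data points by linear interpolation; simple and recommended
    - df.interpolate(): it is also possible to skip the resample object, insert the required number of empty rows between two data points by hand, and then fill them with the DataFrame's own interpolation, see "9. Handling empty values (NaN) in a DataFrame -（5）df.interpolate()" in this chapter. This is more complicated and not recommended
  - resample_obj.apply(<func>): use a custom interpolation function; each call to func() receives the Series or DataFrame made up of the data in the corresponding time bin, see "10. Methods of DataFrame objects and of the Pandas module -（5）Other important methods - ①df.apply()" in this chapter

Note: the how parameter in df.resample('M', how='mean'), as mentioned in the course handout, was deprecated long ago and no longer exists in current pandas; use the methods listed above instead.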
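
The downsampling methods listed above can be tried without a data file. This is a minimal sketch on synthetic data: the series `price` is made up, and weekly bins ('W') are used because that alias is spelled the same in all pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical daily "close" series: values 1..14 from Monday 2019-09-02 to Sunday 2019-09-15
idx = pd.date_range('2019-09-02', periods=14, freq='D')
price = pd.Series(np.arange(1.0, 15.0), index=idx, name='close')

weekly = price.resample('W')     # weekly bins, each labelled with the Sunday that ends it

summary = weekly.agg(['first', 'max', 'min', 'last', 'mean'])   # several aggregations at once
ohlc = weekly.ohlc()                                            # open / high / low / close per bin
spread = weekly.apply(lambda x: x.max() - x.min())              # custom aggregation function

print(summary)
print(ohlc)
print(spread)
```

Each function passed to apply() receives the data of one bin, so any reduction that returns a single value per bin can be plugged in the same way.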

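① High frequency to low frequency (downsampling): done by aggregation

Note: after aggregation each row is labelled with the end of its time bin. When daily data is resampled by month, the index becomes the last calendar day of each month, whether or not the original data has a row for that day.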

# Example of resample_obj.mean() and resample_obj.apply(<func>) on returns
import pandas as pd
import numpy as np
# Read the local file '000001.csv'
data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)
# Compute the daily return from the closing price (as a ratio: today's close / previous close)
data_return = data['close'] / data['close'].shift()
# Get a resample object with monthly frequency
resample_obj = data_return.resample('M')
print(resample_obj,'\n',type(resample_obj)); print('===========')
# Aggregate with the mean of all data in each month (the two forms are equivalent)
print(resample_obj.mean()); print('-----------')
print(resample_obj.apply(lambda x: x.mean())); print('===========')     # with a single column x is a Series; with several columns x is a DataFrame
# Check that the aggregated results are correct
print(data_return['2017-03'].mean().round(6)); print('===========')
print(data_return['2019-07'].mean().round(6))
Output:
DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]
 <class 'pandas.core.resample.DatetimeIndexResampler'>
===========
date
2017-02-28    1.005302
2017-03-31    0.998575
...
2019-07-31    1.001199
2019-08-31    1.000752
Freq: M, Name: close, dtype: float64
-----------
date
2017-02-28    1.005302
2017-03-31    0.998575
...
2019-07-31    1.001199
2019-08-31    1.000752
Freq: M, Name: close, dtype: float64
===========
0.998575
===========
1.001199

# Example of resample_obj.ohlc() on closing prices
import pandas as pd
import numpy as np
# Read the local file '000001.csv'
data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)
# Get the daily closing price
data_close = data['close']
# Get a resample object with a frequency of 1 month
resample_obj = data_close.resample('M')
print(resample_obj,'\n',type(resample_obj)); print('===========')
# Aggregate with the open, high, low, and close of all data in each time bin
print(resample_obj.ohlc()); print('===========')
# Check that the OHLC results are correct
print(data_close['2017-03-01'])
print(data_close['2017-04'].max())
print(data_close['2019-06'].min())
print(data_close['2019-07-31'])
Output:
DatetimeIndexResampler [freq=<MonthEnd>, axis=0, closed=right, label=right, convention=start, base=0]
 <class 'pandas.core.resample.DatetimeIndexResampler'>
===========
             open   high    low  close
date
2017-02-28   9.43   9.48   9.43   9.48
2017-03-31   9.49   9.52   9.08   9.17
2017-04-30   9.21   9.21   8.91   8.99
...
2019-06-30  11.90  13.80  11.85  13.78
2019-07-31  13.93  14.37  13.54  14.13
2019-08-31  14.10  15.12  13.35  14.25
===========
9.49
9.21
11.85
14.13

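② Low frequency to high frequency (upsampling): done by linear interpolation

Note: with resample_obj.interpolate(), each original row keeps its own timestamp, which is the first point of its interval in the new index (if daily data is resampled by hour, each day's actual value becomes the value at 00:00:00 of that day).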

# Example of resample_obj.interpolate() on closing prices
import pandas as pd
import numpy as np
# Read the local file '000001.csv'
data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)
# Get two consecutive closing prices (the 2nd and 3rd rows, since [1:3] skips row 0)
data_close = data['close'][1:3]
print(data_close); print('===========')
# Get a resample object with a frequency of 1 hour
resample_obj = data_close.resample('H')
print(resample_obj,'\n',type(resample_obj)); print('===========')
# Fill the missing data in between by linear interpolation
print(resample_obj.interpolate())
Output:
date
2017-02-28    9.48
2017-03-01    9.49
Name: close, dtype: float64
===========
DatetimeIndexResampler [freq=<Hour>, axis=0, closed=left, label=left, convention=start, base=0]
 <class 'pandas.core.resample.DatetimeIndexResampler'>
===========
date
2017-02-28 00:00:00    9.480000
2017-02-28 01:00:00    9.480417
2017-02-28 02:00:00    9.480833
2017-02-28 03:00:00    9.481250
...
2017-02-28 21:00:00    9.488750
2017-02-28 22:00:00    9.489167
2017-02-28 23:00:00    9.489583
2017-03-01 00:00:00    9.490000
Freq: H, Name: close, dtype: float64
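
The same upsampling can be reproduced without the CSV file. The sketch below rebuilds the two closing prices shown above by hand and uses a 6-hour step; the rule is passed as a pd.Timedelta, which is an assumption made here to avoid the hour alias, whose spelling differs between pandas versions.

```python
import pandas as pd

# The two data points from the example above, typed in by hand
idx = pd.to_datetime(['2017-02-28', '2017-03-01'])
close = pd.Series([9.48, 9.49], index=idx, name='close')

step = pd.Timedelta(hours=6)                       # target frequency: one row every 6 hours
interpolated = close.resample(step).interpolate()  # linear interpolation between the two points
forward_filled = close.resample(step).ffill()      # repeat the last known value instead

print(interpolated)
print(forward_filled)
```

Interpolation spreads the 0.01 difference evenly over the four 6-hour steps, while forward filling keeps 9.48 until the next real observation arrives.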

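（10）df.rolling(): rolling time window

Applies a rolling time window to DataFrame or Series data. The steps are:

First, obtain the data as a DataFrame df (or a Series s)
Next, call rolling_obj = df.rolling(window, min_periods=None, center=False) to get a rolling object (of type pandas.core.window.Rolling; current versions report it as pandas.core.window.rolling.Rolling). Parameters:
- window: the window size, i.e. how many data points the window contains; required (an int in these notes)
- min_periods: the minimum number of data points needed for the window to produce a non-NaN result near the edges. The default is None, in which case the window must contain window data points before a result is computed, otherwise NaN is shown; once min_periods is set, a result is computed as soon as the window contains min_periods data points.
- center: whether the window label is centered. The default is False, in which case the window is labelled with its last time point (the row label of the last row in the window); with center=True the window is labelled with its middle time point (the row label of the middle row in the window)
Finally, apply the appropriate method of the rolling object, for example:
- rolling_obj.mean(): use the mean of all data in the window as the value at the window label (simple moving average, SMA)
- rolling_obj.max(): use the maximum of all data in the window as the value at the window label
- rolling_obj.min(): use the minimum of all data in the window as the value at the window label
- rolling_obj.sum(): use the sum of all data in the window as the value at the window label
- rolling_obj.std(): use the standard deviation of all data in the window as the value at the window label
- rolling_obj.apply(<func>): write a custom function func, pass all data in the window to func as its argument, and use func's return value as the value at the window label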
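
The effect of min_periods and center is easiest to see on a very short series. This is a minimal sketch on made-up data (the values 1 to 5); the variable names are illustrative.

```python
import pandas as pd

# Hypothetical five-day series with values 1..5
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range('2019-09-02', periods=5, freq='D'))

sma3 = s.rolling(3).mean()                         # needs 3 points: the first two results are NaN
sma3_min1 = s.rolling(3, min_periods=1).mean()     # computes as soon as 1 point is available
sma3_center = s.rolling(3, center=True).mean()     # label sits on the middle row: NaN at both ends
range3 = s.rolling(3).apply(lambda w: w.max() - w.min())   # custom function: high minus low in the window

print(pd.DataFrame({'sma3': sma3, 'sma3_min1': sma3_min1,
                    'sma3_center': sma3_center, 'range3': range3}))
```

Putting the four results side by side shows that center=True shifts the same averages one row earlier, while min_periods only changes what happens at the start of the series.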

# Rolling time window example
import pandas as pd
import numpy as np
# Read the local file '000001.csv'
data = pd.read_csv('000001.csv', index_col=0, parse_dates=True)
# Get the daily closing price
data_close = data['close']
# Get rolling objects with different parameters
rolling_obj3 = data_close.rolling(3)
rolling_obj32 = data_close.rolling(3, min_periods=2)
rolling_obj31 = data_close.rolling(3, min_periods=1)
rolling_obj3c = data_close.rolling(3, center=True)
# Apply the rolling time window with each rolling object
print('data_close\n',data_close); print('===========')
print(rolling_obj3,type(rolling_obj3)); print('===========')
print('rolling_obj3.mean()\n',rolling_obj3.mean()); print('===========')
print('rolling_obj32.mean()\n',rolling_obj32.mean()); print('===========')
print('rolling_obj31.mean()\n',rolling_obj31.mean()); print('===========')
print('rolling_obj3c.mean()\n',rolling_obj3c.mean())
Output:
data_close
 date
2017-02-27     9.43
2017-02-28     9.48
2017-03-01     9.49
2017-03-02     9.43
2017-03-03     9.40
              ...
2019-08-20    14.99
2019-08-21    14.45
2019-08-22    14.31
2019-08-23    14.65
2019-08-26    14.25
Name: close, Length: 613, dtype: float64
===========
Rolling [window=3,center=False,axis=0] <class 'pandas.core.window.Rolling'>
===========
rolling_obj3.mean()
 date
2017-02-27          NaN
2017-02-28          NaN
2017-03-01     9.466667
2017-03-02     9.466667
2017-03-03     9.440000
                ...
2019-08-20    14.936667
2019-08-21    14.786667
2019-08-22    14.583333
2019-08-23    14.470000
2019-08-26    14.403333
Name: close, Length: 613, dtype: float64
===========
rolling_obj32.mean()
 date
2017-02-27          NaN
2017-02-28     9.455000
2017-03-01     9.466667
2017-03-02     9.466667
2017-03-03     9.440000
                ...
2019-08-20    14.936667
2019-08-21    14.786667
2019-08-22    14.583333
2019-08-23    14.470000
2019-08-26    14.403333
Name: close, Length: 613, dtype: float64
===========
rolling_obj31.mean()
 date
2017-02-27     9.430000
2017-02-28     9.455000
2017-03-01     9.466667
2017-03-02     9.466667
2017-03-03     9.440000
                ...
2019-08-20    14.936667
2019-08-21    14.786667
2019-08-22    14.583333
2019-08-23    14.470000
2019-08-26    14.403333
Name: close, Length: 613, dtype: float64
===========
rolling_obj3c.mean()
 date
2017-02-27          NaN
2017-02-28     9.466667
2017-03-01     9.466667
2017-03-02     9.440000
2017-03-03     9.426667
                ...
2019-08-20    14.786667
2019-08-21    14.583333
2019-08-22    14.470000
2019-08-23    14.403333
2019-08-26          NaN
Name: close, Length: 613, dtype: float64
