Data Visualization Fundamentals (6): Pandas Basics (5): Indexing and Selecting Data (Lookup)

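1. Introduction

How to slice, dice, and generally get and set subsets of pandas objects.

2. Different choices for indexing

Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. pandas now supports three types of multi-axis indexing.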

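.loc is primarily label based, but may also be used with a boolean array. .loc raises KeyError when the items are not found. Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see "Slicing with labels").
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above). New in version 0.18.1.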

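See more at "Selection by label".

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc raises IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with integers, e.g. 1:7.
- A boolean array.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above). New in version 0.18.1.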

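See more at "Selection by position", Advanced Indexing, and Advanced Hierarchical.
.loc, .iloc, and also [] indexing can accept a callable as indexer. See more at "Selection by callable".

Getting values from an object with multi-axis selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). Any of the axis accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].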

Object type	Indexers
Series	s.loc[indexer]
DataFrame	df.loc[row_indexer, column_indexer]
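The difference between label-based and position-based access is easiest to see when integer labels do not coincide with positions. The following is a minimal sketch; the Series and DataFrame contents are made up for illustration and do not come from the original text.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
s = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = s.loc[0]       # the element labelled 0
by_position = s.iloc[0]   # the first element, whatever its label

# Hypothetical DataFrame: row indexer first, then column indexer.
df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]}, index=[2, 0, 1])

cell_by_label = df.loc[0, 'x']      # row labelled 0, column 'x'
cell_by_position = df.iloc[0, 0]    # first row, first column
row_all_columns = df.loc[0]         # omitted axis is treated as ':'

print(by_label, by_position, cell_by_label, cell_by_position)
```

Here `df.loc[0]` and `df.loc[0, :]` select the same row, which is the "left-out axes are assumed to be `:`" rule from the paragraph above.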

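3. Basics

As mentioned when introducing the data structures in the previous section, the primary function of indexing with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing pandas objects with []:

Object type	Selection	Return value type
Series	series[label]	scalar value
DataFrame	frame[colname]	Series corresponding to colname

Here we construct a simple time series data set to use for illustrating the indexing functionality: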

In [1]: dates = pd.date_range('1/1/2000', periods=8)

In [2]: df = pd.DataFrame(np.random.randn(8, 4),

   ...:                   index=dates, columns=['A', 'B', 'C', 'D'])

   ...: 

In [3]: df

Out[3]:

                   A         B         C         D

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

2000-01-02  1.212112 -0.173215  0.119209 -1.044236

2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

2000-01-04  0.721555 -0.706771 -1.039575  0.271860

2000-01-05 -0.424972  0.567020  0.276232 -1.087401

2000-01-06 -0.673690  0.113648 -1.478427  0.524988

2000-01-07  0.404705  0.577046 -1.715002 -1.039268

2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

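None of the indexing functionality is time series specific unless specifically stated.

Thus, as per above, we have the most basic indexing using []: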

In [4]: s = df['A']

In [5]: s[dates[5]]

Out[5]: -0.6736897080883706

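You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner: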

In [6]: df

Out[6]:

                   A         B         C         D

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

2000-01-02  1.212112 -0.173215  0.119209 -1.044236

2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

2000-01-04  0.721555 -0.706771 -1.039575  0.271860

2000-01-05 -0.424972  0.567020  0.276232 -1.087401

2000-01-06 -0.673690  0.113648 -1.478427  0.524988

2000-01-07  0.404705  0.577046 -1.715002 -1.039268

2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [7]: df[['B', 'A']] = df[['A', 'B']]

In [8]: df

Out[8]:

                   A         B         C         D

2000-01-01 -0.282863  0.469112 -1.509059 -1.135632

2000-01-02 -0.173215  1.212112  0.119209 -1.044236

2000-01-03 -2.104569 -0.861849 -0.494929  1.071804

2000-01-04 -0.706771  0.721555 -1.039575  0.271860

2000-01-05  0.567020 -0.424972  0.276232 -1.087401

2000-01-06  0.113648 -0.673690 -1.478427  0.524988

2000-01-07  0.577046  0.404705 -1.715002 -1.039268

2000-01-08 -1.157892 -0.370647 -1.344312  0.844885

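You may find this useful for applying a transform (in-place) to a subset of the columns.

Attribute access

You may access an index on a Series or a column on a DataFrame directly as an attribute: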

In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))

In [15]: dfa = df.copy()

In [16]: sa.b

Out[16]: 2

In [17]: dfa.A

Out[17]:

2000-01-01    0.469112

2000-01-02    1.212112

2000-01-03   -0.861849

2000-01-04    0.721555

2000-01-05   -0.424972

2000-01-06   -0.673690

2000-01-07    0.404705

2000-01-08   -0.370647

Freq: D, Name: A, dtype: float64

In [18]: sa.a = 5

In [19]: sa

Out[19]:

a    5

b    2

c    3

dtype: int64

In [20]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists

In [21]: dfa

Out[21]:

            A         B         C         D

2000-01-01  0 -0.282863 -1.509059 -1.135632

2000-01-02  1 -0.173215  0.119209 -1.044236

2000-01-03  2 -2.104569 -0.494929  1.071804

2000-01-04  3 -0.706771 -1.039575  0.271860

2000-01-05  4  0.567020  0.276232 -1.087401

2000-01-06  5  0.113648 -1.478427  0.524988

2000-01-07  6  0.577046 -1.715002 -1.039268

2000-01-08  7 -1.157892 -1.344312  0.844885

In [22]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column

In [23]: dfa

Out[23]:

            A         B         C         D

2000-01-01  0 -0.282863 -1.509059 -1.135632

2000-01-02  1 -0.173215  0.119209 -1.044236

2000-01-03  2 -2.104569 -0.494929  1.071804

2000-01-04  3 -0.706771 -1.039575  0.271860

2000-01-05  4  0.567020  0.276232 -1.087401

2000-01-06  5  0.113648 -1.478427  0.524988

2000-01-07  6  0.577046 -1.715002 -1.039268

2000-01-08  7 -1.157892 -1.344312  0.844885

If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.

You can also assign a dict to a row of a DataFrame:

In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [25]: x.iloc[1] = {'x': 9, 'y': 99}

In [26]: x

Out[26]:

   x   y

0  1   3

1  9  99

2  3   5

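You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning: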

In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})

In [2]: df.two = [4, 5, 6]

UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access

In [3]: df

Out[3]:

   one

0  1.0

1  2.0

2  3.0
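The pitfall above can be checked directly. This is a small sketch, assuming a pandas version that emits the UserWarning described in the text; the warning is silenced here only to keep the output clean.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore', UserWarning)
    df.two = [4, 5, 6]   # sets a plain Python attribute, not a column

# The frame still has only the original column.
attribute_only = 'two' not in df.columns

# Bracket assignment is the form that actually creates a column.
df['two'] = [4, 5, 6]

print(attribute_only, list(df.columns))
```

For creating columns, the bracket form `df['two'] = ...` is the one to rely on; attribute access is best kept for reading or modifying columns that already exist.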

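Slicing ranges

The most robust and consistent way of slicing ranges along arbitrary axes is described in the "Selection by position" section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels: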

In [27]: s[:5]

Out[27]:

2000-01-01    0.469112

2000-01-02    1.212112

2000-01-03   -0.861849

2000-01-04    0.721555

2000-01-05   -0.424972

Freq: D, Name: A, dtype: float64

In [28]: s[::2]

Out[28]:

2000-01-01    0.469112

2000-01-03   -0.861849

2000-01-05   -0.424972

2000-01-07    0.404705

Freq: 2D, Name: A, dtype: float64

In [29]: s[::-1]

Out[29]:

2000-01-08   -0.370647

2000-01-07    0.404705

2000-01-06   -0.673690

2000-01-05   -0.424972

2000-01-04    0.721555

2000-01-03   -0.861849

2000-01-02    1.212112

2000-01-01    0.469112

Freq: -1D, Name: A, dtype: float64

Note that setting works as well:

In [30]: s2 = s.copy()

In [31]: s2[:5] = 0

In [32]: s2

Out[32]:

2000-01-01    0.000000

2000-01-02    0.000000

2000-01-03    0.000000

2000-01-04    0.000000

2000-01-05    0.000000

2000-01-06   -0.673690

2000-01-07    0.404705

2000-01-08   -0.370647

Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

In [33]: df[:3]

Out[33]:

                   A         B         C         D

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

2000-01-02  1.212112 -0.173215  0.119209 -1.044236

2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [34]: df[::-1]

Out[34]:

                   A         B         C         D

2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

2000-01-07  0.404705  0.577046 -1.715002 -1.039268

2000-01-06 -0.673690  0.113648 -1.478427  0.524988

2000-01-05 -0.424972  0.567020  0.276232 -1.087401

2000-01-04  0.721555 -0.706771 -1.039575  0.271860

2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

2000-01-02  1.212112 -0.173215  0.119209 -1.044236

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632

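Selection by label

pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound and the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see "Slicing with labels").
- A boolean array.
- A callable, see "Selection by callable".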

In [38]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))

In [39]: s1

Out[39]:

a    1.431256

b    1.340309

c   -1.170299

d   -0.226169

e    0.410835

f    0.813850

dtype: float64

In [40]: s1.loc['c':]

Out[40]:

c   -1.170299

d   -0.226169

e    0.410835

f    0.813850

dtype: float64

In [41]: s1.loc['b']

Out[41]: 1.3403088497993827

Note that setting works as well:

In [42]: s1.loc['c':] = 0

In [43]: s1

Out[43]:

a    1.431256

b    1.340309

c    0.000000

d    0.000000

e    0.000000

f    0.000000

dtype: float64

With a DataFrame:

In [44]: df1 = pd.DataFrame(np.random.randn(6, 4),

   ....:                    index=list('abcdef'),

   ....:                    columns=list('ABCD'))

   ....: 

In [45]: df1

Out[45]:

          A         B         C         D

a  0.132003 -0.827317 -0.076467 -1.187678

b  1.130127 -1.436737 -1.413681  1.607920

c  1.024180  0.569605  0.875906 -2.211372

d  0.974466 -2.006747 -0.410001 -0.078638

e  0.545952 -1.219217 -1.226825  0.769804

f -1.281247 -0.727707 -0.121306 -0.097883

In [46]: df1.loc[['a', 'b', 'd'], :]

Out[46]:

          A         B         C         D

a  0.132003 -0.827317 -0.076467 -1.187678

b  1.130127 -1.436737 -1.413681  1.607920

d  0.974466 -2.006747 -0.410001 -0.078638

Accessing via label slices:

In [47]: df1.loc['d':, 'A':'C']

Out[47]:

          A         B         C

d  0.974466 -2.006747 -0.410001

e  0.545952 -1.219217 -1.226825

f -1.281247 -0.727707 -0.121306

For getting a cross section using a label (equivalent to df.xs('a')):

In [48]: df1.loc['a']

Out[48]:

A    0.132003

B   -0.827317

C   -0.076467

D   -1.187678

Name: a, dtype: float64

For getting values with a boolean array:

In [49]: df1.loc['a'] > 0

Out[49]:

A     True

B    False

C    False

D    False

Name: a, dtype: bool

In [50]: df1.loc[:, df1.loc['a'] > 0]

Out[50]:

          A

a  0.132003

b  1.130127

c  1.024180

d  0.974466

e  0.545952

f -1.281247

For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):

# this is also equivalent to ``df1.at['a','A']``

In [51]: df1.loc['a', 'A']

Out[51]: 0.13200317033032932

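Slicing with labels

When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned: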

In [52]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

In [53]: s.loc[3:5]

Out[53]:

3    b

2    c

5    d

dtype: object

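If at least one of the two is absent, but the index is sorted and can be compared against the start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two: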

In [54]: s.sort_index()

Out[54]:

0    a

2    c

3    b

4    e

5    d

dtype: object

In [55]: s.sort_index().loc[1:6]

Out[55]:

2    c

3    b

4    e

5    d

dtype: object

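However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise KeyError.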
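The three cases described in this subsection can be reproduced side by side. The sketch below reuses the Series from the example above and wraps the failing slice in a try/except so the script runs to completion.

```python
import pandas as pd

s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

# Both bounds exist: everything between them, in index order, is returned.
both_present = s.loc[3:5]

# Bounds 1 and 6 are missing, but the sorted index can still be compared.
sorted_slice = s.sort_index().loc[1:6]

# Missing bounds on the unsorted index are rejected.
try:
    s.loc[1:6]
    raised = False
except KeyError:
    raised = True

print(list(both_present.index), list(sorted_slice.index), raised)
```

Calling `sort_index()` before label slicing is therefore a reasonable habit whenever the slice bounds are not guaranteed to be present in the index.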

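Selection by position

pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely Python and NumPy slicing. These are 0-based indexing. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an IndexError.

The .iloc attribute is the primary access method. The following are valid inputs:

- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with integers, e.g. 1:7.
- A boolean array.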

In [56]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))

In [57]: s1

Out[57]:

0    0.695775

2    0.341734

4    0.959726

6   -1.110336

8   -0.619976

dtype: float64

In [58]: s1.iloc[:3]

Out[58]:

0    0.695775

2    0.341734

4    0.959726

dtype: float64

In [59]: s1.iloc[3]

Out[59]: -1.110336102891167

Note that setting works as well:

In [60]: s1.iloc[:3] = 0

In [61]: s1

Out[61]:

0    0.000000

2    0.000000

4    0.000000

6   -1.110336

8   -0.619976

dtype: float64

With a DataFrame:

In [62]: df1 = pd.DataFrame(np.random.randn(6, 4),

   ....:                    index=list(range(0, 12, 2)),

   ....:                    columns=list(range(0, 8, 2)))

   ....: 

In [63]: df1

Out[63]:

           0         2         4         6

0   0.149748 -0.732339  0.687738  0.176444

2   0.403310 -0.154951  0.301624 -2.179861

4  -1.369849 -0.954208  1.462696 -1.743161

6  -0.826591 -0.345352  1.314232  0.690579

8   0.995761  2.396780  0.014871  3.357427

10 -0.317441 -1.236269  0.896171 -0.487602

Select via integer slicing:

In [64]: df1.iloc[:3]

Out[64]:

          0         2         4         6

0  0.149748 -0.732339  0.687738  0.176444

2  0.403310 -0.154951  0.301624 -2.179861

4 -1.369849 -0.954208  1.462696 -1.743161

In [65]: df1.iloc[1:5, 2:4]

Out[65]:

          4         6

2  0.301624 -2.179861

4  1.462696 -1.743161

6  1.314232  0.690579

8  0.014871  3.357427

Select via integer list:

In [66]: df1.iloc[[1, 3, 5], [1, 3]]

Out[66]:

           2         6

2  -0.154951 -2.179861

6  -0.345352  0.690579

10 -1.236269 -0.487602

In [67]: df1.iloc[1:3, :]

Out[67]:

          0         2         4         6

2  0.403310 -0.154951  0.301624 -2.179861

4 -1.369849 -0.954208  1.462696 -1.743161

In [68]: df1.iloc[:, 1:3]

Out[68]:

           2         4

0  -0.732339  0.687738

2  -0.154951  0.301624

4  -0.954208  1.462696

6  -0.345352  1.314232

8   2.396780  0.014871

10 -1.236269  0.896171

# this is also equivalent to ``df1.iat[1,1]``

In [69]: df1.iloc[1, 1]

Out[69]: -0.1549507744249032

For getting a cross section using an integer position (equivalent to df.xs(1)):

In [70]: df1.iloc[1]

Out[70]:

0    0.403310

2   -0.154951

4    0.301624

6   -2.179861

Name: 2, dtype: float64

Out of range slice indexes are handled gracefully just as in Python/NumPy.

# these are allowed in python/numpy.

In [71]: x = list('abcdef')

In [72]: x

Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']

In [73]: x[4:10]

Out[73]: ['e', 'f']

In [74]: x[8:10]

Out[74]: []

In [75]: s = pd.Series(x)

In [76]: s

Out[76]:

0    a

1    b

2    c

3    d

4    e

5    f

dtype: object

In [77]: s.iloc[4:10]

Out[77]:

4    e

5    f

dtype: object

In [78]: s.iloc[8:10]

Out[78]: Series([], dtype: object)

Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).

In [79]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [80]: dfl

Out[80]:

          A         B

0 -0.082240 -2.182937

1  0.380396  0.084844

2  0.432390  1.519970

3 -0.493662  0.600178

4  0.274230  0.132885

In [81]: dfl.iloc[:, 2:3]

Out[81]:

Empty DataFrame

Columns: []

Index: [0, 1, 2, 3, 4]

In [82]: dfl.iloc[:, 1:3]

Out[82]:

          B

0 -2.182937

1  0.084844

2  1.519970

3  0.600178

4  0.132885

In [83]: dfl.iloc[4:6]

Out[83]:

         A         B

4  0.27423  0.132885

A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.

>>> dfl.iloc[[4, 5, 6]]

IndexError: positional indexers are out-of-bounds

>>> dfl.iloc[:, 4]

IndexError: single positional indexer is out-of-bounds
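The contrast between forgiving slices and strict scalar or list indexers can be put into one runnable sketch. The frame contents are made up for illustration, and `raises_index_error` is a hypothetical helper introduced only for this example.

```python
import pandas as pd

dfl = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Slices are clipped to the available range, as with Python lists.
clipped = dfl.iloc[4:6]      # only row 4 exists
empty = dfl.iloc[:, 2:3]     # no column at position 2: empty result


def raises_index_error(func):
    """Return True if calling func() raises IndexError."""
    try:
        func()
    except IndexError:
        return True
    return False


# Lists and single integers must be fully in bounds.
list_out_of_bounds = raises_index_error(lambda: dfl.iloc[[4, 5, 6]])
scalar_out_of_bounds = raises_index_error(lambda: dfl.iloc[:, 4])

print(clipped.shape, empty.shape, list_out_of_bounds, scalar_out_of_bounds)
```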

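Selection by callable

.loc, .iloc, and also [] indexing can accept a callable as indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.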

In [84]: df1 = pd.DataFrame(np.random.randn(6, 4),

   ....:                    index=list('abcdef'),

   ....:                    columns=list('ABCD'))

   ....: 

In [85]: df1

Out[85]:

          A         B         C         D

a -0.023688  2.410179  1.450520  0.206053

b -0.251905 -2.213588  1.063327  1.266143

c  0.299368 -0.863838  0.408204 -1.048089

d -0.025747 -0.988387  0.094055  1.262731

e  1.289997  0.082423 -0.055758  0.536580

f -0.489682  0.369374 -0.034571 -2.484478

In [86]: df1.loc[lambda df: df.A > 0, :]

Out[86]:

          A         B         C         D

c  0.299368 -0.863838  0.408204 -1.048089

e  1.289997  0.082423 -0.055758  0.536580

In [87]: df1.loc[:, lambda df: ['A', 'B']]

Out[87]:

          A         B

a -0.023688  2.410179

b -0.251905 -2.213588

c  0.299368 -0.863838

d -0.025747 -0.988387

e  1.289997  0.082423

f -0.489682  0.369374

In [88]: df1.iloc[:, lambda df: [0, 1]]

Out[88]:

          A         B

a -0.023688  2.410179

b -0.251905 -2.213588

c  0.299368 -0.863838

d -0.025747 -0.988387

e  1.289997  0.082423

f -0.489682  0.369374

In [89]: df1[lambda df: df.columns[0]]

Out[89]:

a   -0.023688

b   -0.251905

c    0.299368

d   -0.025747

e    1.289997

f   -0.489682

Name: A, dtype: float64

You can use callable indexing in Series.

In [90]: df1.A.loc[lambda s: s > 0]

Out[90]:

c    0.299368

e    1.289997

Name: A, dtype: float64

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [92]: (bb.groupby(['year', 'team']).sum()

   ....:    .loc[lambda df: df.r > 100])

   ....:

Out[92]:

           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb     so   ibb   hbp    sh    sf  gidp

year team

2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105  127.0  14.0   1.0   1.0  15.0  18.0

     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97  176.0   3.0  10.0   4.0   8.0  28.0

     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60  212.0   3.0   9.0  16.0   6.0  17.0

     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114  141.0   8.0   9.0   3.0   8.0  29.0

     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174  310.0  24.0  23.0  18.0  15.0  48.0

     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235  188.0  51.0   8.0  16.0   6.0  41.0

     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73  140.0   4.0   5.0   2.0   8.0  16.0

     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190  265.0  16.0  12.0   4.0  16.0  38.0
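The baseball example depends on a CSV file that is not included here. The same chaining pattern can be sketched with a small in-memory frame; the rows below are invented for illustration and only mimic the `year`, `team`, and `r` columns.

```python
import pandas as pd

# Hypothetical stand-in for the baseball data set.
bb = pd.DataFrame({
    'year': [2007, 2007, 2007, 2007],
    'team': ['CIN', 'CIN', 'DET', 'HOU'],
    'r': [60, 41, 162, 50],
})

# The lambda receives the grouped-and-summed frame, so no temporary
# variable is needed to refer to the intermediate result.
result = (bb.groupby(['year', 'team']).sum()
            .loc[lambda df: df.r > 100])

print(result)
```

Without the callable, the intermediate result of `groupby(...).sum()` would have to be stored in a variable before it could be used inside its own `.loc[...]` condition.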

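Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you are asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label based scalar lookups, while iat provides integer based lookups analogously to iloc.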

In [136]: s.iat[5]

Out[136]: 5

In [137]: df.at[dates[5], 'A']

Out[137]: -0.6736897080883706

In [138]: df.iat[3, 0]

Out[138]: 0.7215551622443669

You can also set using these same indexers.

In [139]: df.at[dates[5], 'E'] = 7

In [140]: df.iat[3, 0] = 7

at may enlarge the object in-place as above if the indexer is missing.

In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [142]: df

Out[142]:

                   A         B         C         D    E    0

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN

2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN

2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN

2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN

2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN

2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN

2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN

2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN

2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
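As a compact recap of this subsection, the sketch below uses a small made-up frame to show that `at` and `iat` return the same scalars as `loc` and `iloc`, and that `at` can add a row when the label is missing.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['x', 'y'])

same_by_label = df.at['y', 'A'] == df.loc['y', 'A']
same_by_position = df.iat[1, 0] == df.iloc[1, 0]

df.at['y', 'A'] = 7.0    # set an existing cell
df.at['z', 'A'] = 9.0    # label 'z' is missing: the frame gains a row

print(df)
```

The cell at row 'z', column 'B' was never assigned, so it is filled with NaN after the enlargement.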

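Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).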
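The precedence problem is easy to demonstrate. In this sketch the frame is invented for illustration; the unparenthesized expression is wrapped in try/except because the chained comparison ends up asking for the truth value of a whole Series.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame({'A': [1, 3, 5], 'B': [1, 2, 4]})

# Parenthesized form: each comparison yields a boolean Series first.
mask = (df.A > 2) & (df.B < 3)
selected = df[mask]

# Unparenthesized form: parsed as df.A > (2 & df.B) < 3.
try:
    df.A > 2 & df.B < 3
    ambiguous = False
except ValueError:
    ambiguous = True

print(selected)
print(ambiguous)
```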

Using a boolean vector to index a Series works exactly as in a NumPy ndarray:

In [143]: s = pd.Series(range(-3, 4))

In [144]: s

Out[144]:

0   -3

1   -2

2   -1

3    0

4    1

5    2

6    3

dtype: int64

In [145]: s[s > 0]

Out[145]:

4    1

5    2

6    3

dtype: int64

In [146]: s[(s < -1) | (s > 0.5)]

Out[146]:

0   -3

1   -2

4    1

5    2

6    3

dtype: int64

In [147]: s[~(s < 0)]

Out[147]:

3    0

4    1

5    2

6    3

dtype: int64

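You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):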

In [148]: df[df['A'] > 0]

Out[148]:

                   A         B         C         D   E   0

2000-01-01  0.469112 -0.282863 -1.509059 -1.135632 NaN NaN

2000-01-02  1.212112 -0.173215  0.119209 -1.044236 NaN NaN

2000-01-04  7.000000 -0.706771 -1.039575  0.271860 NaN NaN

2000-01-07  0.404705  0.577046 -1.715002 -1.039268 NaN NaN

List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [149]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],

   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],

   .....:                     'c': np.random.randn(7)})

   .....: 

# only want 'two' or 'three'

In [150]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [151]: df2[criterion]

Out[151]:

       a  b         c

2    two  y  0.041290

3  three  x  0.361719

4    two  y -0.238075

# equivalent but slower

In [152]: df2[[x.startswith('t') for x in df2['a']]]

Out[152]:

       a  b         c

2    two  y  0.041290

3  three  x  0.361719

4    two  y -0.238075

# Multiple criteria

In [153]: df2[criterion & (df2['b'] == 'x')]

Out[153]:

       a  b         c

3  three  x  0.361719

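With the choice methods "Selection by label", "Selection by position", and Advanced Indexing, you may select along more than one axis using boolean vectors combined with other indexing expressions.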

In [154]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']

Out[154]:

   b         c

3  x  0.361719

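Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want: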

In [155]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [156]: s

Out[156]:

4    0

3    1

2    2

1    3

0    4

dtype: int64

In [157]: s.isin([2, 4, 6])

Out[157]:

4    False

3    False

2     True

1    False

0     True

dtype: bool

In [158]: s[s.isin([2, 4, 6])]

Out[158]:

2    2

0    4

dtype: int64

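The same method is available for Index objects and is useful for the cases when you don't know which of the sought labels are in fact present: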

In [159]: s[s.index.isin([2, 4, 6])]

Out[159]:

4    0

2    2

dtype: int64

# compare it to the following

In [160]: s.reindex([2, 4, 6])

Out[160]:

2    2.0

4    0.0

6    NaN

dtype: float64

In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [161]: s_mi = pd.Series(np.arange(6),

   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))

   .....: 

In [162]: s_mi

Out[162]:

0  a    0

   b    1

   c    2

1  a    3

   b    4

   c    5

dtype: int64

In [163]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]

Out[163]:

0  c    2

1  a    3

dtype: int64

In [164]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]

Out[164]:

0  a    0

   c    2

1  a    3

   c    5

dtype: int64

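DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or a dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.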

In [165]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],

   .....:                    'ids2': ['a', 'n', 'c', 'n']})

   .....: 

In [166]: values = ['a', 'b', 1, 3]

In [167]: df.isin(values)

Out[167]:

    vals    ids   ids2

0   True   True   True

1  False   True  False

2   True  False  False

3  False  False  False

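Oftentimes you'll want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.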

In [168]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [169]: df.isin(values)

Out[169]:

    vals    ids   ids2

0   True   True  False

1  False   True  False

2   True  False  False

3  False  False  False

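Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet given criteria. To select the rows where each column meets its own criterion: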

In [170]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [171]: row_mask = df.isin(values).all(axis=1)

In [172]: df[row_mask]

Out[172]:

   vals ids ids2

0     1   a    a
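The text mentions both any() and all(), but only all() is shown. The sketch below rebuilds the same frame and contrasts the two reductions; `axis=1` is spelled out as a keyword so that the reduction direction (across columns, per row) is explicit.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})

values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
matches = df.isin(values)

# Rows where every column satisfies its own criterion.
all_rows = df[matches.all(axis=1)]

# Rows where at least one column satisfies its criterion.
any_rows = df[matches.any(axis=1)]

print(all_rows)
print(any_rows)
```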

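The where() method and masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows: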

In [173]: s[s > 0]

Out[173]:

3    1

2    2

1    3

0    4

dtype: int64

To return a Series of the same shape as the original:

In [174]: s.where(s > 0)

Out[174]:

4    NaN

3    1.0

2    2.0

1    3.0

0    4.0

dtype: float64

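Selecting values from a DataFrame with a boolean criterion now also preserves the input data shape. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).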

In [175]: df[df < 0]

Out[175]:

                   A         B         C         D

2000-01-01 -2.104139 -1.309525       NaN       NaN

2000-01-02 -0.352480       NaN -1.192319       NaN

2000-01-03 -0.864883       NaN -0.227870       NaN

2000-01-04       NaN -1.222082       NaN -1.233203

2000-01-05       NaN -0.605656 -1.169184       NaN

2000-01-06       NaN -0.948458       NaN -0.684718

2000-01-07 -2.670153 -0.114722       NaN -0.048048

2000-01-08       NaN       NaN -0.048788 -0.808838

In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

In [176]: df.where(df < 0, -df)

Out[176]:

                   A         B         C         D

2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166

2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824

2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059

2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203

2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416

2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718

2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048

2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [177]: s2 = s.copy()

In [178]: s2[s2 < 0] = 0

In [179]: s2

Out[179]:

4    0

3    1

2    2

1    3

0    4

dtype: int64

In [180]: df2 = df.copy()

In [181]: df2[df2 < 0] = 0

In [182]: df2

Out[182]:

                   A         B         C         D

2000-01-01  0.000000  0.000000  0.485855  0.245166

2000-01-02  0.000000  0.390389  0.000000  1.655824

2000-01-03  0.000000  0.299674  0.000000  0.281059

2000-01-04  0.846958  0.000000  0.600705  0.000000

2000-01-05  0.669692  0.000000  0.000000  0.342416

2000-01-06  0.868584  0.000000  2.297780  0.000000

2000-01-07  0.000000  0.000000  0.168904  0.000000

2000-01-08  0.801196  1.392071  0.000000  0.000000

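By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy: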

In [183]: df_orig = df.copy()

In [184]: df_orig.where(df > 0, -df, inplace=True)

In [185]: df_orig

Out[185]:

                   A         B         C         D

2000-01-01  2.104139  1.309525  0.485855  0.245166

2000-01-02  0.352480  0.390389  1.192319  1.655824

2000-01-03  0.864883  0.299674  0.227870  0.281059

2000-01-04  0.846958  1.222082  0.600705  1.233203

2000-01-05  0.669692  0.605656  1.169184  0.342416

2000-01-06  0.868584  0.948458  2.297780  0.684718

2000-01-07  2.670153  0.114722  0.168904  0.048048

2000-01-08  0.801196  1.392071  0.048788  0.808838
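The section title mentions masking, which the examples above do not show. As a brief sketch with made-up data: `mask` is the inverse of `where`, replacing values where the condition is True instead of where it is False.

```python
import pandas as pd

# Hypothetical Series used only for this illustration.
s = pd.Series([-2, -1, 0, 1, 2])

kept = s.where(s > 0)          # NaN wherever the condition is False
masked = s.mask(s <= 0)        # NaN wherever the condition is True
clipped = s.where(s > 0, 0)    # 'other' supplies the replacement value

print(kept.equals(masked))
print(list(clipped))
```

Since `s <= 0` is the negation of `s > 0`, the first two results are identical; which spelling to use is mostly a matter of which condition reads more naturally.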

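Duplicate data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

- duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
- drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify targets to be kept.

- keep='first' (default): mark / drop duplicates except for the first occurrence.
- keep='last': mark / drop duplicates except for the last occurrence.
- keep=False: mark / drop all duplicates.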

In [264]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],

   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],

   .....:                     'c': np.random.randn(7)})

   .....: 

In [265]: df2

Out[265]:

       a  b         c

0    one  x -1.067137

1    one  y  0.309500

2    two  x -0.211056

3    two  y -1.842023

4    two  x -0.390820

5  three  x -1.964475

6   four  x  1.298329

In [266]: df2.duplicated('a')

Out[266]:

0    False

1     True

2    False

3     True

4     True

5    False

6    False

dtype: bool

In [267]: df2.duplicated('a', keep='last')

Out[267]:

0     True

1    False

2     True

3     True

4    False

5    False

6    False

dtype: bool

In [268]: df2.duplicated('a', keep=False)

Out[268]:

0     True

1     True

2     True

3     True

4     True

5    False

6    False

dtype: bool

In [269]: df2.drop_duplicates('a')

Out[269]:

       a  b         c

0    one  x -1.067137

2    two  x -0.211056

5  three  x -1.964475

6   four  x  1.298329

In [270]: df2.drop_duplicates('a', keep='last')

Out[270]:

       a  b         c

1    one  y  0.309500

4    two  x -0.390820

5  three  x -1.964475

6   four  x  1.298329

In [271]: df2.drop_duplicates('a', keep=False)

Out[271]:

       a  b         c

5  three  x -1.964475

6   four  x  1.298329

Also, you can pass a list of columns to identify duplications.

In [272]: df2.duplicated(['a', 'b'])

Out[272]:

0    False

1    False

2    False

3    False

4     True

5    False

6    False

dtype: bool

In [273]: df2.drop_duplicates(['a', 'b'])

Out[273]:

       a  b         c

0    one  x -1.067137

1    one  y  0.309500

2    two  x -0.211056

3    two  y -1.842023

5  three  x -1.964475

6   four  x  1.298329
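The two methods are closely related: dropping duplicates gives the same rows as filtering with the negated duplicated mask. A minimal sketch, reusing the 'a' and 'b' columns from the example above (the random 'c' column is left out because it plays no role here):

```python
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x']})

# Boolean mask of rows that repeat an earlier-or-later value of 'a',
# keeping the last occurrence of each value unmarked.
mask = df2.duplicated('a', keep='last')

via_mask = df2[~mask]
via_method = df2.drop_duplicates('a', keep='last')

print(via_mask.equals(via_method))
```

The mask form is handy when the duplicate flags are needed for something other than dropping, such as counting or inspecting the duplicated rows.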

To drop duplicates by index value, use Index.duplicated and then perform slicing. The same set of options is available for the keep parameter.

In [274]: df3 = pd.DataFrame({'a': np.arange(6),

   .....:                     'b': np.random.randn(6)},

   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])

   .....: 

In [275]: df3

Out[275]:

   a         b

a  0  1.440455

a  1  2.456086

b  2  1.038402

c  3 -0.894409

b  4  0.683536

a  5  3.082764

In [276]: df3.index.duplicated()

Out[276]: array([False,  True, False, False,  True,  True])

In [277]: df3[~df3.index.duplicated()]

Out[277]:

   a         b

a  0  1.440455

b  2  1.038402

c  3 -0.894409

In [278]: df3[~df3.index.duplicated(keep='last')]

Out[278]:

   a         b

c  3 -0.894409

b  4  0.683536

a  5  3.082764

In [279]: df3[~df3.index.duplicated(keep=False)]

Out[279]:

   a         b

c  3 -0.894409

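Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: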

In [285]: index = pd.Index(['e', 'd', 'a', 'b'])

In [286]: index

Out[286]: Index(['e', 'd', 'a', 'b'], dtype='object')

In [287]: 'd' in index

Out[287]: True

You can also pass a name to be stored in the index:

In [288]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')

In [289]: index.name

Out[289]: 'something'

The name, if set, will be shown in the console display:

In [290]: index = pd.Index(list(range(5)), name='rows')

In [291]: columns = pd.Index(['A', 'B', 'C'], name='cols')

In [292]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [293]: df

Out[293]:

cols         A         B         C

rows

0     1.295989  0.185778  0.436259

1     0.678101  0.311369 -0.528378

2    -0.674808 -1.103529 -0.656157

3     1.889957  2.076651 -1.102192

4    -1.211795 -0.791746  0.634724

In [294]: df['A']

Out[294]:

rows

0    1.295989

1    0.678101

2   -0.674808

3    1.889957

4   -1.211795

Name: A, dtype: float64

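Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index name (or, for MultiIndex, levels and codes).

You can use rename, set_names, set_levels, and set_codes to set these attributes directly. They default to returning a copy; however, you can specify inplace=True to have the data change in place.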

In [295]: ind = pd.Index([1, 2, 3])

In [296]: ind.rename("apple")

Out[296]: Int64Index([1, 2, 3], dtype='int64', name='apple')

In [297]: ind

Out[297]: Int64Index([1, 2, 3], dtype='int64')

In [298]: ind.set_names(["apple"], inplace=True)

In [299]: ind.name = "bob"

In [300]: ind

Out[300]: Int64Index([1, 2, 3], dtype='int64', name='bob')

set_names, set_levels, and set_codes also take an optional level argument.

In [301]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [302]: index

Out[302]:

MultiIndex([(0, 'one'),

            (0, 'two'),

            (1, 'one'),

            (1, 'two'),

            (2, 'one'),

            (2, 'two')],

           names=['first', 'second'])

In [303]: index.levels[1]

Out[303]: Index(['one', 'two'], dtype='object', name='second')

In [304]: index.set_levels(["a", "b"], level=1)

Out[304]:

MultiIndex([(0, 'a'),

            (0, 'b'),

            (1, 'a'),

            (1, 'b'),

            (2, 'a'),

            (2, 'b')],

           names=['first', 'second'])

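Set operations on Index objects

The two main operations are union (|) and intersection (&). These can be directly called as instance methods or used via overloaded operators. Difference is provided via the .difference() method.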

In [305]: a = pd.Index(['c', 'b', 'a'])

In [306]: b = pd.Index(['c', 'e', 'd'])

In [307]: a | b

Out[307]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [308]: a & b

Out[308]: Index(['c'], dtype='object')

In [309]: a.difference(b)

Out[309]: Index(['a', 'b'], dtype='object')

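Also available is the symmetric_difference (^) operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.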

In [310]: idx1 = pd.Index([1, 2, 3, 4])

In [311]: idx2 = pd.Index([2, 3, 4, 5])

In [312]: idx1.symmetric_difference(idx2)

Out[312]: Int64Index([1, 5], dtype='int64')

In [313]: idx1 ^ idx2

Out[313]: Int64Index([1, 5], dtype='int64')
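All four set operations are also available as instance methods. The sketch below uses only the method spellings; this is a cautious choice on the assumption that the overloaded operators (|, &, ^) have not behaved the same way in every pandas release, whereas the methods have kept their set semantics.

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

union = a.union(b)                       # labels in either index
intersection = a.intersection(b)         # labels in both
difference = a.difference(b)             # labels in a but not in b
symmetric = a.symmetric_difference(b)    # labels in exactly one of them

# The symmetric difference expressed through the other operations.
rebuilt = a.difference(b).union(b.difference(a))

print(list(union), list(intersection), list(difference), list(symmetric))
```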

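When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common dtype. Typically, though not always, this is object dtype. The exception is when performing a union between integer and float data. In this case, the integer values are converted to float: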

In [314]: idx1 = pd.Index([0, 1, 2])

In [315]: idx2 = pd.Index([0.5, 1.5])

In [316]: idx1 | idx2

Out[316]: Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')

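Missing values

Even though Index can hold missing values (NaN), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with the specified scalar value.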

In [317]: idx1 = pd.Index([1, np.nan, 3, 4])

In [318]: idx1

Out[318]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [319]: idx1.fillna(2)

Out[319]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [320]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),

   .....:                          pd.NaT,

   .....:                          pd.Timestamp('2011-01-03')])

   .....: 

In [321]: idx2

Out[321]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [322]: idx2.fillna(pd.Timestamp('2011-01-02'))

Out[322]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

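Set / reset index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you have already done so. There are a couple of different ways.

Set an index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame: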

In [323]: data

Out[323]:

     a    b  c    d

0  bar  one  z  1.0

1  bar  two  y  2.0

2  foo  one  x  3.0

3  foo  two  w  4.0

In [324]: indexed1 = data.set_index('c')

In [325]: indexed1

Out[325]:

     a    b    d

c

z  bar  one  1.0

y  bar  two  2.0

x  foo  one  3.0

w  foo  two  4.0

In [326]: indexed2 = data.set_index(['a', 'b'])

In [327]: indexed2

Out[327]:

         c    d

a   b

bar one  z  1.0

    two  y  2.0

foo one  x  3.0

    two  w  4.0

The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [328]: frame = data.set_index('c', drop=False)

In [329]: frame = frame.set_index(['a', 'b'], append=True)

In [330]: frame

Out[330]:

           c    d

c a   b

z bar one  z  1.0

y bar two  y  2.0

x foo one  x  3.0

w foo two  w  4.0

Other options in set_index allow you to not drop the index columns or to add the index in-place (without creating a new object):

In [331]: data.set_index('c', drop=False)

Out[331]:

     a    b  c    d

c

z  bar  one  z  1.0

y  bar  two  y  2.0

x  foo  one  x  3.0

w  foo  two  w  4.0

In [332]: data.set_index(['a', 'b'], inplace=True)

In [333]: data

Out[333]:

         c    d

a   b

bar one  z  1.0

    two  y  2.0

foo one  x  3.0

    two  w  4.0
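The transcript above uses a frame named data without showing how it was built. The sketch below reconstructs it from the printed output (so the constructor call is an assumption, not taken from the original) and repeats the main set_index variants in one runnable script.

```python
import pandas as pd

# Reconstructed from the printed table above; the exact constructor
# used by the original author is not shown in the text.
data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1.0, 2.0, 3.0, 4.0]})

indexed1 = data.set_index('c')            # single column -> regular Index
indexed2 = data.set_index(['a', 'b'])     # list of columns -> MultiIndex

kept = data.set_index('c', drop=False)    # keep 'c' as a column too
appended = kept.set_index(['a', 'b'], append=True)  # extend existing index

print(indexed1.columns.tolist())
print(list(appended.index.names))
```

Each call returns a new DataFrame, so data itself is left unchanged in this sketch.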

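Reset the index

As a convenience, there is a new function on DataFrame called reset_index() which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().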

In [334]: data

Out[334]:

         c    d

a   b

bar one  z  1.0

    two  y  2.0

foo one  x  3.0

    two  w  4.0

In [335]: data.reset_index()

Out[335]:

     a    b  c    d

0  bar  one  z  1.0

1  bar  two  y  2.0

2  foo  one  x  3.0

3  foo  two  w  4.0

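The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.

You can use the level keyword to remove only a portion of the index: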

In [336]: frame

Out[336]:

           c    d

c a   b

z bar one  z  1.0

y bar two  y  2.0

x foo one  x  3.0

w foo two  w  4.0

In [337]: frame.reset_index(level=1)

Out[337]:

         a  c    d

c b

z one  bar  z  1.0

y two  bar  y  2.0

x one  foo  x  3.0

w two  foo  w  4.0

reset_index takes an optional parameter drop which, if true, simply discards the index, instead of putting the index values in the DataFrame's columns.
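The drop parameter has no example in the text. The following sketch contrasts the three behaviours on a small frame; the frame is rebuilt here from the printed tables above, so its construction is an assumption.

```python
import pandas as pd

# Frame with a two-level index, rebuilt from the printed output above.
index = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']],
                                   names=['a', 'b'])
data = pd.DataFrame({'c': ['z', 'y', 'x', 'w'],
                     'd': [1.0, 2.0, 3.0, 4.0]}, index=index)

restored = data.reset_index()            # index levels become columns
dropped = data.reset_index(drop=True)    # index levels are discarded
partial = data.reset_index(level='b')    # only level 'b' becomes a column

print(list(restored.columns), list(dropped.columns), list(partial.columns))
```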

Adding an ad hoc index

If you create an index yourself, you can just assign it to the index field:

data.index = index
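A short sketch of the assignment, using a made-up frame and labels. One detail worth knowing is that the new index must have exactly as many entries as the frame has rows; otherwise pandas raises a ValueError.

```python
import pandas as pd

# Hypothetical frame and labels used only for this illustration.
data = pd.DataFrame({'d': [1.0, 2.0, 3.0, 4.0]})

data.index = pd.Index(['w', 'x', 'y', 'z'], name='key')

# A replacement index of the wrong length is rejected.
try:
    data.index = ['only', 'three', 'labels']
    mismatch_raises = False
except ValueError:
    mismatch_raises = True

print(data)
print(mismatch_raises)
```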