The core data analysis package: pandas

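I. Introduction to pandas

pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

1. Main features of pandas

(1) Data structures with built-in data alignment: DataFrame and Series

(2) Integrated time series functionality

(3) A rich set of mathematical operations

(4) Flexible handling of missing data

2. Installation and import

# Installation:

# pip install pandas

# Import:

import pandas as pd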

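II. Series: a one-dimensional data object

A Series is a one-dimensional array-like object made up of a sequence of values and an associated sequence of data labels (the index).

# Creating a Series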

>>> import pandas as pd

>>> pd.Series([2,3,4,5])

0    2

1    3

2    4

3    5

dtype: int64

>>> pd.Series([2,3,4,5], index=['a','b','c','d'])

a    2

b    3

c    4

d    5

dtype: int64

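The values attribute returns the array of values, and the index attribute returns the index.

A Series behaves like a cross between a list (array) and a dict.

1. Series: usage

(1) Series supports array-style access (by position)

# Create a Series from an ndarray: Series(arr)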

>>> import numpy as np

>>> pd.Series(np.arange(5))

0    0

1    1

2    2

3    3

4    4

dtype: int64

# Arithmetic with a scalar: sr*2

>>> sr = pd.Series([2,3,4,5], index=['a','b','c','d'])

>>> sr

a    2

b    3

c    4

d    5

dtype: int64

>>> sr*2

a     4

b     6

c     8

d    10

dtype: int64

>>> sr+2

a    4

b    5

c    6

d    7

dtype: int64

# Arithmetic between two Series: sr1+sr2

>>> sr + sr

a     4

b     6

c     8

d    10

dtype: int64

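# Indexing: sr[0], sr[[1,2,3]] (integer keys act as positions here because the labels are strings; recent pandas versions deprecate this in favor of sr.iloc[...])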

>>> sr[0]

2

>>> sr[[1,2,3]]

b    3

c    4

d    5

dtype: int64

# Slicing: sr[0:2]

>>> sr[0:2]

a    2

b    3

dtype: int64

# Universal functions (maximum, absolute value, etc.), e.g. np.abs(sr)

>>> sr.max()

5

>>> np.abs(sr)

a    2

b    3

c    4

d    5

dtype: int64

# Boolean filtering: sr[sr>4]

>>> sr>4

a    False

b    False

c    False

d     True

dtype: bool

>>> sr[sr>4]

d    5

dtype: int64

(2) Series supports dict-style access (by label)

# Create a Series from a dict: Series(dic)

>>> sr = pd.Series({'a':3, 'b':2, 'c':4})

>>> sr

a    3

b    2

c    4

dtype: int64

# The in operator: 'a' in sr

>>> 'a' in sr

True

>>> 'e' in sr

False

>>> for i in sr:

...     print(i)   # iteration yields the values, not the keys

3

2

4

# Label indexing: sr['a'], sr[['a','b','c']]

>>> sr['a']

3

>>> sr[['a','b','c']]

a    3

b    2

c    4

dtype: int64

# Get the index and the corresponding values

>>> sr.index

Index(['a', 'b', 'c'], dtype='object')

>>> sr.index[0]

'a'

>>> sr.values

array([3, 2, 4])

>>> sr = pd.Series([1,2,3,4],index=['a','b','c','d'])

>>> sr

a    1

b    2

c    3

d    4

dtype: int64

>>> sr[['a','c']]

a    1

c    3

dtype: int64

>>> sr['a':'c']  # label-based slicing (both endpoints are included)

a    1

b    2

c    3

dtype: int64

2. Series: the integer index problem

pandas objects with an integer index tend to trip up beginners.

>>> sr = pd.Series(np.arange(4.))

>>> sr

0    0.0

1    1.0

2    2.0

3    3.0

dtype: float64

>>> sr[-1]

Error message:

KeyError: -1

>>> sr = pd.Series(np.arange(10))

>>> sr

0    0

1    1

2    2

3    3

4    4

5    5

6    6

7    7

8    8

9    9

dtype: int64

>>> sr2 = sr[5:].copy()   # copy after slicing

>>> sr2  # the index keeps its original values

5    5

6    6

7    7

8    8

9    9

dtype: int64

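If the index has an integer dtype, indexing with an integer in square brackets is always label-based. (In other words, when the index values are integers, an integer key is always interpreted as a label.)

Solution:

# loc attribute: interpret the key as a label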

>>> sr2.loc[7]

7

# iloc attribute: interpret the key as a position

>>> sr2.iloc[3]

8

So whenever integers are involved, use loc or iloc to state explicitly whether the key in the brackets is a label or a position.
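The difference between label-based and position-based access can be sketched on the same sliced Series. This is a minimal illustration; the variable names by_label, by_position and last are made up for the example.

```python
import numpy as np
import pandas as pd

# Slicing keeps the original integer labels 5..9.
sr2 = pd.Series(np.arange(10))[5:].copy()

by_label = sr2.loc[7]       # the element whose label is 7
by_position = sr2.iloc[3]   # the fourth element, whatever its label
last = sr2.iloc[-1]         # negative positions are valid with iloc

print(by_label, by_position, last)
```

Note that sr2.iloc[-1] works even though sr2[-1] would be looked up as a label and fail.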

3. Series: data alignment

When pandas operates on two Series objects, it aligns them by index before computing.

(1) Arithmetic between Series objects

>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])

>>> sr2 = pd.Series([11,20,10], index=['d','c','a'])

>>> sr1 + sr2

a    33     # 23+10

c    32     # 12+20

d    45     # 34+11

dtype: int64

>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])

>>> sr2 = pd.Series([11,20,10,21], index=['d','c','a','b'])

>>> sr1 + sr2   # adding Series of different lengths

a    33.0

b     NaN   # NaN marks a missing value in pandas

c    32.0

d    45.0

dtype: float64

>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])

>>> sr2 = pd.Series([11,20,10], index=['b','c','a'])

>>> sr1 + sr2

a    33.0

b     NaN

c    32.0

d     NaN

dtype: float64

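If the two Series do not have identical indexes, the index of the result is the union of the two operand indexes.

If only one of the objects has a value at a given label, the result at that label is NaN (missing).

(2) Flexible arithmetic methods

The flexible arithmetic methods are add, sub, mul and div (addition, subtraction, multiplication and division respectively).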

>>> sr1 = pd.Series([12,23,34], index=['c','a','d'])

>>> sr2 = pd.Series([11,20,10], index=['b','c','a'])

>>> sr1.add(sr2)

a    33.0

b     NaN

c    32.0

d     NaN

dtype: float64

>>> sr1.add(sr2, fill_value=0)    # when only one side has a value at a label, the missing side is treated as 0

a    33.0

b    11.0

c    32.0

d    34.0

dtype: float64
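The other flexible methods accept fill_value in the same way. The sketch below reuses sr1 and sr2 from above; choosing 1 as the fill value for multiplication is only an illustrative assumption (it leaves the existing operand unchanged).

```python
import pandas as pd

sr1 = pd.Series([12, 23, 34], index=['c', 'a', 'd'])
sr2 = pd.Series([11, 20, 10], index=['b', 'c', 'a'])

# Subtraction: a label missing on one side counts as 0 on that side.
diff = sr1.sub(sr2, fill_value=0)   # a: 23-10, b: 0-11, c: 12-20, d: 34-0

# Multiplication: a label missing on one side counts as 1 on that side.
prod = sr1.mul(sr2, fill_value=1)   # a: 23*10, b: 1*11, c: 12*20, d: 34*1

print(diff)
print(prod)
```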

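4. Series: handling missing values

Missing data: pandas uses NaN (Not a Number) to represent missing data. In code it is written as np.nan.

The built-in None value is also treated as NaN.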
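A small sketch of how None ends up as NaN in numeric data, and why missing values should be detected with isnull() rather than with ==:

```python
import numpy as np
import pandas as pd

# None inside numeric data is stored as NaN, so the dtype becomes float64.
sr = pd.Series([1, None, 3])
print(sr.dtype)
print(sr.isnull().tolist())

# NaN never compares equal to anything, including itself,
# so an equality test cannot be used to find missing values.
print(np.nan == np.nan)
```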

(1) Methods for handling missing data

>>> sr = sr1+sr2

>>> sr

a    33.0

b     NaN

c    32.0

d     NaN

dtype: float64

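# dropna(): drop the rows whose value is NaN

# fillna(): fill in missing data

# isnull(): return a boolean array in which missing values are True (tests whether a value is missing)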

>>> sr.isnull()

a    False

b     True   # True marks NaN

c    False

d     True

dtype: bool

# notnull(): return a boolean array in which missing values are False

>>> sr.notnull()

a     True

b    False   # False marks NaN

c     True

d    False

dtype: bool

(2) Approach 1: filter out missing data

# sr.dropna()

>>> sr.dropna()

a    33.0

c    32.0

dtype: float64

# sr[sr.notnull()]

>>> sr[sr.notnull()]   # drop every entry with a missing value

a    33.0

c    32.0

dtype: float64

(3) Approach 2: fill in missing data

# fillna()

>>> sr.fillna(0)   # replace missing values with 0

a    33.0

b     0.0

c    32.0

d     0.0

dtype: float64

>>> sr.mean()     # mean computed with NaN values skipped

32.5

>>> sr.fillna(sr.mean())   # replace missing values with the mean

a    33.0

b    32.5

c    32.0

d    32.5

dtype: float64

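5. Series: summary

A Series combines features of an array and a dict, and can be accessed by position and by label.

When the index values are integers, an integer key is always interpreted as a label. Use loc and iloc to state explicitly whether a key is a label or a position.

If the two Series do not have identical indexes, the index of the result is the union of the two operand indexes.

If only one of the objects has a value at a given label, the result at that label is NaN (missing).

Methods for handling missing data: dropna (filter) and fillna (fill).

III. DataFrame: a two-dimensional data object

A DataFrame is a tabular data structure containing an ordered collection of columns (that is, several columns).

A DataFrame can be viewed as a dict of Series that share a single index.

# Ways to create one:

# Method 1: from a dict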

>>> pd.DataFrame({'one':[1,2,3],'two':[4,5,6]})

   one  two

0    1    4

1    2    5

2    3    6

>>> pd.DataFrame({'one':[1,2,3],'two':[4,5,6]}, index=['a','b','c'])  # index sets the row labels

   one  two

a    1    4

b    2    5

c    3    6

# Method 2: from a dict of Series

>>> pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4
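The "dict of Series sharing one index" view can be checked directly. The following is a minimal sketch; the variable name col is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]}, index=['a', 'b', 'c'])

col = df['one']                      # dict-style access returns a single column
print(type(col).__name__)            # each column is a Series
print(col.index.equals(df.index))    # the column carries the frame's row index
print('two' in df, 'a' in df)        # `in` tests column names, like dict keys
```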

# MacBook-Pro:pandas hqs$ vi test.csv   # create the csv file and write the following content

# a,b,c

# 1,2,3

# 2,4,6

# 3,6,9

# Reading and writing csv files:

>>> pd.read_csv('test.csv')    # read_csv(): read a csv file

   a  b  c

0  1  2  3

1  2  4  6

2  3  6  9

>>> df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

>>> df

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4

>>> df.to_csv('test2.csv')    # to_csv(): write a csv file

# MacBook-Pro:pandas hqs$ vi test2.csv   # view the csv file; missing values are written as empty fields

# ,one,two

# a,1.0,2

# b,2.0,1

# c,3.0,3

# d,,4

1. DataFrame: common attributes

>>> df = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

>>> df

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4

# index: the row index

>>> df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

# columns: the column index

>>> df.columns

Index(['one', 'two'], dtype='object')

# values: the array of values (usually two-dimensional)

>>> df.values

array([[ 1.,  2.],

       [ 2.,  1.],

       [ 3.,  3.],

       [nan,  4.]])

# T: transpose

>>> df

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4

>>> df.T      # rows become columns and columns become rows

       a    b    c    d

one  1.0  2.0  3.0  NaN

two  2.0  1.0  3.0  4.0

# describe(): quick summary statistics

>>> df.describe()

       one       two

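count  3.0  4.000000    # number of non-missing values in each column

mean   2.0  2.500000    # mean of each column

std    1.0  1.290994    # standard deviation of each column

min    1.0  1.000000    # minimum of each column

25%    1.5  1.750000    # 25th percentile (first quartile)

50%    2.0  2.500000    # 50th percentile (median)

75%    2.5  3.250000    # 75th percentile (third quartile)

max    3.0  4.000000    # maximum of each column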

2. DataFrame: indexing and slicing

A DataFrame is a two-dimensional data type, so it has both a row index and a column index.

>>> df

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4

>>> df['one']['a']   # column first, then row; each column is a Series

1.0

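A DataFrame can likewise be indexed and sliced both by label and by position.

The loc and iloc attributes: loc selects data by label, and iloc selects data by position.

# Usage: separate the two parts with a comma; the row indexer comes first, the column indexer second

>>> df.loc['a','one']    # row first, then column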

1.0

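# The row and column parts can each be a plain key, a slice, a boolean mask or a list of keys (fancy indexing), in any combination

>>> df.loc['a',:]   # select row a and all columns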

one    1.0

two    2.0

Name: a, dtype: float64

>>> df.loc['a',]    # same result as above

one    1.0

two    2.0

Name: a, dtype: float64

>>> df.loc[['a','c'],:]  # select rows a and c, and all columns

   one  two

a  1.0    2

c  3.0    3

>>> df.loc[['a','c'],'two']

a    2

c    3

Name: two, dtype: int64
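The examples above all use loc. As a complement, here is a minimal sketch of position-based selection with iloc and of a boolean row mask with loc, on the same data; the names first, block and picked are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                   'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})

first = df.iloc[0, 0]                   # row position 0, column position 0
block = df.iloc[0:2, :]                 # position slices exclude the end: rows a and b
picked = df.loc[df['two'] > 2, 'one']   # boolean row mask combined with a column label

print(first)
print(block)
print(picked)
```

Unlike label slices with loc, which include both endpoints, position slices with iloc follow the usual Python convention and exclude the end.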

>>> df.apply(lambda x:x+1)

   one  two

a  2.0    3

b  3.0    2

c  4.0    4

d  NaN    5

>>> df.apply(lambda x:x.mean())

one    2.0

two    2.5

dtype: float64
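In the two apply calls above, the function receives each column as a Series. Passing axis=1 hands it each row instead. A minimal sketch, where the range-per-row computation is just an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
                   'two': pd.Series([1, 2, 3, 4], index=['b', 'a', 'c', 'd'])})

# With axis=1 the lambda is called once per row; max() and min() skip NaN.
spread = df.apply(lambda row: row.max() - row.min(), axis=1)
print(spread)
```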

3. DataFrame: data alignment and missing data handling

DataFrame objects are also aligned during arithmetic; the row index and the column index are each aligned.

>>> df = pd.DataFrame({'two':[1,2,3,4],'one':[4,5,6,7]}, index=['c','d','b','a'])

>>> df2 = pd.DataFrame({'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([1,2,3,4],index=['b','a','c','d'])})

>>> df

   two  one

c    1    4

d    2    5

b    3    6

a    4    7

>>> df2

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  NaN    4

>>> df + df2

   one  two

a  8.0    6

b  8.0    4

c  7.0    4

d  NaN    6

DataFrame methods for handling missing data:

# df.fillna(x): replace every missing value in the DataFrame with x

>>> df2.fillna(0)

   one  two

a  1.0    2

b  2.0    1

c  3.0    3

d  0.0    4

>>> df2.loc['d','two']=np.nan    # introduce more missing values into df2

>>> df2.loc['c','two']=np.nan

>>> df2

   one  two

a  1.0  2.0

b  2.0  1.0

c  3.0  NaN

d  NaN  NaN

# df.dropna(): drop every row that contains a missing value; how defaults to 'any'

>>> df2.dropna()

   one  two

a  1.0  2.0

b  2.0  1.0

>>> df2.dropna(how='all')   # drop only the rows in which every value is missing

   one  two

a  1.0  2.0

b  2.0  1.0

c  3.0  NaN

>>> df2.dropna(how='any')

   one  two

a  1.0  2.0

b  2.0  1.0

# df.dropna(axis=1): drop every column that contains a missing value; axis defaults to 0

>>> df.loc['c','one']=np.nan    # introduce a missing value into df

>>> df

   two  one

c    1  NaN

d    2  5.0

b    3  6.0

a    4  7.0

>>> df.dropna(axis=1)

   two

c    1

d    2

b    3

a    4

# df.dropna(axis=1,thresh=n): drop every column that has fewer than n non-missing values

>>> df.dropna(axis=1, thresh=4)

   two

c    1

d    2

b    3

a    4
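thresh works the same way along rows (the default axis=0). The sketch below rebuilds the modified df2 from above by hand; the names at_least_one and at_least_two are made up for the example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'one': [1, 2, 3, np.nan], 'two': [2, 1, np.nan, np.nan]},
                   index=['a', 'b', 'c', 'd'])

at_least_one = df2.dropna(thresh=1)   # keep rows with at least 1 non-missing value
at_least_two = df2.dropna(thresh=2)   # keep rows with at least 2 non-missing values

print(at_least_one)
print(at_least_two)
```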

# df.isnull(): test for missing values in the DataFrame and return a Boolean array

>>> df2.isnull()

     one    two

a  False  False

b  False  False

c  False   True

d   True   True

# df.notnull(): test for non-missing values in the DataFrame and return a Boolean array

>>> df2.notnull()

     one    two

a   True   True

b   True   True

c   True  False

d  False  False

IV. Common pandas functions (methods)

>>> df

   two  one

c    1  NaN

d    2  5.0

b    3  6.0

a    4  7.0

# mean(axis=0,skipna=True): mean of each column (or row)

>>> df.mean()         # missing values are skipped; by default the mean of each column

two    2.5

one    6.0

dtype: float64

>>> df.mean(axis=1)   # missing values are skipped; mean of each row

c    1.0

d    3.5

b    4.5

a    5.5

dtype: float64

# sum(axis=0): sum of each column (or row)

>>> df.sum()          # sum of each column

two    10.0

one    18.0

dtype: float64

>>> df.sum(axis=1)    # sum of each row

c     1.0

d     7.0

b     9.0

a    11.0

dtype: float64

# sort_index(axis,...,ascending): sort by the row (or column) index

>>> df.sort_index()        # by default, sort the row index in ascending order

   two  one

a    4  7.0

b    3  6.0

c    1  NaN

d    2  5.0

>>> df.sort_index(ascending=False)   # sort the row index in descending order

   two  one

d    2  5.0

c    1  NaN

b    3  6.0

a    4  7.0

>>> df.sort_index(axis=1)      # sort the column index in ascending order

   one  two      # 'one' sorts before 'two'

c  NaN    1

d  5.0    2

b  6.0    3

a  7.0    4

>>> df.sort_index(ascending=False,axis=1)   # sort the column index in descending order

   two  one

c    1  NaN

d    2  5.0

b    3  6.0

a    4  7.0

# sort_values(by,axis,ascending): sort by the values of a column (or row)

>>> df.sort_values(by='two')   # sort by column two

   two  one

c    1  NaN

d    2  5.0

b    3  6.0

a    4  7.0

>>> df.sort_values(by='two', ascending=False)   # ascending defaults to True; False sorts in descending order

   two  one

a    4  7.0

b    3  6.0

d    2  5.0

c    1  NaN

>>> df.sort_values(by='a',ascending=False,axis=1)  # reorder the columns by the values in row a, descending

   one  two

c  NaN    1

d  5.0    2

b  6.0    3

a  7.0    4

# When sorting by a column, missing values are placed last by default

>>> df.sort_values(by='one')

   two  one

d    2  5.0

b    3  6.0

a    4  7.0

c    1  NaN

>>> df.sort_values(by='one', ascending=False)

   two  one

a    4  7.0

b    3  6.0

d    2  5.0

c    1  NaN

Note: NumPy universal functions also work on pandas objects.
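As a minimal sketch of that note, applying np.sqrt to a DataFrame returns a DataFrame with the same labels, and missing values stay missing. The frame below mirrors the df used in this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'two': [1, 2, 3, 4], 'one': [np.nan, 5, 6, 7]},
                  index=['c', 'd', 'b', 'a'])

roots = np.sqrt(df)   # element-wise square root; row and column labels are preserved
print(roots)
```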

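V. pandas: time series processing

Types of time series data:

(1) Timestamp: a specific moment

(2) Fixed period: for example, July 2017

(3) Time interval: from a start time to an end time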
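One possible way to represent these three kinds of values in pandas is sketched below. Mapping them to pd.Timestamp, pd.Period and pd.Interval is an assumption made for illustration; the text above does not name specific classes.

```python
import pandas as pd

ts = pd.Timestamp('2017-07-15 08:30')      # a specific moment
period = pd.Period('2017-07', freq='M')    # the whole month of July 2017
span = pd.Interval(pd.Timestamp('2017-07-01'),
                   pd.Timestamp('2017-07-31'))   # from a start time to an end time

print(period.start_time, period.end_time)  # first and last instant of the period
print(ts in span)                          # membership test against the interval
```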

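1. Handling time objects

The Python standard library handles time objects with the datetime module. The datetime class in that module has a strptime() method that parses a string into a datetime object.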

>>> import datetime

>>> datetime.datetime.strptime('2010-01-01', '%Y-%m-%d')

datetime.datetime(2010, 1, 1, 0, 0)

(1) Flexible parsing of time objects: dateutil

>>> import dateutil.parser

>>> dateutil.parser.parse('2001-01-01')    # separated by -

datetime.datetime(2001, 1, 1, 0, 0)

>>> dateutil.parser.parse('2001/01/01')    # separated by /

datetime.datetime(2001, 1, 1, 0, 0)

>>> dateutil.parser.parse('02/03/2001')    # a trailing year is recognized too

datetime.datetime(2001, 2, 3, 0, 0)

>>> dateutil.parser.parse('2001-JAN-01')   # English month names are recognized

datetime.datetime(2001, 1, 1, 0, 0)

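(2) Handling time objects in bulk: pandas

The result is typically used as an index.

>>> pd.to_datetime(['2001-01-01','2010/Feb/02'])   # differently formatted strings are converted into one DatetimeIndex (pandas 2.0+ may need format='mixed' for mixed formats)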

DatetimeIndex(['2001-01-01', '2010-02-02'], dtype='datetime64[ns]', freq=None)
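A minimal sketch of using the result of to_datetime as an index. The dates and values are made up for the example, and all strings share one format so no extra arguments are needed.

```python
import pandas as pd

idx = pd.to_datetime(['2017-01-03', '2017-01-01', '2017-01-02'])
sr = pd.Series([30, 10, 20], index=idx).sort_index()   # sort chronologically

print(sr.index.year.tolist())                # date components are available on the index
print(sr.loc[pd.Timestamp('2017-01-02')])    # look up a value by timestamp label
```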

2. Generating time objects

The date_range function in pandas is defined as follows:

def date_range(start=None, end=None, periods=None, freq=None, tz=None,

               normalize=False, name=None, closed=None, **kwargs):

    """

    Return a fixed frequency DatetimeIndex.

    Parameters

    ----------

    start : str or datetime-like, optional    (start time)

        Left bound for generating dates.

    end : str or datetime-like, optional      (end time)

        Right bound for generating dates.

    periods : integer, optional               (number of periods)

        Number of periods to generate.

    freq : str or DateOffset, default 'D'     (frequency, default 'D'; other aliases include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), ...)

        Frequency strings can have multiples, e.g. '5H'. See

        :ref:`here <timeseries.offset_aliases>` for a list of

        frequency aliases.

    tz : str or tzinfo, optional

        Time zone name for returning localized DatetimeIndex, for example

        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is

        timezone-naive.

    normalize : bool, default False

        Normalize start/end dates to midnight before generating date range.

    name : str, default None

        Name of the resulting DatetimeIndex.

    closed : {None, 'left', 'right'}, optional

        Make the interval closed with respect to the given frequency to

        the 'left', 'right', or both sides (None, the default).

    **kwargs

        For compatibility. Has no effect on the result.

    """

Usage examples:

>>> pd.date_range('2010-01-01','2010-5-1')                    # give a start time and an end time

DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',

               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',

               '2010-01-09', '2010-01-10',

               ...

               '2010-04-26', '2010-04-27', '2010-04-28', '2010-04-29',

               '2010-04-30', '2010-05-01'],

              dtype='datetime64[ns]', length=121, freq='D')

>>> pd.date_range('2010-01-01', periods=10)                   # give a start time and a number of periods

DatetimeIndex(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04',

               '2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',

               '2010-01-09', '2010-01-10'],

              dtype='datetime64[ns]', freq='D')

>>> pd.date_range('2010-01-01', periods=10, freq='H')         # hourly frequency

DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:00:00',

               '2010-01-01 02:00:00', '2010-01-01 03:00:00',

               '2010-01-01 04:00:00', '2010-01-01 05:00:00',

               '2010-01-01 06:00:00', '2010-01-01 07:00:00',

               '2010-01-01 08:00:00', '2010-01-01 09:00:00'],

              dtype='datetime64[ns]', freq='H')

>>> pd.date_range('2010-01-01', periods=10, freq='W-MON')     # weekly frequency, on Mondays

DatetimeIndex(['2010-01-04', '2010-01-11', '2010-01-18', '2010-01-25',

               '2010-02-01', '2010-02-08', '2010-02-15', '2010-02-22',

               '2010-03-01', '2010-03-08'],

              dtype='datetime64[ns]', freq='W-MON')

>>> pd.date_range('2010-01-01', periods=10, freq='B')         # business-day frequency

DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',

               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',

               '2010-01-13', '2010-01-14'],

              dtype='datetime64[ns]', freq='B')

>>> pd.date_range('2010-01-01', periods=10, freq='1h20min')   # every one hour and twenty minutes

DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:20:00',

               '2010-01-01 02:40:00', '2010-01-01 04:00:00',

               '2010-01-01 05:20:00', '2010-01-01 06:40:00',

               '2010-01-01 08:00:00', '2010-01-01 09:20:00',

               '2010-01-01 10:40:00', '2010-01-01 12:00:00'],

              dtype='datetime64[ns]', freq='80T')

# Converting to a datetime object

>>> dt = pd.date_range('2010-01-01', periods=10, freq='B')

>>> dt[0]

Timestamp('2010-01-01 00:00:00', freq='B')

>>> dt[0].to_pydatetime()    # convert to a Python datetime object

datetime.datetime(2010, 1, 1, 0, 0)

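3. pandas: time series

A time series is a Series or DataFrame indexed by time objects.

When datetime objects are used as an index, they are stored in a DatetimeIndex object.

Special features of time series: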

>>> sr = pd.Series(np.arange(1000),index=pd.date_range('2017-01-01', periods=1000))

>>> sr

2017-01-01      0

2017-01-02      1

2017-01-03      2

2017-01-04      3

...

2019-09-26    998

2019-09-27    999

Freq: D, Length: 1000, dtype: int64

# Feature 1: pass a "year" or "year-month" string to select that span

>>> sr['2017']      # select by year

2017-01-01      0

2017-01-02      1

...

2017-12-30    363

2017-12-31    364

Freq: D, Length: 365, dtype: int64

>>> sr['2017-05']   # select by year and month

2017-05-01    120

2017-05-02    121

...

2017-05-30    149

2017-05-31    150

Freq: D, dtype: int64

# Feature 2: pass a date range as a slice

>>> sr['2017-10-25':'2018-03']   # from 25 October 2017 through March 2018

2017-10-25    297

2017-10-26    298

...

2018-03-30    453

2018-03-31    454

Freq: D, Length: 158, dtype: int64

# Feature 3: rich method support: resample(), truncate(), ...

# resample(): resampling

>>> sr.resample('W').sum()   # sum for each week

2017-01-01       0

2017-01-08      28

...

2019-09-22    6937

2019-09-29    4985

Freq: W-SUN, Length: 144, dtype: int64

>>> sr.resample('M').sum()   # sum for each month

2017-01-31      465

2017-02-28     1246

...

2019-08-31    29667

2019-09-30    26622

Freq: M, dtype: int64

>>> sr.resample('M').mean()   # mean of the daily values within each month

2017-01-31     15.0

2017-02-28     44.5

2017-03-31     74.0

...

2019-08-31    957.0

2019-09-30    986.0

Freq: M, dtype: float64

# truncate(): truncation

>>> sr.truncate(before='2018-04-01')   # drop everything before 1 April 2018

2018-04-01    455

2018-04-02    456

...

2019-09-26    998

2019-09-27    999

Freq: D, Length: 545, dtype: int64

>>> sr.truncate(after='2018-01-01')    # drop everything after 1 January 2018

2017-01-01      0

2017-01-02      1

2017-01-03      2

...

2017-12-31    364

2018-01-01    365

Freq: D, Length: 366, dtype: int64
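The same two methods can be checked by hand on a series small enough to add up mentally. This is a minimal sketch with made-up data covering exactly two Monday-to-Sunday weeks (2 January 2017 is a Monday).

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(14), index=pd.date_range('2017-01-02', periods=14))

# 'W' bins end on Sunday, so each bin here holds one full week.
weekly = sr.resample('W').sum()

# before and after can be combined; both boundary dates are kept.
window = sr.truncate(before='2017-01-05', after='2017-01-08')

print(weekly)
print(window)
```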

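VI. pandas: file handling

A common format for data files is csv (values separated by a delimiter).

Besides csv, pandas supports other file types such as JSON, XML, HTML, databases, pickle and Excel.

1. Reading files with pandas

Data can be loaded from a file name, a URL or a file object.

(1) read_csv: the default separator is a comma

>>> pd.read_csv('601318.csv')   # the saved index column shows up as 'Unnamed: 0', and a new default index is generated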

      Unnamed: 0        date    open  ...     low      volume    code

0              0  2007-03-01  21.878  ...  20.040  1977633.51  601318

1              1  2007-03-02  20.565  ...  20.075   425048.32  601318

2              2  2007-03-05  20.119  ...  19.047   419196.74  601318

          ...         ...     ...  ...     ...         ...     ...

2561        2561  2017-12-14  72.120  ...  70.600   676186.00  601318

2562        2562  2017-12-15  70.690  ...  70.050   735547.00  601318

[2563 rows x 8 columns]

>>> pd.read_csv('601318.csv',index_col=0)   # use column 0 as the index

            date    open   close    high     low      volume    code

0     2007-03-01  21.878  20.473  22.302  20.040  1977633.51  601318

1     2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318

          ...     ...     ...     ...     ...         ...     ...

2561  2017-12-14  72.120  71.010  72.160  70.600   676186.00  601318

2562  2017-12-15  70.690  70.380  71.440  70.050   735547.00  601318

[2563 rows x 7 columns]

>>> pd.read_csv('601318.csv',index_col='date')   # use the date column as the index

            Unnamed: 0    open   close    high     low      volume    code

date

2007-03-01           0  21.878  20.473  22.302  20.040  1977633.51  601318

2007-03-02           1  20.565  20.307  20.758  20.075   425048.32  601318

                ...     ...     ...     ...     ...         ...     ...

2017-12-14        2561  72.120  71.010  72.160  70.600   676186.00  601318

2017-12-15        2562  70.690  70.380  71.440  70.050   735547.00  601318

[2563 rows x 7 columns]

# Note: the index above looks like dates, but the values are strings, not time objects

>>> df = pd.read_csv('601318.csv',index_col='date')

>>> df.index

Index(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06', '2007-03-07',

       '2007-03-08', '2007-03-09', '2007-03-12', '2007-03-13', '2007-03-14',

       ...

       '2017-12-04', '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08',

       '2017-12-11', '2017-12-12', '2017-12-13', '2017-12-14', '2017-12-15'],

      dtype='object', name='date', length=2563)

# Ways to convert them into time objects:

# Method 1:

>>> df = pd.read_csv('601318.csv',index_col='date', parse_dates=True)  # parse_dates=True tries to parse the index as dates

>>> df

            Unnamed: 0    open   close    high     low      volume    code

date

2007-03-01           0  21.878  20.473  22.302  20.040  1977633.51  601318

2007-03-02           1  20.565  20.307  20.758  20.075   425048.32  601318

                ...     ...     ...     ...     ...         ...     ...

2017-12-14        2561  72.120  71.010  72.160  70.600   676186.00  601318

2017-12-15        2562  70.690  70.380  71.440  70.050   735547.00  601318

[2563 rows x 7 columns]

>>> df.index   # the index is now a DatetimeIndex

DatetimeIndex(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06',

               '2007-03-07', '2007-03-08', '2007-03-09', '2007-03-12',

               ...

               '2017-12-08', '2017-12-11', '2017-12-12', '2017-12-13',

               '2017-12-14', '2017-12-15'],

              dtype='datetime64[ns]', name='date', length=2563, freq=None)

# Method 2:

>>> df = pd.read_csv('601318.csv',index_col='date', parse_dates=['date'])  # parse_dates also accepts a list naming the columns to convert

>>> df.index

DatetimeIndex(['2007-03-01', '2007-03-02', '2007-03-05', '2007-03-06',

               '2007-03-07', '2007-03-08', '2007-03-09', '2007-03-12',

               ...

               '2017-12-08', '2017-12-11', '2017-12-12', '2017-12-13',

               '2017-12-14', '2017-12-15'],

              dtype='datetime64[ns]', name='date', length=2563, freq=None)

# header=None: states that the file has no header row; integer column names are generated automatically

>>> pd.read_csv('601318.csv',header=None)

           0           1       2       3       4       5           6       7   # new column names

0        NaN        date    open   close    high     low      volume    code

1        0.0  2007-03-01  21.878  20.473  22.302   20.04  1977633.51  601318

2        1.0  2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318

      ...         ...     ...     ...     ...     ...         ...     ...

2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318

2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318

[2564 rows x 8 columns]

# The names parameter can also be used to set the column names

>>> pd.read_csv('601318.csv',header=None, names=list('abcdefgh'))

           a           b       c       d       e       f           g       h

0        NaN        date    open   close    high     low      volume    code

1        0.0  2007-03-01  21.878  20.473  22.302   20.04  1977633.51  601318

2        1.0  2007-03-02  20.565  20.307  20.758  20.075   425048.32  601318

      ...         ...     ...     ...     ...     ...         ...     ...

2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318

2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318

[2564 rows x 8 columns]

(2) read_table: the default separator is a tab

read_table is used in essentially the same way as read_csv.
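A minimal sketch of that equivalence, reading tab-separated text from an in-memory buffer instead of a file (the sample data is made up):

```python
import io
import pandas as pd

text = "a\tb\tc\n1\t2\t3\n2\t4\t6\n"

df_table = pd.read_table(io.StringIO(text))          # tab is the default separator
df_csv = pd.read_csv(io.StringIO(text), sep='\t')    # the same result through read_csv

print(df_table)
print(df_table.equals(df_csv))
```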

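(3) Main parameters of read_csv and read_table

# sep: the separator; a regular expression such as '\s+' is allowed

# header=None: states that the file has no header row

# names: the column names to use

# index_col: the column to use as the index

# skiprows: the rows to skip

>>> pd.read_csv('601318.csv',header=None, skiprows=[1,2,3])   # skip rows 1, 2 and 3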

           0           1       2       3       4       5          6       7

0        NaN        date    open   close    high     low     volume    code

1        3.0  2007-03-06  19.253    19.8  20.128  19.143  297727.88  601318

2        4.0  2007-03-07  19.817  20.338  20.522  19.651  287463.78  601318

3        5.0  2007-03-08  20.171  20.093  20.272  19.988  130983.83  601318

      ...         ...     ...     ...     ...     ...        ...     ...

2559  2561.0  2017-12-14   72.12   71.01   72.16    70.6   676186.0  601318

2560  2562.0  2017-12-15   70.69   70.38   71.44   70.05   735547.0  601318

[2561 rows x 8 columns]

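# na_values: strings that should be treated as missing values

# The string NaN is recognized as missing by default, but in the pandas version used here None is read as an ordinary string

>>> pd.read_csv('601318.csv',header=None, na_values=['None'])   # treat the string None as a missing value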

           0           1       2       3       4       5           6       7

0        NaN        date    open   close    high     low      volume    code

1        0.0  2007-03-01  21.878     NaN  22.302   20.04  1977633.51  601318

2        1.0  2007-03-02  20.565     NaN  20.758  20.075   425048.32  601318

      ...         ...     ...     ...     ...     ...         ...     ...

2561  2560.0  2017-12-13   71.21   72.12   72.62    70.2    865117.0  601318

2562  2561.0  2017-12-14   72.12   71.01   72.16    70.6    676186.0  601318

2563  2562.0  2017-12-15   70.69   70.38   71.44   70.05    735547.0  601318

[2564 rows x 8 columns]

# parse_dates: whether, or which, columns are parsed as dates; a boolean or a list
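The parameters listed above can be combined in one call. The sketch below reads made-up, whitespace-separated text from an in-memory buffer; the column names date, price and qty are hypothetical.

```python
import io
import pandas as pd

text = ("exported-data\n"
        "2017-01-01   1.5   None\n"
        "2017-01-02   2.5   7\n")

df = pd.read_csv(io.StringIO(text),
                 sep=r'\s+',                      # any run of whitespace separates fields
                 header=None,                     # the data has no header row
                 names=['date', 'price', 'qty'],  # so column names are supplied here
                 skiprows=[0],                    # skip the first line of the text
                 na_values=['None'],              # read the string None as missing
                 index_col='date',                # use the date column as the index
                 parse_dates=True)                # and parse that index as dates

print(df)
print(df.index)
```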

2. Writing csv files with pandas

Use the to_csv function to write to a csv file.

>>> df = pd.read_csv('601318.csv',index_col=0)

>>> df.iloc[0,0]=np.nan     # set row 0, column 0 to NaN

# Write to a new file

>>> df.to_csv('test.csv')

# Main parameters of the writing function

# sep: the separator used in the file

# header=False: do not write the header row

>>> df.to_csv('test.csv', header=False)

# index=False: do not write the row index column

>>> df.to_csv('test.csv', index=False)

# na_rep: the string written for missing values; defaults to an empty string

>>> df.to_csv('test.csv', header=False, index=False, na_rep='null')   # write null where values are missing

# columns: the columns to write, passed as a list of column labels

>>> df.to_csv('test.csv', header=False, index=False, na_rep='null', columns=['date','open','close','high'])  # write the first four columns
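To see the effect of these parameters without opening a file, to_csv can be called without a path, in which case it returns the csv text as a string. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [21.878, 20.565], 'close': [np.nan, 20.307]},
                  index=['2007-03-01', '2007-03-02'])

# No path given, so the csv content is returned instead of written to disk.
text = df.to_csv(sep=';', index=False, na_rep='null', columns=['close', 'open'])
print(text)
```

The columns argument also controls the output order, so close is written before open here.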

3. Writing and reading other file formats with pandas

>>> df.to_html('test.html')   # write the data in HTML format

>>> df.to_json('test.json')   # write the data in JSON format

>>> pd.read_json('test.json')   # read a JSON file

      Unnamed: 0       date    open   close    high     low      volume    code

0              0 2007-03-01  21.878    None  22.302  20.040  1977633.51  601318

1              1 2007-03-02  20.565    None  20.758  20.075   425048.32  601318

          ...        ...     ...     ...     ...     ...         ...     ...

998          998 2011-07-07  22.438  21.985  22.465  21.832   230480.00  601318

999          999 2011-07-08  22.076  21.936  22.212  21.850   141415.00  601318

[2563 rows x 8 columns]

>>> pd.read_html('test.html')   # read an HTML file; the result is a list of DataFrames

[      Unnamed: 0  Unnamed: 0.1        date  ...     low      volume    code

0              0             0  2007-03-01  ...  20.040  1977633.51  601318

1              1             1  2007-03-02  ...  20.075   425048.32  601318

          ...           ...         ...  ...     ...         ...     ...

2561        2561          2561  2017-12-14  ...  70.600   676186.00  601318

2562        2562          2562  2017-12-15  ...  70.050   735547.00  601318

[2563 rows x 9 columns]]
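A JSON round trip can also be sketched entirely in memory. This is a minimal illustration with made-up data; wrapping the string in io.StringIO is a choice made here so that read_json receives a file-like object.

```python
import io
import pandas as pd

df = pd.DataFrame({'open': [21.878, 20.565], 'code': [601318, 601318]})

text = df.to_json()                      # no path given, so the JSON text is returned
back = pd.read_json(io.StringIO(text))   # parse it back into a DataFrame

print(text)
print(back)
```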