pandas in Practice with Python

 #!/usr/bin/env python

 # encoding: utf-8

 __author__ = 'Administrator'

 import pandas as pd

 import numpy as np

 import matplotlib.pyplot as plt

 # I. Creating objects

 # 1. Create a Series by passing a list; pandas creates a default integer index:

 s=pd.Series([1,3,4,np.nan,6,8])

 print(s)

 # 0    1.0

 # 1    3.0

 # 2    4.0

 # 3    NaN

 # 4    6.0

 # 5    8.0

 # dtype: float64

 # 2. Create a DataFrame by passing a NumPy array, a datetime index, and column labels:

 dates=pd.date_range('2018-03-01',periods=6)

 print(dates)

 # DatetimeIndex(['2018-03-01', '2018-03-02', '2018-03-03', '2018-03-04',

 #                '2018-03-05', '2018-03-06'],

 #               dtype='datetime64[ns]', freq='D')

 df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

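 # numpy.random.randn(d0, d1, ..., dn) returns samples drawn from the standard normal distribution, so values can be negative.

 # numpy.random.rand(d0, d1, ..., dn) returns samples drawn uniformly from [0, 1).

 # P=numpy.random.rand(N,K)  # generates a random matrix with N rows and K columns

 # Note: the random data is not seeded, so the sample outputs shown in the comments come from separate runs and do not match one another.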

 print(df)

 #                    A         B         C         D

 # 2018-03-01 -0.451506 -0.884044 -0.916664 -0.763684

 # 2018-03-02 -0.463568  0.340688 -0.077484 -0.237660

 # 2018-03-03 -1.533427  0.301283  0.268640 -0.011027

 # 2018-03-04  1.036050  0.402203  0.485365  2.086525

 # 2018-03-05  0.221578 -0.821756 -0.265241  0.277563

 # 2018-03-06  1.774195 -0.288553  1.527936  0.119153

 # 3. Create a DataFrame by passing a dict of objects that can be converted to series-like structures:

 df2=pd.DataFrame({

     'A':1.,

     'B':pd.Timestamp('2018-03-01'),

     'C':pd.Series(1,index=list(range(4)),dtype='float32'),

     'D':np.array([3]*4,dtype='int32'),

     'E':pd.Categorical(["test","train","test","train"]),

     'F':'foo'})

 print(df2)

 #      A          B    C  D      E    F

 # 0  1.0 2018-03-01  1.0  3   test  foo

 # 1  1.0 2018-03-01  1.0  3  train  foo

 # 2  1.0 2018-03-01  1.0  3   test  foo

 # 3  1.0 2018-03-01  1.0  3  train  foo

 # 4. View the data type of each column:

 print(df2.dtypes)

 # A           float64

 # B    datetime64[ns]

 # C           float32

 # D             int32

 # E          category

 # F            object

 # dtype: object

 # II. Viewing data

 # 1. View the first and last rows of the DataFrame:

 print(df.head())

 #                    A         B         C         D

 # 2018-03-01 -0.250132 -1.403066  1.234990 -3.077763

 # 2018-03-02  0.387496 -0.389183  0.186663  1.124608

 # 2018-03-03 -0.105463 -0.230739 -0.227575  0.308565

 # 2018-03-04 -1.703507  0.194876  1.790366 -0.561566

 # 2018-03-05 -0.511609  0.695915  0.398392  0.107062

 print(df.tail(3))

 #                    A         B         C         D

 # 2018-03-04  0.704065  0.492649  0.533961 -1.518723

 # 2018-03-05  2.192819 -0.508099 -0.173966 -0.401864

 # 2018-03-06 -0.839634 -0.314676 -0.808266 -0.578229

 # 2. Display the index, the columns, and the underlying NumPy data:

 print(df.index)

 # DatetimeIndex(['2018-03-01', '2018-03-02', '2018-03-03', '2018-03-04',

 #                '2018-03-05', '2018-03-06'],

 #               dtype='datetime64[ns]', freq='D')

 print(df.columns)

 #Index(['A', 'B', 'C', 'D'], dtype='object')

 print(df.values)

 # [[ 1.65612186 -0.47932887  0.9673593  -0.63872414]

 #  [ 0.12229686  0.08831358  1.07344126 -0.12742276]

 #  [ 0.54654075  0.77281164 -0.6396787   0.1585142 ]

 #  [-0.70695944 -2.12273423 -0.24549759 -0.09530991]

 #  [ 2.66920788  0.6520858   1.72857641 -1.34418643]

 #  [ 1.87333346 -0.42716996  0.49558928 -1.47606701]]

 # 3. describe() shows a quick statistical summary of the data:

 print(df.describe())

 #               A         B         C         D

 # count  6.000000  6.000000  6.000000  6.000000

 # mean   0.399068  0.339270  0.755588 -0.459344

 # std    0.890360  1.011113  0.851783  1.759264

 # min   -1.002101 -0.806772 -0.333761 -2.411582

 # 25%   -0.087757 -0.400563  0.338822 -1.782221

 # 50%    0.577418  0.244011  0.502612 -0.622453

 # 75%    1.096592  0.941454  1.376095  0.433235

 # max    1.281508  1.795854  1.910586  2.284103

 # 4. Transpose the data:

 print(df.T)

 #    2018-03-01  2018-03-02  2018-03-03  2018-03-04  2018-03-05  2018-03-06

 # A    0.843347   -0.906826   -0.528945    1.186650   -1.839152   -0.508169

 # B   -0.105481    2.084689   -1.106710    0.521137    0.741946    0.399700

 # C   -0.786144    0.269116   -0.180710    3.345385    1.310786   -0.204216

 # D    0.453731   -0.243617    0.701440    2.541094    1.337923   -0.673128

 # 5. Sort by an axis

 print(df.sort_index(axis=1,ascending=False)) # axis=0 sorts by the row labels, axis=1 sorts by the column labels; ascending=False sorts in descending order (the default is ascending)

 #                    D         C         B         A

 # 2018-03-01  0.389294 -0.227394  0.649234  0.639820

 # 2018-03-02  0.680265  0.466626 -1.940228  0.843753

 # 2018-03-03  1.520800  0.570192  1.244427 -0.715080

 # 2018-03-04  0.309068 -0.224222 -0.226254  1.416381

 # 2018-03-05 -1.854131 -0.403245 -0.017054  0.840840

 # 2018-03-06 -1.991173  1.275825  0.913996  1.561550

 # 6. Sort by values

 print(df.sort_values(by='B')) # DataFrame.sort() was removed from pandas; sort_values() is its replacement

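 # III. Selection

 # Standard Python/NumPy expressions for selecting and setting work directly,

 # but for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc (.ix was removed in pandas 1.0).

 # (1) Getting:

 # 1. Select a single column, which returns a Series; equivalent to df.A: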

 print(df['A'])

 # 2018-03-01    0.156236

 # 2018-03-02   -0.041257

 # 2018-03-03   -0.970551

 # 2018-03-04   -1.751839

 # 2018-03-05    1.521352

 # 2018-03-06    0.828690

 # Freq: D, Name: A, dtype: float64

 # 2. Select with [], which slices the rows:

 print(df[0:3])

 #                    A         B         C         D

 # 2018-03-01 -0.432011  0.697033 -3.028116 -0.217882

 # 2018-03-02 -1.744071  0.647694  1.031179 -1.043985

 # 2018-03-03 -0.673125  0.689913  0.648986 -1.471825

 print(df['2018-03-02':'2018-03-04'])

 #                    A         B         C         D

 # 2018-03-02 -0.803947  0.147807 -0.248534  0.496719

 # 2018-03-03 -1.518123  0.376390 -0.793349  0.612074

 # 2018-03-04  0.146634  0.506102  1.316693 -0.801691

 # (2) Selection by label:

 # 1. Get a cross-section using a label:

 print(df.loc[dates[0]])

 # A   -1.593039

 # B    0.400735

 # C   -0.870638

 # D   -0.551766

 # Name: 2018-03-01 00:00:00, dtype: float64

 # 2. Select on multiple axes by label:

 print(df.loc[:,['A','B']])

 #                    A         B

 # 2018-03-01  0.326446  0.633246

 # 2018-03-02  0.169674  0.892832

 # 2018-03-03 -0.755691 -2.028912

 # 2018-03-04 -1.005360  0.529193

 # 2018-03-05 -0.457140  0.842211

 # 2018-03-06  0.343157  0.879763

 # 3. Label slicing (both endpoints are included):

 print(df.loc['2018-03-02':'2018-03-04',['A','B']])

 #                    A         B

 # 2018-03-02  0.197173  0.040377

 # 2018-03-03  2.064367  1.112152

 # 2018-03-04  0.888216 -0.591129

 # 4. Reduce the dimensions of the returned object:

 print(df.loc['2018-03-02',['A','B']])

 # A   -0.259955

 # B   -0.019266

 # Name: 2018-03-02 00:00:00, dtype: float64

 # 5. Get a scalar value:

 print(df.loc[dates[0],'A']) #-0.313259346223

 # 6. Fast access to a scalar (equivalent to the previous method):

 print(df.at[dates[0],'A'])  #-0.313259346223

 # (3) Selection by position:

 # 1. Select by passing an integer position (this selects a row):

 print(df.iloc[3])

 # A    1.661488

 # B   -1.175748

 # C    0.642823

 # D   -0.491914

 # Name: 2018-03-04 00:00:00, dtype: float64

 # 2. Slice by integer positions, much like NumPy/Python:

 print(df.iloc[3:5,0:2]) # rows at positions 3 and 4 (the 4th and 5th rows), columns at positions 0 and 1 (A and B)

 #                    A         B

 # 2018-03-04  0.492426  0.412712

 # 2018-03-05  0.541252 -0.009380
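
 # The two slicing styles shown so far treat the end of the range differently. A minimal sketch, assuming a small hypothetical frame named demo (not part of the original script): .iloc slices exclude the end position, while .loc label slices include the end label.

```python
import pandas as pd

# Hypothetical frame: six daily rows with a single integer column.
dates = pd.date_range("2018-03-01", periods=6)
demo = pd.DataFrame({"A": range(6)}, index=dates)

# Positional slice: the end position is excluded, as with Python lists.
by_position = demo.iloc[1:3]

# Label slice: both endpoints are included.
by_label = demo.loc["2018-03-02":"2018-03-04"]

print(len(by_position), len(by_label))
```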

 # 3. Select by lists of integer positions, much like NumPy/Python:

 print(df.iloc[[1,2,4],[0,2]])

 #                    A         C

 # 2018-03-02 -0.638074  1.794516

 # 2018-03-03 -0.403471 -0.934373

 # 2018-03-05 -1.309320  1.353276

 # 4. Slice rows:

 print(df.iloc[1:3,:])

 #                    A         B         C         D

 # 2018-03-02  1.980513 -0.218688  2.627449  1.314947

 # 2018-03-03 -0.532379  1.382092 -1.270961  0.722475

 # 5. Slice columns:

 print(df.iloc[:,1:3])

 #                    B         C

 # 2018-03-01  0.332228 -1.682811

 # 2018-03-02 -0.533398 -0.254960

 # 2018-03-03 -0.926688  0.890513

 # 2018-03-04 -0.448742  0.763850

 # 2018-03-05 -0.841622  0.514873

 # 2018-03-06 -1.346557  1.516414

 # 6. Get a specific value:

 print(df.iloc[1,1]) #0.481882236461

 print(df.iat[1,1]) #0.481882236461

 # (4) Boolean indexing:

 # 1. Select rows using the values of a single column:

 print(df[df.A>0])

 #                    A         B         C         D

 # 2018-03-02  0.566243  1.510954 -0.898180  0.856439

 # 2018-03-03  1.008447 -1.597226 -0.665134 -0.287472

 # 2018-03-05  0.952498 -0.144979  0.620468 -0.830652

 # 2. Select values from the DataFrame where a boolean condition is met:

 print(df[df>0])

 #                    A         B         C         D

 # 2018-03-01  0.892660       NaN       NaN       NaN

 # 2018-03-02  1.512600       NaN       NaN  1.375527

 # 2018-03-03  0.970026  1.184603  1.182990       NaN

 # 2018-03-04  1.913993       NaN  0.914778  0.137170

 # 2018-03-05  0.482589       NaN       NaN  0.668817

 # 2018-03-06       NaN  0.539344  0.142892       NaN
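
 # Indexing a DataFrame with a boolean DataFrame keeps the shape and puts NaN wherever the condition is False. A minimal sketch, assuming a small hypothetical frame named demo, suggesting that this matches an explicit DataFrame.where call:

```python
import pandas as pd

# Hypothetical two-by-two frame with mixed signs.
demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# Boolean-frame indexing: non-positive entries become NaN.
masked = demo[demo > 0]

# The explicit form of the same operation.
same = demo.where(demo > 0)

print(masked.equals(same))
```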

 # 3. Filter with the isin() method:

 df2=df.copy()

 df2['E']=['one','one','two','three','four','three']

 print(df2)

 #                    A         B         C         D      E

 # 2018-03-01 -1.138724  0.566583  0.338254  2.072839    one

 # 2018-03-02 -0.366949  0.335546  1.653024  1.445071    one

 # 2018-03-03  0.724615  1.715933 -0.754757 -1.452252    two

 # 2018-03-04 -0.881962 -0.173858 -0.340868 -0.556665  three

 # 2018-03-05 -2.126513 -0.113010 -0.796566  0.210673   four

 # 2018-03-06  0.716490  0.223395 -1.428238  0.328406  three

 print(df2[df2['E'].isin(['two','four'])])

 #                    A         B         C         D     E

 # 2018-03-03 -0.737833 -1.161520  0.897204 -0.029158   two

 # 2018-03-05  1.072054  1.234587  0.935680 -1.284542  four

 # (5) Setting:

 # 1. Set a new column (the values are aligned by index):

 s1=pd.Series([1,2,3,4,5,6],index=pd.date_range('2018-03-02',periods=6))

 print(s1)

 # 2018-03-02    1

 # 2018-03-03    2

 # 2018-03-04    3

 # 2018-03-05    4

 # 2018-03-06    5

 # 2018-03-07    6

 # Freq: D, dtype: int64

 df['F']=s1

 print(df)

 #                    A         B         C         D    F

 # 2018-03-01  2.413592 -0.336264  0.165597  2.143270  NaN

 # 2018-03-02 -1.921596 -2.100707 -0.454461  0.563247  1.0

 # 2018-03-03 -0.235034 -0.517009 -2.409731 -0.711854  2.0

 # 2018-03-04  0.667604 -0.838737 -0.425916 -0.238519  3.0

 # 2018-03-05  1.057415  1.457143  0.440690  0.948613  4.0

 # 2018-03-06  0.539187 -0.952633  0.316752  0.422146  5.0

 # 2. Set a value by label:

 df.at[dates[0],'A']=0

 # 3. Set a value by position:

 df.iat[0,1]=0

 # 4. Set a group of values by assigning a NumPy array:

 df.loc[:,'D']=np.array([5]*len(df))

 print(df)

 #                    A         B         C  D    F

 # 2018-03-01  0.000000  0.000000  0.164267  5  NaN

 # 2018-03-02  0.614534 -0.865975 -0.977389  5  1.0

 # 2018-03-03 -0.253095 -1.451951  2.360233  5  2.0

 # 2018-03-04  0.143115  0.363544  1.587648  5  3.0

 # 2018-03-05  0.010932  0.802590 -1.701589  5  4.0

 # 2018-03-06 -0.354579  0.830066  0.404646  5  5.0

 # 5. Set values with a where-style condition:

 df2=df.copy()

 df2[df2>0]=-df2

 print(df2)

 #                    A         B         C  D    F

 # 2018-03-01  0.000000  0.000000 -1.385454 -5  NaN

 # 2018-03-02 -0.773506 -0.444692 -0.620307 -5 -1.0

 # 2018-03-03 -0.506590 -2.445527 -0.664229 -5 -2.0

 # 2018-03-04 -0.568711 -0.709224 -2.582502 -5 -3.0

 # 2018-03-05 -1.074985 -2.480905 -0.537869 -5 -4.0

 # 2018-03-06 -2.659346 -1.055430 -0.379758 -5 -5.0

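 # IV. Missing data

 # pandas primarily uses np.nan to represent missing data; by default these values are excluded from computations. See the Missing Data section of the pandas documentation for details.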
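
 # A minimal sketch of what "excluded from computations" means, assuming a small hypothetical Series: reductions skip NaN by default, and skipna=False turns that off.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
values = pd.Series([1.0, np.nan, 3.0])

# Default behaviour: NaN is skipped, so the mean is taken over 1.0 and 3.0.
mean_default = values.mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_strict = values.mean(skipna=False)

# count() reports only the non-missing entries.
non_missing = values.count()

print(mean_default, mean_strict, non_missing)
```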

 # 1. reindex() lets you change, add, or delete the index on a specified axis; it returns a copy of the data:

 df1=df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])

 df1.loc[dates[0]:dates[1],'E']=1

 print(df1)

 #                    A         B         C         D    E

 # 2018-03-01 -0.275255 -0.290044  0.707118  1.094318  1.0

 # 2018-03-02 -1.340747  0.633546 -0.911210 -0.275105  1.0

 # 2018-03-03 -1.044219  0.659945  1.370910  0.262282  NaN

 # 2018-03-04 -0.015582  1.540852 -0.792882 -0.380751  NaN

 # 2. Drop any rows that contain missing values (dropna returns a new object, so df1 itself is unchanged):

 print(df1.dropna(how='any'))

 #                    A         B         C         D    E

 # 2018-03-01 -0.914568  0.784980 -1.698139 -0.096874  1.0

 # 2018-03-02 -0.410249 -0.494166  0.932946 -0.467547  1.0

 # 3. Fill missing values:

 df1=df1.fillna(value=5)

 print(df1)

 #                    A         B         C         D    E

 # 2018-03-01 -1.265605  0.778767 -0.947968 -1.330982  1.0

 # 2018-03-02  1.778973 -1.428542  1.257860  0.362724  1.0

 # 2018-03-03 -1.589094 -0.517478 -0.164942 -0.507224  5.0

 # 2018-03-04  2.363145  2.089114 -0.081683 -0.184851  5.0

 # 4. Get a boolean mask that marks where values are missing:

 df1=pd.isnull(df1)

 print(df1)

 #                 A      B      C      D      E

 # 2018-03-01  False  False  False  False  False

 # 2018-03-02  False  False  False  False  False

 # 2018-03-03  False  False  False  False  False

 # 2018-03-04  False  False  False  False  False

 # V. Operations

 # (1) Statistics (operations generally exclude missing data)

 # 1. Compute a descriptive statistic:

 print(df.mean())

 # A   -0.066441

 # B    0.154609

 # C   -0.154372

 # D   -0.155221

 # dtype: float64

 # 2. The same operation on the other axis:

 print(df.mean(1))

 # 2018-03-01   -0.138352

 # 2018-03-02   -0.226558

 # 2018-03-03    0.121705

 # 2018-03-04    0.855662

 # 2018-03-05   -0.892621

 # 2018-03-06    0.062726

 # Freq: D, dtype: float64

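 # 3. Operations between objects of different dimensionality require alignment; pandas automatically broadcasts along the specified dimension (for example, df.sub(s, axis='index')).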
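
 # The original script gives no code for this step. A minimal sketch, assuming a hypothetical frame named demo and a shifted Series named shifted: subtracting a Series from a DataFrame along the row index aligns on the dates and broadcasts across every column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: values 0..23 laid out as six rows by four columns.
dates = pd.date_range("2018-03-01", periods=6)
demo = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4),
                    index=dates, columns=list("ABCD"))

# Shifting by two rows leaves NaN in the first two positions
# and drops the last two values off the end.
shifted = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract the Series from every column, aligning on the row index.
result = demo.sub(shifted, axis="index")

print(result)
```

 # Rows whose aligned Series value is NaN come out as NaN in every column.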

 # (2) Apply

 # 1. Apply functions to the data:

 print(df)

 print(df.apply(np.cumsum))

 #                    A         B         C         D

 # 2018-03-01 -0.381460 -0.296346  1.229803 -1.300226

 # 2018-03-02  0.365891  0.974026  1.570268 -2.572981

 # 2018-03-03  0.624070  0.211935  0.635084 -1.110378

 # 2018-03-04  2.945062 -0.406832 -0.043918 -0.470773

 # 2018-03-05  3.542080  0.092974 -1.585544 -0.658267

 # 2018-03-06  3.440084  0.448828 -2.400617 -0.734055

 print(df.apply(lambda x:x.max()-x.min()))

 # A    2.702452

 # B    2.032463

 # C    2.771429

 # D    2.762828

 # dtype: float64

 # (3) Histogramming

 s=pd.Series(np.random.randint(0,7,size=10))

 print(s)

 # 0    2

 # 1    6

 # 2    6

 # 3    3

 # 4    3

 # 5    4

 # 6    4

 # 7    6

 # 8    6

 # 9    2

 # dtype: int32

 print(s.value_counts())

 # 6    4

 # 4    2

 # 3    2

 # 2    2

 # dtype: int64

 # (4) String methods

 # A Series has a set of string-processing methods in its str attribute that operate on each element of the array, as the following code shows.

 s=pd.Series(['A','B','C','Aaba','Baca',np.nan,'CABA','dog','cat'])

 print(s.str.lower())

 # 0       a

 # 1       b

 # 2       c

 # 3    aaba

 # 4    baca

 # 5     NaN

 # 6    caba

 # 7     dog

 # 8     cat

 # dtype: object

 # VI. Merging

 # pandas provides many ways to combine Series and DataFrame objects using various kinds of set logic (Panel has since been removed from pandas).

 # 1. Concat

 df=pd.DataFrame(np.random.randn(10,4))

 print(df)

 #           0         1         2         3

 # 0  0.620744 -0.921194  0.130483 -0.305914

 # 1  0.311699 -0.085041  0.638297 -0.077868

 # 2  0.327473 -0.732598 -0.134463  0.498805

 # 3 -0.622715 -0.819375 -0.473504 -0.379117

 # 4 -1.309207 -0.794917 -1.284665  0.830677

 # 5 -1.170121 -2.063048 -0.836381  0.925829

 # 6 -0.766342  0.454018 -0.181846 -1.052607

 # 7 -0.996856  0.189226  0.428375 -1.149523

 # 8  1.080517  1.884718 -0.065141 -0.781686

 # 9  0.087353  0.209678 -1.333989  0.863220

 #break it into pieces

 pieces=[df[:3],df[3:7],df[7:]]

 print(pieces)

 print(pd.concat(pieces))

 #           0         1         2         3

 # 0  1.187009 -0.493550  0.777065  1.494107

 # 1 -0.915190  1.228669  0.216910  1.610432

 # 2 -0.647737  1.961472  1.369682 -1.195257

 # 3  1.474973  1.968576  1.282678 -1.798167

 # 4  1.449858 -1.828631 -0.217424  0.992141

 # 5 -1.056223  0.464964  0.135468  0.181781

 # 6 -1.677772  1.456419  0.642563 -0.895238

 # 7  0.123780  0.030988  1.960217  0.140918

 # 8  1.071418  1.737486 -0.170948  0.859271

 # 9 -0.056640 -1.439686 -0.358960 -1.765060

 # 2. Join: SQL-style merges.

 left=pd.DataFrame({'key':['foo','foo'],'lval':[1,2]})

 print(left)

 #    key  lval

 # 0  foo     1

 # 1  foo     2

 right=pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})

 print(right)

 #    key  rval

 # 0  foo     4

 # 1  foo     5

 pd1=pd.merge(left,right,on='key')

 print(pd1)

 #    key  lval  rval

 # 0  foo     1     4

 # 1  foo     1     5

 # 2  foo     2     4

 # 3  foo     2     5

 # 3. Append: add a row to a DataFrame.

 df=pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])

 print(df)

 #           A         B         C         D

 # 0  0.205671 -1.236797 -1.127111  1.422836

 # 1  0.646151  0.202197 -0.160218 -0.839145

 # 2  1.479783 -0.678455  0.649959 -1.085791

 # 3 -0.851987 -0.821248  0.125836  0.819543

 # 4 -1.312988 -0.898903 -0.420592  1.672173

 # 5  0.240516 -0.711331 -0.717536  0.620066

 # 6 -0.442280  0.539277 -1.428910  1.060193

 # 7  0.257239 -2.034086  1.121833  1.518571

 s=df.iloc[3]

 df1=pd.concat([df,s.to_frame().T],ignore_index=True) # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement

 print(df1)

 #           A         B         C         D

 # 0  0.205671 -1.236797 -1.127111  1.422836

 # 1  0.646151  0.202197 -0.160218 -0.839145

 # 2  1.479783 -0.678455  0.649959 -1.085791

 # 3 -0.851987 -0.821248  0.125836  0.819543

 # 4 -1.312988 -0.898903 -0.420592  1.672173

 # 5  0.240516 -0.711331 -0.717536  0.620066

 # 6 -0.442280  0.539277 -1.428910  1.060193

 # 7  0.257239 -2.034086  1.121833  1.518571

 # 8 -0.851987 -0.821248  0.125836  0.819543

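 # VII. Grouping

 # By "group by" we mean a process involving one or more of the following steps:

 # * (splitting) split the data into groups based on some criteria;

 # * (applying) apply a function to each group independently;

 # * (combining) combine the results into a data structure.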
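
 # The three steps above can be sketched by hand, assuming a small hypothetical frame named demo; the built-in aggregation is expected to give the same per-group totals in a single call.

```python
import pandas as pd

# Hypothetical frame with one grouping column and one numeric column.
demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Splitting and applying: iterate over the groups and sum column C in each.
pieces = {key: group["C"].sum() for key, group in demo.groupby("A")}

# Combining: collect the per-group results into one Series.
manual = pd.Series(pieces, name="C").rename_axis("A")

# The built-in aggregation performs all three steps at once.
builtin = demo.groupby("A")["C"].sum()

print(manual)
print(builtin)
```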

 df=pd.DataFrame({'A':['foo','bar','foo','bar','foo','bar','foo','foo'],

                  'B':['one','one','two','three','two','two','one','three'],

                  'C':np.random.randn(8),

                  'D':np.random.randn(8) })

 print(df)

 #      A      B         C         D

 # 0  foo    one  0.792610  0.153922

 # 1  bar    one  1.497661  0.548711

 # 2  foo    two  0.038679  1.100214

 # 3  bar  three -1.074874  0.238335

 # 4  foo    two  1.176477  1.260415

 # 5  bar    two -0.629367 -1.098556

 # 6  foo    one  0.015918 -1.646855

 # 7  foo  three -0.486434 -0.930165

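 # 1. Group and then apply sum to each group (the numeric columns C and D are selected explicitly, since pandas 2.0 no longer drops the string column B automatically):

 dfg=df.groupby('A')[['C','D']].sum()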

 print(dfg)

 #            C         D

 # A

 # bar -0.20658 -0.311509

 # foo  1.53725 -0.062469

 # 2. Group by multiple columns to form a hierarchical index, then apply the function:

 dfg2=df.groupby(['A','B']).sum()

 print(dfg2)

 #                   C         D

 # A   B

 # bar one    1.497661  0.548711

 #     three -1.074874  0.238335

 #     two   -0.629367 -1.098556

 # foo one    0.808528 -1.492933

 #     three -0.486434 -0.930165

 #     two    1.215156  2.360629

 # VIII. Reshaping

 # 1. Stack

 tuples=list(zip(*[['bar','bar','baz','baz','foo','foo','quz','quz'],

                   ['one','two','one','two','one','two','one','two']]))

 index=pd.MultiIndex.from_tuples(tuples,names=['first','second'])

 df=pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])

 df2=df[:4]

 print(df2)

 #                      A         B

 # first second

 # bar   one     1.146806  0.413660

 #       two    -0.241280 -0.756498

 # baz   one    -0.429149 -1.598932

 #       two     0.103805 -2.092773

 stacked=df2.stack()

 print(stacked)

 # first  second

 # bar    one     A   -0.671894

 #                B    0.488440

 #        two     A   -0.085894

 #                B   -0.888060

 # baz    one     A   -0.647487

 #                B   -1.573074

 #        two     A    0.084324

 #                B   -0.216785

 # dtype: float64

 stacked0=stacked.unstack()

 print(stacked0)

 #                      A         B

 # first second

 # bar   one    -2.281352  0.683124

 #       two    -2.555841  0.020481

 # baz   one     1.007699 -0.605463

 #       two     1.177308  0.833826

 stacked1=stacked.unstack(1)

 print(stacked1)

 # second        one       two

 # first

 # bar   A -2.281352 -2.555841

 #       B  0.683124  0.020481

 # baz   A  1.007699  1.177308

 #       B -0.605463  0.833826

 stacked2=stacked.unstack(0)

 print(stacked2)

 # first          bar       baz

 # second

 # one    A -0.279379  0.011654

 #        B  0.713347  0.482510

 # two    A -0.980093  0.536366

 #        B -0.378279 -1.023949

 # 2. Pivot tables

 df=pd.DataFrame({'A':['one','one','two','three']*3,

                  'B':['A','B','C']*4,

                  'C':['foo','foo','foo','bar','bar','bar']*2,

                  'D':np.random.randn(12),

                  'E':np.random.randn(12) })

 print(df)

 #         A  B    C         D         E

 # 0     one  A  foo -1.037929 -0.967839

 # 1     one  B  foo  0.143201  1.936801

 # 2     two  C  foo -1.108452  1.350176

 # 3   three  A  bar  0.696497  0.578974

 # 4     one  B  bar -1.206393  1.218049

 # 5     one  C  bar -0.814728  0.440277

 # 6     two  A  foo -2.039865 -1.298114

 # 7   three  B  foo -0.155810 -0.249138

 # 8     one  C  foo -0.436593  0.548266

 # 9     one  A  bar -2.236853 -1.218478

 # 10    two  B  bar -0.542738 -1.018322

 # 11  three  C  bar -0.657995 -0.772053

 # A pivot table can be generated from this data easily:

 pdtable=pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])

 print(pdtable)

 # C             bar       foo

 # A     B

 # one   A  0.878124  0.739554

 #       B  1.508778 -0.261956

 #       C  0.452780  0.850025

 # three A -0.616593       NaN

 #       B       NaN -0.924248

 #       C -0.778909       NaN

 # two   A       NaN -0.249317

 #       B  0.341066       NaN

 #       C       NaN  0.706030

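 # IX. Time series

 # pandas has simple, powerful, and efficient functionality for resampling during frequency conversion (for example, converting data sampled every second into 5-minute data). This is very common in financial applications.

 # rng=pd.date_range('1/1/2018',periods=100,freq='s')

 # ts=pd.Series(np.random.randint(0,500,len(rng)),index=rng)

 # ts0=ts.resample('5min').sum()  # the how= argument was removed from resample; call the aggregation method instead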

 # ........

 # ........
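
 # A minimal runnable sketch of the resampling idea, assuming hypothetical data (ten minutes of per-second samples, each equal to 1): downsampling to 5-minute bins and summing should give one total per bin.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 600 one-second samples, all equal to 1.
rng = pd.date_range("2018-01-01", periods=600, freq="s")
per_second = pd.Series(np.ones(600, dtype=int), index=rng)

# Downsample to 5-minute bins and sum the samples inside each bin.
per_five_minutes = per_second.resample("5min").sum()

print(per_five_minutes)
```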

 # X. Categoricals

 # Since version 0.15, pandas can include Categorical data in a DataFrame.

 # 1. Convert the raw grades to a Categorical data type:

 # ........

 # ........
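
 # The original code for this step is elided. A minimal sketch of one way to do the conversion, assuming a hypothetical frame named grades with a raw_grade column (the data and the renamed labels are illustrative assumptions, not the author's):

```python
import pandas as pd

# Hypothetical data: six records with a raw letter grade.
grades = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "raw_grade": ["a", "b", "b", "a", "a", "e"],
})

# Convert the raw grades to the category dtype.
grades["grade"] = grades["raw_grade"].astype("category")

# Give the categories more meaningful names.
grades["grade"] = grades["grade"].cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"}
)

print(grades["grade"].value_counts())
```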

 # XI. Plotting

 ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2018',periods=1000))

 ts=ts.cumsum()

 ts.plot()

 # ........

 # ........

 # XII. Reading and writing data

 # (1) CSV

 # 1. Write to a CSV file:

 df.to_csv('foo.csv')

 # 2. Read from a CSV file:

 pd.read_csv('foo.csv')
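
 # One detail of this round trip: to_csv writes the index as an unnamed first column by default, so a plain read_csv brings it back as an extra column. A minimal sketch, assuming a hypothetical frame named demo and an in-memory buffer in place of foo.csv:

```python
import io

import pandas as pd

# Hypothetical frame with a default integer index.
demo = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = demo.to_csv()

# Reading it back naively keeps the written index as an extra column.
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells read_csv to use that first column as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```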

 # (2) HDF5

 # 1.

 # ........

 # ........

 # (3) Excel

 # 1. Write to an Excel file:

 df.to_excel('foo.xlsx',sheet_name='Sheet1')

 # 2. Read from an Excel file:

 pd.read_excel('foo.xlsx','Sheet1',index_col=None,na_values=['NA'])

