Pandas is a Python library designed for data wrangling with "labeled" and "relational" data. It provides two primary data structures: Series and DataFrame.

    In [1]: import numpy as np
    In [2]: import pandas as pd
    In [3]: from pandas import Series, DataFrame
    In [4]: import matplotlib.pyplot as plt

This article walks through the main steps of data wrangling:

1. Data sources

1) Loading data

2) Random sampling

2. Data cleaning

0) Summary statistics (used throughout the process)

1) Handling missing values

2) Hierarchical indexing

3) Database-style operations (insert, delete, update, select, join)

4) Discretization and binning

5) Renaming axis indexes

3. Data transformation

1) Grouping

2) Pivot tables

3) Data visualization

Data Sources

1. Loading data

pandas provides several functions for reading tabular data into a DataFrame object; the most commonly used are read_csv and read_table. Their main parameters are:

Parameter        Description
path             String indicating a file location, URL, or file-like object
sep / delimiter  String or regular expression used to split the fields of each row
header           Row number to use as the column names
index_col        Column number(s) or name(s) to use as the row index
skiprows         List of row numbers to skip (starting from 0)
na_values        Set of values to be treated as NaN
converters       Dict mapping column numbers/names to conversion functions
chunksize        Size of the file chunks for iteration

Example:

    In [2]: result = pd.read_table(r'C:\Users\HP\Desktop\SEC-DEBIT_0804.txt', sep='\s+')

    In [3]: result
    Out[3]:
    SEC-DEBIT HKD0002481145000001320170227SECURITIES BUY ON 23Feb2017
    0 10011142009679 HKD00002192568083002000 NaN NaN NaN
    1 20011142009679 HKD00004154719083002000 NaN NaN NaN
    2 30011142005538 HKD00000210215083002300 NaN NaN NaN
    3 40011142005538 HKD00000140211083002300 NaN NaN NaN

Going further:

Writing a DataFrame to a file: data.to_csv('*.csv')

Writing a Series to a file: data.to_csv('*.csv')

Reading a Series from a file: Series.from_csv('*.csv')
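As a quick sanity check of that round trip, the sketch below writes a small DataFrame out and reads it back. Note that Series.from_csv is deprecated in newer pandas, where pd.read_csv covers both cases; the file name out.csv is just a placeholder:

    import pandas as pd
    from pandas import DataFrame

    data = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    data.to_csv('out.csv')                       # the index is written as the first column
    back = pd.read_csv('out.csv', index_col=0)   # restore the index when reading back
    print(back.equals(data))                     # True if the round trip was lossless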

2. Random sampling

The numpy.random.permutation function can be used to randomly reorder the rows of a Series or DataFrame:

    In [18]: df = DataFrame(np.arange(20).reshape(5,4))
    In [19]: df
    Out[19]:
    0 1 2 3
    0 0 1 2 3
    1 4 5 6 7
    2 8 9 10 11
    3 12 13 14 15
    4 16 17 18 19

    In [20]: sample = np.random.permutation(5)
    In [21]: sample
    Out[21]: array([0, 1, 4, 2, 3])

    In [22]: df.take(sample)
    Out[22]:
    0 1 2 3
    0 0 1 2 3
    1 4 5 6 7
    4 16 17 18 19
    2 8 9 10 11
    3 12 13 14 15

    In [25]: df.take(np.random.permutation(5)[:3])
    Out[25]:
    0 1 2 3
    2 8 9 10 11
    4 16 17 18 19
    3 12 13 14 15
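Newer pandas versions (0.16.1 and later) also provide a built-in sampler, so the permutation-plus-take combination can be collapsed into a single call. A minimal sketch, assuming such a version is installed:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame(np.arange(20).reshape(5, 4))
    print(df.sample(n=3))                 # draw three rows without replacement
    print(df.sample(frac=1))              # return all rows in random order (a full shuffle)
    print(df.sample(n=10, replace=True))  # with replacement, n may exceed len(df)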

Data Cleaning

0. Summary statistics

    In [31]: df = DataFrame({'A':np.random.randn(5),'B':np.random.randn(5)})

    In [32]: df
    Out[32]:
    A B
    0 -0.635732 0.738902
    1 -1.100320 0.910203
    2 1.503987 -2.030411
    3 0.548760 0.228552
    4 -2.201917 1.676173

    In [33]: df.count() # number of non-NA values
    Out[33]:
    A 5
    B 5
    dtype: int64
    In [34]: df.min() # minimum
    Out[34]:
    A -2.201917
    B -2.030411
    dtype: float64
    In [35]: df.max() # maximum
    Out[35]:
    A 1.503987
    B 1.676173
    dtype: float64
    In [36]: df.idxmin() # index of the minimum
    Out[36]:
    A 4
    B 2
    dtype: int64
    In [37]: df.idxmax() # index of the maximum
    Out[37]:
    A 2
    B 4
    dtype: int64
    In [38]: df.sum() # sum
    Out[38]:
    A -1.885221
    B 1.523419
    dtype: float64
    In [39]: df.mean() # mean
    Out[39]:
    A -0.377044
    B 0.304684
    dtype: float64
    In [40]: df.median() # median
    Out[40]:
    A -0.635732
    B 0.738902
    dtype: float64
    In [41]: df.mode() # mode
    Out[41]:
    Empty DataFrame
    Columns: [A, B]
    Index: []
    In [42]: df.var() # variance
    Out[42]:
    A 2.078900
    B 1.973661
    dtype: float64
    In [43]: df.std() # standard deviation
    Out[43]:
    A 1.441839
    B 1.404871
    dtype: float64
    In [44]: df.mad() # mean absolute deviation
    Out[44]:
    A 1.122734
    B 0.964491
    dtype: float64
    In [45]: df.skew() # skewness
    Out[45]:
    A 0.135719
    B -1.480080
    dtype: float64
    In [46]: df.kurt() # kurtosis
    Out[46]:
    A -0.878539
    B 2.730675
    dtype: float64
    In [48]: df.quantile(0.25) # 25th percentile
    Out[48]:
    A -1.100320
    B 0.228552
    dtype: float64
    In [49]: df.describe() # descriptive statistics
    Out[49]:
    A B
    count 5.000000 5.000000
    mean -0.377044 0.304684
    std 1.441839 1.404871
    min -2.201917 -2.030411
    25% -1.100320 0.228552
    50% -0.635732 0.738902
    75% 0.548760 0.910203
    max 1.503987 1.676173
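When several of these statistics are needed at once, they can be collected in a single pass rather than call by call. A short sketch, assuming pandas 0.20 or later, where DataFrame.agg is available:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame({'A': np.random.randn(5), 'B': np.random.randn(5)})
    # agg takes a list of function names and returns one row per statistic
    print(df.agg(['count', 'mean', 'std', 'min', 'median', 'max']))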

1. Handling missing values

    In [50]: string = Series(['apple','banana','pear',np.nan,'grape'])

    In [51]: string
    Out[51]:
    0 apple
    1 banana
    2 pear
    3 NaN
    4 grape
    dtype: object

    In [52]: string.isnull() # test for missing values
    Out[52]:
    0 False
    1 False
    2 False
    3 True
    4 False
    dtype: bool

    In [53]: string.dropna() # filter out missing values; by default anything containing NaN is dropped
    Out[53]:
    0 apple
    1 banana
    2 pear
    4 grape
    dtype: object

    In [54]: string.fillna(0) # fill missing values
    Out[54]:
    0 apple
    1 banana
    2 pear
    3 0
    4 grape
    dtype: object

    In [55]: string.ffill() # forward fill
    Out[55]:
    0 apple
    1 banana
    2 pear
    3 pear
    4 grape
    dtype: object

    In [56]: data = DataFrame([[1. ,6.5,3],[1. ,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,7,9]]) # the same operations work on a DataFrame
    In [57]: data
    Out[57]:
    0 1 2
    0 1.0 6.5 3.0
    1 1.0 NaN NaN
    2 NaN NaN NaN
    3 NaN 7.0 9.0
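The block above builds data but stops before filtering it. On a DataFrame, dropna and fillna accept a few extra options worth knowing; the following is a minimal sketch of them on the same frame:

    import numpy as np
    from pandas import DataFrame

    data = DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 7., 9.]])
    print(data.dropna())                 # drop every row containing a NaN; only row 0 survives
    print(data.dropna(how='all'))        # drop only rows that are entirely NaN; row 2 goes
    print(data.fillna({1: 0.5, 2: 0}))   # fill column 1 with 0.5 and column 2 with 0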

2. Hierarchical indexing

    In [6]: data = Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

    In [7]: data
    Out[7]:
    a 1 0.386697
    2 0.822063
    3 0.338441
    b 1 0.017249
    2 0.880122
    3 0.296465
    c 1 0.376104
    2 -1.309419
    d 2 0.512754
    3 0.223535
    dtype: float64

    In [8]: data.index
    Out[8]:
    MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
    labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

    In [10]: data['b':'c']
    Out[10]:
    b 1 0.017249
    2 0.880122
    3 0.296465
    c 1 0.376104
    2 -1.309419
    dtype: float64

    In [11]: data[:,2]
    Out[11]:
    a 0.822063
    b 0.880122
    c -1.309419
    d 0.512754
    dtype: float64

    In [12]: data.unstack()
    Out[12]:
    1 2 3
    a 0.386697 0.822063 0.338441
    b 0.017249 0.880122 0.296465
    c 0.376104 -1.309419 NaN
    d NaN 0.512754 0.223535

    In [13]: data.unstack().stack()
    Out[13]:
    a 1 0.386697
    2 0.822063
    3 0.338441
    b 1 0.017249
    2 0.880122
    3 0.296465
    c 1 0.376104
    2 -1.309419
    d 2 0.512754
    3 0.223535
    dtype: float64

    In [14]: df = DataFrame(np.arange(12).reshape(4,3),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

    In [15]: df
    Out[15]:
    Ohio Colorado
    Green Red Green
    a 1 0 1 2
    2 3 4 5
    b 1 6 7 8
    2 9 10 11

    In [16]: df.index.names = ['key1','key2']
    In [17]: df.columns.names = ['state','color']

    In [18]: df
    Out[18]:
    state Ohio Colorado
    color Green Red Green
    key1 key2
    a 1 0 1 2
    2 3 4 5
    b 1 6 7 8
    2 9 10 11

    In [19]: df['Ohio'] # selecting one column level reduces the dimensionality
    Out[19]:
    color Green Red
    key1 key2
    a 1 0 1
    2 3 4
    b 1 6 7
    2 9 10

    In [20]: df.swaplevel('key1','key2')
    Out[20]:
    state Ohio Colorado
    color Green Red Green
    key2 key1
    1 a 0 1 2
    2 a 3 4 5
    1 b 6 7 8
    2 b 9 10 11

    In [21]: df.sortlevel(1) # sort by level 1, i.e. key2
    Out[21]:
    state Ohio Colorado
    color Green Red Green
    key1 key2
    a 1 0 1 2
    b 1 6 7 8
    a 2 3 4 5
    b 2 9 10 11

    In [22]: df.sortlevel(0) # sort by level 0, i.e. key1
    Out[22]:
    state Ohio Colorado
    color Green Red Green
    key1 key2
    a 1 0 1 2
    2 3 4 5
    b 1 6 7 8
    2 9 10 11
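Hierarchical indexes also make level-wise summaries natural. A short sketch on the same df, using groupby(level=...); the axis=1 form mirrors the column-level grouping shown later in the grouping section:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame(np.arange(12).reshape(4, 3),
                   index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                   columns=[['Ohio', 'Ohio', 'Colorado'],
                            ['Green', 'Red', 'Green']])
    df.index.names = ['key1', 'key2']
    df.columns.names = ['state', 'color']
    print(df.groupby(level='key2').sum())           # rows summed within each key2 value
    print(df.groupby(level='color', axis=1).sum())  # columns summed within each color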

3. SQL-like operations

    In [5]: dic = {'Name':['LiuShunxiang','Zhangshan','ryan'],
    ...: 'Sex':['M','F','F'],
    ...: 'Age':[27,23,24],
    ...: 'Height':[165.7,167.2,154],
    ...: 'Weight':[61,63,41]}

    In [6]: student = pd.DataFrame(dic)
    In [7]: student
    Out[7]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    2 24 154.0 ryan F 41

    In [8]: dic1 = {'Name':['Ann','Joe'],
    ...: 'Sex':['M','F'],
    ...: 'Age':[27,33],
    ...: 'Height':[168,177.2],
    ...: 'Weight':[51,65]}

    In [9]: student1 = pd.DataFrame(dic1)

    In [10]: Student = pd.concat([student,student1]) # insert rows
    In [11]: Student
    Out[11]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    2 24 154.0 ryan F 41
    0 27 168.0 Ann M 51
    1 33 177.2 Joe F 65

    In [14]: pd.DataFrame(Student,columns = ['Age','Height','Name','Sex','Weight','Score']) # add a new column
    Out[14]:
    Age Height Name Sex Weight Score
    0 27 165.7 LiuShunxiang M 61 NaN
    1 23 167.2 Zhangshan F 63 NaN
    2 24 154.0 ryan F 41 NaN
    0 27 168.0 Ann M 51 NaN
    1 33 177.2 Joe F 65 NaN

    In [16]: Student.ix[Student['Name']=='ryan','Height'] = 160 # update a single value
    In [17]: Student
    Out[17]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    2 24 160.0 ryan F 41
    0 27 168.0 Ann M 51
    1 33 177.2 Joe F 65

    In [18]: Student[Student['Height']>160] # filter rows
    Out[18]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    0 27 168.0 Ann M 51
    1 33 177.2 Joe F 65

    In [21]: Student.drop(['Weight'],axis = 1).head() # drop a column
    Out[21]:
    Age Height Name Sex
    0 27 165.7 LiuShunxiang M
    1 23 167.2 Zhangshan F
    2 24 160.0 ryan F
    0 27 168.0 Ann M
    1 33 177.2 Joe F

    In [22]: Student.drop([1,2]) # drop the rows whose index labels are 1 and 2
    Out[22]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    0 27 168.0 Ann M 51

    In [24]: Student.drop(['Age'],axis = 1) # drop the column labeled Age
    Out[24]:
    Height Name Sex Weight
    0 165.7 LiuShunxiang M 61
    1 167.2 Zhangshan F 63
    2 154.0 ryan F 41
    0 168.0 Ann M 51
    1 177.2 Joe F 65

    In [26]: Student.groupby('Sex').agg([np.mean,np.median]) # equivalent to SELECT ... FROM ... GROUP BY ...
    Out[26]:
    Age Height Weight
    mean median mean median mean median
    Sex
    F 26.666667 24 168.133333 167.20 56.333333 63
    M 27.000000 27 166.850000 166.85 56.000000 56

    In [27]: series = pd.Series(np.random.randint(1,20,5)) # sorting
    In [28]: series
    Out[28]:
    0 9
    1 17
    2 17
    3 13
    4 15
    dtype: int32

    In [29]: series.order() # ascending by default
    C:/Anaconda2/Scripts/ipython-script.py:1: FutureWarning: order is deprecated, use sort_values(...)
    if __name__ == '__main__':
    Out[29]:
    0 9
    3 13
    4 15
    1 17
    2 17
    dtype: int32

    In [30]: series.order(ascending = False) # descending
    C:/Anaconda2/Scripts/ipython-script.py:1: FutureWarning: order is deprecated, use sort_values(...)
    if __name__ == '__main__':
    Out[30]:
    2 17
    1 17
    4 15
    3 13
    0 9
    dtype: int32

    In [31]: Student.sort_values(by = ['Height']) # sort by value
    Out[31]:
    Age Height Name Sex Weight
    2 24 160.0 ryan F 41
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    0 27 168.0 Ann M 51
    1 33 177.2 Joe F 65

    In [32]: dict2 = {'Name':['ryan','LiuShunxiang','Zhangshan','Ann','Joe'],
    ...: 'Score':[89,90,78,60,53]}

    In [33]: Score = pd.DataFrame(dict2)
    In [34]: Score
    Out[34]:
    Name Score
    0 ryan 89
    1 LiuShunxiang 90
    2 Zhangshan 78
    3 Ann 60
    4 Joe 53

    In [35]: stu_score = pd.merge(Student,Score,on = 'Name') # table join
    In [36]: stu_score
    Out[36]:
    Age Height Name Sex Weight Score
    0 27 165.7 LiuShunxiang M 61 90
    1 23 167.2 Zhangshan F 63 78
    2 24 160.0 ryan F 41 89
    3 27 168.0 Ann M 51 60
    4 33 177.2 Joe F 65 53

Note: if student1 is built from a dict whose column name does not match (here 'name' instead of 'Name'), concat cannot align the two columns and fills NaN instead, so the result differs:

    In [71]: student1 = DataFrame({'name':['Ann','Joe'],'Sex':['M','F'],'Age':[27,33],'Height':[168,177.2],'Weight':[51,65]})
    In [72]: student
    Out[72]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    2 24 154.0 ryan F 41

    In [73]: student1
    Out[73]:
    Age Height Sex Weight name
    0 27 168.0 M 51 Ann
    1 33 177.2 F 65 Joe

    In [74]: Student = pd.concat([student,student1])
    In [75]: Student
    Out[75]:
    Age Height Name Sex Weight name
    0 27 165.7 LiuShunxiang M 61 NaN
    1 23 167.2 Zhangshan F 63 NaN
    2 24 154.0 ryan F 41 NaN
    0 27 168.0 NaN M 51 Ann
    1 33 177.2 NaN F 65 Joe

Going further on table joins, the parameters of the merge function are:

Parameter    Description
left         DataFrame to merge on the left side
right        DataFrame to merge on the right side
how          One of 'inner', 'outer', 'left', 'right'; defaults to 'inner'
on           Column name(s) to join on
left_on      Column(s) in the left DataFrame to use as join keys
right_on     Column(s) in the right DataFrame to use as join keys
left_index   Use the left DataFrame's row index as its join key
right_index  Use the right DataFrame's row index as its join key
sort         Sort the merged result by the join keys

Examples:

    In [5]: df1 = DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)})
    In [6]: df1
    Out[6]:
    data1 key
    0 0 b
    1 1 b
    2 2 a
    3 3 c
    4 4 a
    5 5 a
    6 6 b

    In [7]: df2 = DataFrame({'key':['a','b','d'],'data2':range(3)})
    In [8]: df2
    Out[8]:
    data2 key
    0 0 a
    1 1 b
    2 2 d

    In [9]: pd.merge(df1,df2) # inner join by default: only the keys present in both frames (a and b) are merged
    Out[9]:
    data1 key data2
    0 0 b 1
    1 1 b 1
    2 6 b 1
    3 2 a 0
    4 4 a 0
    5 5 a 0

    In [10]: df3 = DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
    In [11]: df4 = DataFrame({'rkey':['a','b','d'],'data2':range(3)})

    In [12]: df3
    Out[12]:
    data1 lkey
    0 0 b
    1 1 b
    2 2 a
    3 3 c
    4 4 a
    5 5 a
    6 6 b

    In [13]: df4
    Out[13]:
    data2 rkey
    0 0 a
    1 1 b
    2 2 d

    In [14]: print pd.merge(df3,df4,left_on = 'lkey',right_on = 'rkey')
    data1 lkey data2 rkey
    0 0 b 1 b
    1 1 b 1 b
    2 6 b 1 b
    3 2 a 0 a
    4 4 a 0 a
    5 5 a 0 a

    In [15]: print pd.merge(df3,df4,left_on = 'lkey',right_on = 'data2')
    Empty DataFrame
    Columns: [data1, lkey, data2, rkey]
    Index: []

    In [16]: print pd.merge(df1,df2,how = 'outer')
    data1 key data2
    0 0.0 b 1.0
    1 1.0 b 1.0
    2 6.0 b 1.0
    3 2.0 a 0.0
    4 4.0 a 0.0
    5 5.0 a 0.0
    6 3.0 c NaN
    7 NaN d 2.0

    In [17]: df5 = DataFrame({'key':list('bbacab'),'data1':range(6)})
    In [18]: df6 = DataFrame({'key':list('ababd'),'data2':range(5)})
    In [19]: df5
    Out[19]:
    data1 key
    0 0 b
    1 1 b
    2 2 a
    3 3 c
    4 4 a
    5 5 b

    In [20]: df6
    Out[20]:
    data2 key
    0 0 a
    1 1 b
    2 2 a
    3 3 b
    4 4 d

    In [21]: print pd.merge(df5,df6,on = 'key',how = 'left')
    data1 key data2
    0 0 b 1.0
    1 0 b 3.0
    2 1 b 1.0
    3 1 b 3.0
    4 2 a 0.0
    5 2 a 2.0
    6 3 c NaN
    7 4 a 0.0
    8 4 a 2.0
    9 5 b 1.0
    10 5 b 3.0

    In [22]: left = DataFrame({'key1':['foo','foo','bar'],'key2':['one','two','one'],'lval':[1,2,3]})
    In [23]: right = DataFrame({'key1':['foo','foo','bar','bar'],'key2':['one','one','one','two'],'rval':[4,5,6,7]})
    In [24]: left
    Out[24]:
    key1 key2 lval
    0 foo one 1
    1 foo two 2
    2 bar one 3

    In [25]: right
    Out[25]:
    key1 key2 rval
    0 foo one 4
    1 foo one 5
    2 bar one 6
    3 bar two 7

    In [26]: print pd.merge(left,right,on = ['key1','key2'],how = 'outer')
    key1 key2 lval rval
    0 foo one 1.0 4.0
    1 foo one 1.0 5.0
    2 foo two 2.0 NaN
    3 bar one 3.0 6.0
    4 bar two NaN 7.0

    In [27]: print pd.merge(left,right,on = 'key1')
    key1 key2_x lval key2_y rval
    0 foo one 1 one 4
    1 foo one 1 one 5
    2 foo two 2 one 4
    3 foo two 2 one 5
    4 bar one 3 one 6
    5 bar one 3 two 7

    In [28]: print pd.merge(left,right,on = 'key1',suffixes = ('_left','_right'))
    key1 key2_left lval key2_right rval
    0 foo one 1 one 4
    1 foo one 1 one 5
    2 foo two 2 one 4
    3 foo two 2 one 5
    4 bar one 3 one 6
    5 bar one 3 two 7
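The parameter table above also lists left_index and right_index, which none of the examples exercise. A minimal sketch of joining a column against an index; the frames left1 and right1 are made up for illustration:

    import pandas as pd
    from pandas import DataFrame

    left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
    right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
    # the left frame joins on its 'key' column, the right frame on its row index
    print(pd.merge(left1, right1, left_on='key', right_index=True))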

4. Discretization and binning

    In [17]: age = [20,22,25,27,21,23,37,31,61,45,41,32]
    In [18]: bins = [18,25,35,60,100]

    In [19]: cats = pd.cut(age,bins)
    In [20]: cats
    Out[20]:
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

    In [26]: group_names = ['YoungAdult','Adult','MiddleAged','Senior']

    In [27]: pd.cut(age,bins,labels = group_names) # set the bin names
    Out[27]:
    [YoungAdult, YoungAdult, YoungAdult, Adult, YoungAdult, ..., Adult, Senior, MiddleAged, MiddleAged, Adult]
    Length: 12
    Categories (4, object): [YoungAdult < Adult < MiddleAged < Senior]

    In [28]: data = np.random.randn(10)

    In [29]: cats = pd.qcut(data,4) # qcut bins the data by sample quantiles
    In [30]: cats
    Out[30]:
    [(0.268, 0.834], (-0.115, 0.268], (0.268, 0.834], [-1.218, -0.562], (-0.562, -0.115], [-1.218, -0.562], (-0.115, 0.268], [-1.218, -0.562], (0.268, 0.834], (-0.562, -0.115]]
    Categories (4, object): [[-1.218, -0.562] < (-0.562, -0.115] < (-0.115, 0.268] < (0.268, 0.834]]

    In [33]: pd.value_counts(cats)
    Out[33]:
    (0.268, 0.834] 3
    [-1.218, -0.562] 3
    (-0.115, 0.268] 2
    (-0.562, -0.115] 2
    dtype: int64

    In [35]: pd.qcut(data,[0.1,0.5,0.9,1.]) # custom quantiles, values in [0, 1]
    Out[35]:
    [(-0.115, 0.432], (-0.115, 0.432], (0.432, 0.834], NaN, [-0.787, -0.115], [-0.787, -0.115], (-0.115, 0.432], [-0.787, -0.115], (-0.115, 0.432], [-0.787, -0.115]]
    Categories (3, object): [[-0.787, -0.115] < (-0.115, 0.432] < (0.432, 0.834]]
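cut can also compute equal-width bins by itself when given an integer bin count instead of explicit edges. A small sketch, assuming pd.cut's defaults:

    import numpy as np
    import pandas as pd

    data = np.random.rand(20)
    # four bins of equal width spanning the data's range;
    # precision=2 limits the decimal places shown in the bin edges
    cats = pd.cut(data, 4, precision=2)
    print(pd.value_counts(cats))
    # labels=False returns each point's bin number instead of an interval
    print(pd.cut(data, 4, labels=False))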

5. Renaming axis indexes

    In [36]: data = DataFrame(np.arange(12).reshape(3,4),index = ['Ohio','Colorado','New York'],columns = ['one','two','three','four'])

    In [37]: data
    Out[37]:
    one two three four
    Ohio 0 1 2 3
    Colorado 4 5 6 7
    New York 8 9 10 11

    In [38]: data.index = data.index.map(str.upper)
    In [39]: data
    Out[39]:
    one two three four
    OHIO 0 1 2 3
    COLORADO 4 5 6 7
    NEW YORK 8 9 10 11

    In [40]: data.rename(index = str.title,columns=str.upper)
    Out[40]:
    ONE TWO THREE FOUR
    Ohio 0 1 2 3
    Colorado 4 5 6 7
    New York 8 9 10 11

    In [41]: data.rename(index={'OHIO':'INDIANA'},columns={'three':'ryana'}) # update a subset of the axis labels
    Out[41]:
    one two ryana four
    INDIANA 0 1 2 3
    COLORADO 4 5 6 7
    NEW YORK 8 9 10 11
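rename returns a modified copy by default; pass inplace=True to change the frame itself. A short sketch of both behaviours on the same data:

    import numpy as np
    from pandas import DataFrame

    data = DataFrame(np.arange(12).reshape(3, 4),
                     index=['OHIO', 'COLORADO', 'NEW YORK'],
                     columns=['one', 'two', 'three', 'four'])
    renamed = data.rename(index={'OHIO': 'INDIANA'})      # returns a new frame
    print(renamed.index)                                  # data itself is unchanged
    data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies data in place
    print(data.index)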

Data Transformation

1. Grouping

    In [42]: df = DataFrame({'key1':['a','a','b','b','a'],'key2':['one','two','one','two','one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})

    In [43]: df
    Out[43]:
    data1 data2 key1 key2
    0 0.762448 0.816634 a one
    1 1.412613 0.867923 a two
    2 0.899297 -1.049657 b one
    3 0.912080 0.628012 b two
    4 -0.549258 -1.327614 a one

    In [44]: grouped = df['data1'].groupby(df['key1']) # group the data1 column by key1, then compute its mean
    In [45]: grouped
    Out[45]: <pandas.core.groupby.SeriesGroupBy object at 0x00000000073C97F0>

    In [46]: grouped.mean()
    Out[46]:
    key1
    a 0.541935
    b 0.905688
    Name: data1, dtype: float64

    In [48]: df['data1'].groupby([df['key1'],df['key2']]).mean()
    Out[48]:
    key1 key2
    a one 0.106595
    two 1.412613
    b one 0.899297
    two 0.912080
    Name: data1, dtype: float64

    In [49]: df.groupby('key1').mean() # group by column name
    Out[49]:
    data1 data2
    key1
    a 0.541935 0.118981
    b 0.905688 -0.210822

    In [50]: df.groupby(['key1','key2']).mean()
    Out[50]:
    data1 data2
    key1 key2
    a one 0.106595 -0.255490
    two 1.412613 0.867923
    b one 0.899297 -1.049657
    two 0.912080 0.628012

    In [51]: df.groupby('key1')['data1'].mean() # aggregate only a subset of the columns
    Out[51]:
    key1
    a 0.541935
    b 0.905688
    Name: data1, dtype: float64

    In [52]: df.groupby(['key1','key2'])['data1'].mean()
    Out[52]:
    key1 key2
    a one 0.106595
    two 1.412613
    b one 0.899297
    two 0.912080
    Name: data1, dtype: float64

    In [53]: people = DataFrame(np.random.randn(5,5),columns = ['a','b','c','d','e'],index = ['Joe','Steve','Wes','Jim','Travis'])

    In [54]: people
    Out[54]:
                   a         b         c         d         e
    Joe     0.223628 -0.282831  0.368583  0.246665 -0.815742
    Steve   0.662181  0.187961  0.515883 -2.021429 -0.624596
    Wes    -1.009086  0.450082 -0.819855 -1.626971  0.632064
    Jim     1.593881  0.803760 -0.209345 -1.295325 -0.553693
    Travis -0.041911  1.115285 -1.648207  0.521751 -0.414183

    In [55]: mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
    In [56]: map_series = Series(mapping)
    In [57]: map_series
    Out[57]:
    a       red
    b       red
    c      blue
    d      blue
    e       red
    f    orange
    dtype: object

    In [58]: people.groupby(map_series,axis = 1).count() # group by a Series
    Out[58]:
            blue  red
    Joe        2    3
    Steve      2    3
    Wes        2    3
    Jim        2    3
    Travis     2    3

    In [59]: by_columns = people.groupby(mapping,axis =1) # group by a dict
    In [60]: by_columns.sum()
    Out[60]:
                blue       red
    Joe     0.615248 -0.874945
    Steve  -1.505546  0.225546
    Wes    -2.446826  0.073060
    Jim    -1.504670  1.843948
    Travis -1.126456  0.659191

    In [61]: people.groupby(len).sum() # group by a function
    Out[61]:
              a         b         c         d         e
    3  0.808423  0.971012 -0.660617 -2.675632 -0.737371
    5  0.662181  0.187961  0.515883 -2.021429 -0.624596
    6 -0.041911  1.115285 -1.648207  0.521751 -0.414183

    In [63]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names = ['city','tennor'])
    In [65]: df1 = DataFrame(np.random.randn(4,5),columns = columns)
    In [66]: df1
    Out[66]:
    city          US                            JP
    tennor         1         3         5         1         3
    0       1.103548  1.087425  0.717741 -0.354419  1.294512
    1      -0.247544 -1.247665  1.340309  1.337957  0.528693
    2       2.168903 -0.124958  0.367158  0.478355 -0.828126
    3      -0.078540 -3.062132 -2.095675 -0.879590 -0.020314

    In [67]: df1.groupby(level = 'city',axis = 1).count() # group by index level
    Out[67]:
    city  JP  US
    0      2   3
    1      2   3
    2      2   3
    3      2   3
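Beyond the built-in mean, sum, and count, groupby can apply custom aggregations through agg, which is the aggregation step promised in the outline. A minimal sketch with a made-up peak-to-peak function, on a frame shaped like the df above:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'key2': ['one', 'two', 'one', 'two', 'one'],
                    'data1': np.random.randn(5),
                    'data2': np.random.randn(5)})

    def peak_to_peak(arr):
        # custom aggregation: spread between the max and min within each group
        return arr.max() - arr.min()

    grouped = df.groupby('key1')
    print(grouped['data1'].agg(peak_to_peak))
    # several aggregations at once: one output column per function
    print(grouped['data1'].agg([peak_to_peak, np.mean]))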

2. Pivot tables
pandas provides the pivot_table() function for building spreadsheet-style pivot tables. Its parameters are:

Parameter   Description
data        DataFrame to pivot
values      Column(s) to aggregate
index       Column(s) whose values become the row index
columns     Column(s) whose values become the column index
aggfunc     Aggregation function (mean by default)
fill_value  Constant to replace missing values with; no replacement by default
margins     Add row/column subtotals; False by default
dropna      Drop columns that are entirely NaN; True by default

Example:

    In [68]: dic = {'Name':['LiuShunxiang','Zhangshan','ryan'],
    ...: 'Sex':['M','F','F'],
    ...: 'Age':[27,23,24],
    ...: 'Height':[165.7,167.2,154],
    ...: 'Weight':[61,63,41]}

    In [69]: student = pd.DataFrame(dic)
    In [70]: student
    Out[70]:
    Age Height Name Sex Weight
    0 27 165.7 LiuShunxiang M 61
    1 23 167.2 Zhangshan F 63
    2 24 154.0 ryan F 41

    In [71]: pd.pivot_table(student,values = ['Height'],columns = ['Sex']) # 'Height' is the value being aggregated, 'Sex' the grouping variable
    Out[71]:
    Sex F M
    Height 160.6 165.7

    In [72]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex','Age'])
    Out[72]:
    Sex Age
    Height F 23 167.2
    24 154.0
    M 27 165.7
    Weight F 23 63.0
    24 41.0
    M 27 61.0
    dtype: float64

    In [73]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex','Age']).unstack()
    Out[73]:
    Age 23 24 27
    Sex
    Height F 167.2 154.0 NaN
    M NaN NaN 165.7
    Weight F 63.0 41.0 NaN
    M NaN NaN 61.0

    In [74]: pd.pivot_table(student,values = ['Height','Weight'],columns = ['Sex'],aggfunc = [np.mean,np.median,np.std])
    Out[74]:
    mean median std
    Sex F M F M F M
    Height 160.6 165.7 160.6 165.7 9.333810 NaN
    Weight 52.0 61.0 52.0 61.0 15.556349 NaN
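The examples above pass everything through columns=; combining index=, aggfunc=, and margins= gives the classic spreadsheet layout with an 'All' subtotal. A small sketch on the same student frame:

    import numpy as np
    import pandas as pd

    student = pd.DataFrame({'Name': ['LiuShunxiang', 'Zhangshan', 'ryan'],
                            'Sex': ['M', 'F', 'F'],
                            'Age': [27, 23, 24],
                            'Height': [165.7, 167.2, 154],
                            'Weight': [61, 63, 41]})
    # one row per Sex, mean Height/Weight per group, plus an 'All' subtotal row
    print(pd.pivot_table(student, values=['Height', 'Weight'],
                         index=['Sex'], aggfunc=np.mean, margins=True))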

3. Data visualization

plot parameter reference

Series.plot() parameters:

Parameter  Description
label      Label to use in the plot legend
ax         matplotlib subplot object to draw on
style      Style string, such as 'k--'
alpha      Opacity of the plot fill
kind       One of 'line', 'bar', 'barh', 'kde'
xticks     Values to use for the X-axis ticks
yticks     Values to use for the Y-axis ticks
xlim       X-axis limits
ylim       Y-axis limits

Additional DataFrame.plot() parameters:

Parameter     Description
subplots      Draw each DataFrame column in a separate subplot
sharex        If subplots=True, share the X axis, including ticks and limits
sharey        If subplots=True, share the Y axis, including ticks and limits
figsize       Tuple giving the figure size
title         Figure title string
legend        Add a subplot legend; True by default
sort_columns  Plot the columns in alphabetical order; by default the current column order is used

1) Line plots

    In [76]: s = Series(np.random.randn(10).cumsum(),index = np.arange(0,100,10))
    In [77]: s.plot()

    In [78]: df = DataFrame(np.random.randn(10,4).cumsum(0),columns = ['A','B','C','D'],index = np.arange(0,100,10))
    In [79]: df.plot()

2) Bar plots

    In [80]: fig,axes = plt.subplots(2,1)
    In [81]: data = Series(np.random.rand(16),index=list('abcdefghijklmnop'))

    In [82]: data.plot(kind = 'bar',ax = axes[0],color = 'k',alpha = 0.7)
    In [83]: data.plot(kind = 'barh',ax = axes[1],color = 'k',alpha = 0.7)

    In [84]: df = DataFrame(np.random.rand(6,4),index = ['one','two','three','four','five','six'],columns = pd.Index(['A','B','C','D'],name = 'Genus'))
    In [85]: df.plot(kind = 'bar')
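Stacked bars only need the stacked=True flag on top of the call above. A minimal sketch on the same kind of frame:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame(np.random.rand(6, 4),
                      index=['one', 'two', 'three', 'four', 'five', 'six'],
                      columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
    df.plot(kind='barh', stacked=True, alpha=0.5)  # stack the four genera within each bar
    plt.show()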

3) Density plots

    In [87]: comp1 = np.random.normal(0,1,size = 100)
    In [88]: comp2 = np.random.normal(10,2,size = 100)

    In [89]: values = Series(np.concatenate([comp1,comp2]))
    In [90]: values.hist(bins = 50,alpha = 0.3,color = 'r',normed = True)
    In [91]: values.plot(kind = 'kde',style = 'k--')

4) Scatter plots

    In [7]: import tushare as ts
    In [8]: data = ts.get_hist_data('',start='2017-08-15')
    In [9]: pieces = data[['close', 'price_change', 'ma20','volume', 'v_ma20', 'turnover']]
    In [10]: pd.scatter_matrix(pieces)

5) Heat maps

    In [11]: cov = np.corrcoef(pieces.T)
    In [12]: img = plt.matshow(cov,cmap=plt.cm.summer)
    In [13]: plt.colorbar(img, ticks=[-1,0,1])
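Note that np.corrcoef here returns a correlation matrix, and matshow leaves the axes labeled 0 through 5; mapping the ticks back to the column names makes the heat map readable. A hedged sketch, where the random frame below merely stands in for the pieces built in the scatter-plot step:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # stand-in for `pieces`: any numeric DataFrame works the same way
    pieces = pd.DataFrame(np.random.randn(100, 4),
                          columns=['close', 'price_change', 'ma20', 'volume'])
    corr = np.corrcoef(pieces.T)                 # correlation matrix of the columns
    img = plt.matshow(corr, cmap=plt.cm.summer)
    plt.colorbar(img, ticks=[-1, 0, 1])
    ticks = range(len(pieces.columns))
    plt.xticks(ticks, pieces.columns, rotation=90)  # label ticks with column names
    plt.yticks(ticks, pieces.columns)
    plt.show()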
