Python Data Analysis and Presentation: Week 3 Study Notes (Song Tian et al., Beijing Institute of Technology)

The introductory part of the course is almost over.

1. The Pandas library

import pandas as pd

Two data types: Series and DataFrame

Series type: data + index

Custom index

b = pd.Series([9,8,7,6],index=['a','b','c','d'])

b

Out[3]:

a    9

b    8

c    7

d    6

dtype: int64

Creating from a scalar value

s = pd.Series(25,index=['a','b','c'])  # an index is required here; it determines the length

s

Out[7]:

a    25

b    25

c    25

dtype: int64

Creating from a dict

d = pd.Series({'a':9,'b':8,'c':7})

d

Out[9]:

a    9

b    8

c    7

dtype: int64

Creating from an ndarray

import numpy as np

n = pd.Series(np.arange(5))

n

Out[12]:

0    0

1    1

2    2

3    3

4    4

dtype: int32

Basic operations

b = pd.Series([9,8,7,6],['a','b','c','d'])

b

Out[14]:

a    9

b    8

c    7

d    6

dtype: int64

b.index

Out[15]: Index(['a', 'b', 'c', 'd'], dtype='object')

b.values

Out[17]: array([9, 8, 7, 6], dtype=int64)

b.get('d',100)

Out[18]: 6

Both a Series object and its index can have a name, stored in the .name attribute
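A minimal sketch of the .name attribute, reusing the Series b from above. The two name strings are arbitrary labels chosen for illustration, not values from the course.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

b.name = 'Series object'        # name of the Series itself (illustrative label)
b.index.name = 'index column'   # name of its index (illustrative label)

print(b.name, b.index.name)
print(b)  # both names appear in the printed representation
```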

DataFrame type: multiple columns of data sharing the same index

Creating from a 2-D ndarray

import pandas as pd

import numpy as np

A comma typed in place of the dot before reshape raises a NameError:

d = pd.DataFrame(np.arange(10),reshape(2,5))

Traceback (most recent call last):

  File "<ipython-input-3-8f29c41caece>", line 1, in <module>

    d = pd.DataFrame(np.arange(10),reshape(2,5))

NameError: name 'reshape' is not defined

The corrected call chains reshape onto the array:

d = pd.DataFrame(np.arange(10).reshape(2,5))

d

Out[5]:

   0  1  2  3  4

0  0  1  2  3  4

1  5  6  7  8  9

Creating from a dict of one-dimensional objects (here, Series)

dt = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}

d = pd.DataFrame(dt)

d

Out[11]:

   one  two

a  1.0    9

b  2.0    8

c  3.0    7

d  NaN    6

pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])

Out[13]:

   two three

b    8   NaN

c    7   NaN

d    6   NaN

Creating from a dict of lists

d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}

d = pd.DataFrame(d1,index=['a','b','c','d'])

d

Out[16]:

   one  two

a    1    9

b    2    8

c    3    7

d    4    6

Operating on the data types

How can Series and DataFrame objects be changed?

Adding or reordering: reindexing

.reindex()

import pandas as pd

d1 = {'城市':['北京','上海','广州','深圳','沈阳'],

'环比':[101.5,101.2,101.3,102.0,100.1],

'同比':[101.5,101.2,101.3,102.0,100.1],

'定基':[101.5,101.2,101.3,102.0,100.1]}

d = pd.DataFrame(d1,index=[1,2,3,4,5])

d

Out[4]:

      同比  城市     定基     环比

1  101.5  北京  101.5  101.5

2  101.2  上海  101.2  101.2

3  101.3  广州  101.3  101.3

4  102.0  深圳  102.0  102.0

5  100.1  沈阳  100.1  100.1

d = d.reindex(index=[5,4,3,2,1])

d

Out[6]:

      同比  城市     定基     环比

5  100.1  沈阳  100.1  100.1

4  102.0  深圳  102.0  102.0

3  101.3  广州  101.3  101.3

2  101.2  上海  101.2  101.2

1  101.5  北京  101.5  101.5

d = d.reindex(columns=['城市','同比','环比','定基'])

d

Out[8]:

   城市     同比     环比     定基

5  沈阳  100.1  100.1  100.1

4  深圳  102.0  102.0  102.0

3  广州  101.3  101.3  101.3

2  上海  101.2  101.2  101.2

1  北京  101.5  101.5  101.5

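Other parameters:

fill_value: the value used to fill missing positions during reindexing

method: fill method; ffill propagates the previous value forward, bfill fills backward from the next value

limit: maximum number of consecutive positions to fill

copy: default True, which produces a new object; with False, no copy is made when the new and old indexes are equal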
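The parameters above can be sketched on a small Series. The data and the variable names are made up for illustration; the index is deliberately sorted, because method= only works on a monotonic index.

```python
import pandas as pd

# Illustrative data with a sorted (monotonic increasing) index
s = pd.Series([10, 20, 30], index=[1, 3, 5])

# fill_value: put a constant into the newly created positions
filled_const = s.reindex([1, 2, 3, 4, 5], fill_value=0)

# method='ffill': copy the previous label's value forward
filled_fwd = s.reindex([1, 2, 3, 4, 5], method='ffill')

# method='bfill': copy the next label's value backward
filled_bwd = s.reindex([1, 2, 3, 4, 5], method='bfill')

# limit=1: fill at most one consecutive new position, so label 7 stays NaN
limited = s.reindex([1, 2, 3, 4, 5, 6, 7], method='ffill', limit=1)

print(filled_const, filled_fwd, filled_bwd, limited, sep='\n\n')
```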

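Common methods of index types:

.append(idx): concatenates another Index object, producing a new Index object

.difference(idx): computes the set difference, producing a new Index object (older pandas versions called this .diff)

.intersection(idx): computes the intersection

.union(idx): computes the union

.delete(loc): deletes the element at position loc

.insert(loc,e): inserts element e at position loc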
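A short sketch of the Index methods listed above, on two small made-up indexes. Every call returns a new Index, since Index objects are immutable.

```python
import pandas as pd

# Illustrative indexes, not from the course material
idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

appended = idx.append(other)        # concatenation, duplicates are kept
difference = idx.difference(other)  # labels in idx but not in other
common = idx.intersection(other)    # labels present in both
merged = idx.union(other)           # labels present in either
removed = idx.delete(1)             # drop the element at position 1
inserted = idx.insert(1, 'x')       # put 'x' at position 1

for name, value in [('append', appended), ('difference', difference),
                    ('intersection', common), ('union', merged),
                    ('delete', removed), ('insert', inserted)]:
    print(name, list(value))
```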

nc = d.columns.delete(2)

ni = d.index.insert(5,6)

nd = d.reindex(index=ni,columns=nc,method='ffill')

Traceback (most recent call last):

  File "<ipython-input-11-ba08f80a2d41>", line 1, in <module>

    nd = d.reindex(index=ni,columns=nc,method='ffill')

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2831, in reindex

    **kwargs)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2404, in reindex

    fill_value, copy).__finalize__(self)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2772, in _reindex_axes

    fill_value, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2794, in _reindex_columns

    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2833, in reindex

    tolerance=tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2538, in get_indexer

    indexer = self._get_fill_indexer(target, method, limit, tolerance)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2564, in _get_fill_indexer

    limit)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2585, in _get_fill_indexer_searchsorted

    side)

  File "C:\Users\ASUS\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3394, in _searchsorted_monotonic

    raise ValueError('index must be monotonic increasing or decreasing')

ValueError: index must be monotonic increasing or decreasing

ni = d.index.insert(5,0)

nd = d.reindex(index=ni,columns=nc,method='ffill')

Traceback (most recent call last):

  File "<ipython-input-13-ba08f80a2d41>", line 1, in <module>

    nd = d.reindex(index=ni,columns=nc,method='ffill')

ValueError: index must be monotonic increasing or decreasing

nd = d.reindex(index=ni,columns=nc).ffill()

nd

Out[15]:

   城市     同比     定基

5  沈阳  100.1  100.1

4  深圳  102.0  102.0

3  广州  101.3  101.3

2  上海  101.2  101.2

1  北京  101.5  101.5

0  北京  101.5  101.5

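Why the ValueError occurs: reindex with method= requires the existing labels to be monotonic (sorted ascending or descending). The traceback fails inside _reindex_columns, so the culprit is the column labels of d, which are not in sorted order; changing the inserted row label from 6 to 0 therefore makes no difference.

Workaround, as in the code above: reindex without method=, then call .ffill() on the result.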
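The monotonic requirement can be reproduced on a small Series. This is a sketch with made-up data; it shows the error and two possible ways around it, which happen to give the same result here.

```python
import pandas as pd

# Illustrative Series whose index is neither increasing nor decreasing
s = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# reindex with method= refuses to work on a non-monotonic index
try:
    s.reindex([1, 2, 3, 4], method='ffill')
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Option 1: reindex first, then forward-fill along the new row order
two_step = s.reindex([1, 2, 3, 4]).ffill()

# Option 2: sort the index first so that method='ffill' is allowed
sorted_fill = s.sort_index().reindex([1, 2, 3, 4], method='ffill')

print(two_step)
print(sorted_fill)
```

The two options are not equivalent in general: .ffill() fills by position in the reindexed result, while method='ffill' fills by label order in the original index.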

Deleting: drop

a = pd.Series([9,8,7,6],index=['a','b','c','d'])

a

Out[17]:

a    9

b    8

c    7

d    6

dtype: int64

a.drop(['b','c'])

Out[18]:

a    9

d    6

dtype: int64
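drop works on a DataFrame as well. The sketch below uses a made-up frame; rows are dropped by default (axis=0) and columns with axis=1. In both cases a new object is returned and the original is left unchanged.

```python
import numpy as np
import pandas as pd

# Illustrative frame with hypothetical row and column labels
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['a', 'b', 'c', 'd'])

rows_dropped = df.drop('r2')                # axis=0 is the default: drop a row
cols_dropped = df.drop(['b', 'c'], axis=1)  # axis=1: drop columns

print(rows_dropped)
print(cols_dropped)
```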

Arithmetic on pandas data types:

import pandas as pd

import numpy as np

A comma typed in place of the dot before reshape raises a NameError again:

a = pd.DataFrame(np.arange(12),reshape(3,4))

Traceback (most recent call last):

  File "<ipython-input-21-a8c747b1897a>", line 1, in <module>

    a = pd.DataFrame(np.arange(12),reshape(3,4))

NameError: name 'reshape' is not defined

The corrected call:

a = pd.DataFrame(np.arange(12).reshape(3,4))

a

Out[23]:

   0  1   2   3

0  0  1   2   3

1  4  5   6   7

2  8  9  10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

b

Out[25]:

    0   1   2   3   4

0   0   1   2   3   4

1   5   6   7   8   9

2  10  11  12  13  14

3  15  16  17  18  19

a+b

Out[26]:

      0     1     2     3   4

0   0.0   2.0   4.0   6.0 NaN

1   9.0  11.0  13.0  15.0 NaN

2  18.0  20.0  22.0  24.0 NaN

3   NaN   NaN   NaN   NaN NaN

b.add(a,fill_value = 0)

Out[27]:

      0     1     2     3     4

0   0.0   2.0   4.0   6.0   4.0

1   9.0  11.0  13.0  15.0   9.0

2  18.0  20.0  22.0  24.0  14.0

3  15.0  16.0  17.0  18.0  19.0

a.mul(b,fill_value = 0)

Out[28]:

      0     1      2      3    4

0   0.0   1.0    4.0    9.0  0.0

1  20.0  30.0   42.0   56.0  0.0

2  80.0  99.0  120.0  143.0  0.0

3   0.0   0.0    0.0    0.0  0.0

Operations between objects of different dimensions are broadcast:

b = pd.DataFrame(np.arange(20).reshape(4,5))

b

Out[31]:

    0   1   2   3   4

0   0   1   2   3   4

1   5   6   7   8   9

2  10  11  12  13  14

3  15  16  17  18  19

c = pd.Series(np.arange(4))

c

Out[33]:

0    0

1    1

2    2

3    3

dtype: int32

c-10

Out[34]:

0   -10

1    -9

2    -8

3    -7

dtype: int32

b-c

Out[35]:

      0     1     2     3   4

0   0.0   0.0   0.0   0.0 NaN

1   5.0   5.0   5.0   5.0 NaN

2  10.0  10.0  10.0  10.0 NaN

3  15.0  15.0  15.0  15.0 NaN

b.sub(c,axis=0)

Out[36]:

    0   1   2   3   4

0   0   1   2   3   4

1   4   5   6   7   8

2   8   9  10  11  12

3  12  13  14  15  16
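The difference between b-c and b.sub(c,axis=0) comes down to which labels the Series is matched against. A sketch using the same b and c as above; the NumPy comparison at the end is only an illustration of the row-wise case.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# Default: the Series labels 0..3 are matched against the column labels 0..4,
# so column 4 has no partner and becomes NaN
by_columns = b - c

# axis=0: the Series labels 0..3 are matched against the row labels 0..3
by_rows = b.sub(c, axis=0)

# The row-wise result equals NumPy broadcasting with a column vector
expected = b.values - c.values.reshape(4, 1)

print(by_columns)
print(by_rows)
```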

Sorting:

.sort_index() sorts along the specified axis by index label, ascending by default.

.sort_index(axis=0,ascending=True)

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])

b

Out[4]:

    0   1   2   3   4

c   0   1   2   3   4

a   5   6   7   8   9

d  10  11  12  13  14

b  15  16  17  18  19

b.sort_index()

Out[5]:

    0   1   2   3   4

a   5   6   7   8   9

b  15  16  17  18  19

c   0   1   2   3   4

d  10  11  12  13  14

b.sort_index(ascending=False)

Out[6]:

    0   1   2   3   4

d  10  11  12  13  14

c   0   1   2   3   4

b  15  16  17  18  19

a   5   6   7   8   9

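.sort_values() sorts along the specified axis by value, ascending by default

Series.sort_values(axis=0,ascending=True)

DataFrame.sort_values(by,axis=0,ascending=True)

by: a label, or list of labels, on the given axis to sort by

By default, NaN values are placed at the end of the sorted result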
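A sketch of sort_values on a Series containing NaN and on the DataFrame b used in the sort_index example. The Series values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
s = pd.Series([3.0, np.nan, 1.0, 2.0])
asc = s.sort_values()                  # NaN goes last
desc = s.sort_values(ascending=False)  # NaN still goes last

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

# axis=0 (default): reorder the rows by the values in column 2
by_col2 = b.sort_values(2, ascending=False)

# axis=1: reorder the columns by the values in row 'a'
by_row_a = b.sort_values('a', axis=1, ascending=False)

print(asc, desc, by_col2, by_row_a, sep='\n\n')
```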

Basic statistical analysis:

.describe()

a = pd.Series([9,8,7,6])

a

Out[8]:

0    9

1    8

2    7

3    6

dtype: int64

a.describe()

Out[9]:

count    4.000000

mean     7.500000

std      1.290994

min      6.000000

25%      6.750000

50%      7.500000

75%      8.250000

max      9.000000

dtype: float64

a.describe()['count']

Out[10]: 4.0

b.describe()

Out[11]:

               0          1          2          3          4

count   4.000000   4.000000   4.000000   4.000000   4.000000

mean    7.500000   8.500000   9.500000  10.500000  11.500000

std     6.454972   6.454972   6.454972   6.454972   6.454972

min     0.000000   1.000000   2.000000   3.000000   4.000000

25%     3.750000   4.750000   5.750000   6.750000   7.750000

50%     7.500000   8.500000   9.500000  10.500000  11.500000

75%    11.250000  12.250000  13.250000  14.250000  15.250000

max    15.000000  16.000000  17.000000  18.000000  19.000000

b.describe()[2]

Out[12]:

count     4.000000

mean      9.500000

std       6.454972

min       2.000000

25%       5.750000

50%       9.500000

75%      13.250000

max      17.000000

Name: 2, dtype: float64
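Each entry of .describe() can also be computed directly. The sketch below checks a few of them on the Series a from above; note that std is the sample standard deviation (ddof=1), which is the pandas default.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6])
summary = a.describe()

mean = a.mean()          # matches summary['mean']
std = a.std()            # sample standard deviation, matches summary['std']
median = a.median()      # matches summary['50%']
q1 = a.quantile(0.25)    # matches summary['25%']

print(summary)
print(mean, std, median, q1)
```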

Cumulative statistics:

.cumsum(): running sum of the first 1, 2, ..., n values

.cumprod(): running product

.cummax(): running maximum

.cummin(): running minimum

b.cumsum()

Out[13]:

    0   1   2   3   4

c   0   1   2   3   4

a   5   7   9  11  13

d  15  18  21  24  27

b  30  34  38  42  46
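The other cumulative methods behave the same way. A sketch on a small made-up Series, where the running values are easy to check by hand:

```python
import pandas as pd

# Illustrative values, not from the course material
s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()    # 3, 4, 8, 9, 14
running_prod = s.cumprod()  # 3, 3, 12, 12, 60
running_max = s.cummax()    # 3, 3, 4, 4, 5
running_min = s.cummin()    # 3, 1, 1, 1, 1

print(pd.DataFrame({'sum': running_sum, 'prod': running_prod,
                    'max': running_max, 'min': running_min}))
```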

Rolling calculations

.rolling(w).sum(): sum of each window of w adjacent elements

.rolling(w).mean(): arithmetic mean

.rolling(w).var(): variance

.rolling(w).std(): standard deviation

.rolling(w).min() .max(): minimum and maximum

b.rolling(2).sum()

Out[14]:

      0     1     2     3     4

c   NaN   NaN   NaN   NaN   NaN

a   5.0   7.0   9.0  11.0  13.0

d  15.0  17.0  19.0  21.0  23.0

b  25.0  27.0  29.0  31.0  33.0
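A sketch of the rolling methods on a small made-up Series with a window of 3. The first w-1 positions have no full window, so they are NaN.

```python
import pandas as pd

# Illustrative values, not from the course material
s = pd.Series([1, 2, 3, 4, 5])

roll_sum = s.rolling(3).sum()    # NaN, NaN, 6, 9, 12
roll_mean = s.rolling(3).mean()  # NaN, NaN, 2, 3, 4
roll_max = s.rolling(3).max()    # NaN, NaN, 3, 4, 5
roll_std = s.rolling(3).std()    # every full window is three consecutive integers

print(pd.DataFrame({'sum': roll_sum, 'mean': roll_mean,
                    'max': roll_max, 'std': roll_std}))
```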
