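The numpy module

Introduction to numpy

Official numpy documentation: https://docs.scipy.org/doc/numpy/reference/?v=20190307135750

numpy is an open-source numerical computing extension library for Python. It is designed to store and process large arrays and matrices, and it does so far more efficiently than Python's own nested lists (which can also be used to represent a matrix).

The numpy library serves two purposes:

Unlike a plain list, it provides array manipulation, array arithmetic, statistical distributions, and simple mathematical models.
It is fast, even faster than Python's built-in operations for simple arithmetic, which is why modules such as pandas and sklearn depend on it. The array operations in higher-level frameworks such as TensorFlow and PyTorch also closely resemble numpy's.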

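Why use numpy

lis1 = [1, 2, 3]

lis2 = [4, 5, 6]

Suppose we want to multiply the elements of these two lists pairwise. Without numpy, we need a for loop that multiplies each pair and collects the results in a new list. With numpy the task looks completely different, so let's take a look at what the module can do.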
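To make the contrast concrete, here is a minimal sketch of both approaches. The names products_loop and products_np are illustrative and do not come from the original notes.

```python
import numpy as np

lis1 = [1, 2, 3]
lis2 = [4, 5, 6]

# Plain Python: walk both lists together and build a new list
products_loop = [a * b for a, b in zip(lis1, lis2)]

# numpy: convert to arrays once, then multiply element-wise in one expression
products_np = np.array(lis1) * np.array(lis2)

print(products_loop)  # [4, 10, 18]
print(products_np)    # [ 4 10 18]
```

Both give the same numbers; the numpy version simply moves the loop into compiled code.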

Creating numpy arrays

A numpy array is an ndarray object. To create one, pass a list to the np.array() function.

import numpy as np

arr = np.array([1, 2, 3])

print(arr)

arr1=np.array([[1,2,3],[4,5,6]])

print(arr1)

arr2=np.array([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])

print(arr2)

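The code above creates one-, two-, and three-dimensional numpy arrays. In numpy, two-dimensional arrays are used most often; for three or more dimensions other modules are usually used.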

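Basic attributes of numpy arrays

Attribute	Description
dtype	Data type of the array elements
size	Number of elements in the array
ndim	Number of dimensions of the array
shape	Size of each dimension (as a tuple)
T	Transpose of the array
astype()	Type conversion (a method, not an attribute; it returns a converted copy)

Common dtypes: bool_, int (8, 16, 32, 64), float (16, 32, 64)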
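The table lists astype but the examples never call it, so here is a small sketch. The variable names are illustrative; note that converting floats to integers truncates toward zero rather than rounding.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# astype returns a new array; the original keeps its dtype
arr_float = arr.astype(np.float64)
print(arr_float.dtype)  # float64

# Float -> int conversion drops the fractional part (truncation toward zero)
truncated = np.array([1.7, -1.7]).astype(np.int64)
print(truncated)  # [ 1 -1]
```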

arr=np.array([[1,2,3],[4,5,6]])

print(arr,arr.dtype)

print(arr.T)

print(arr.size)

print(arr.ndim)

print(arr.shape)

[[1 2 3]

[4 5 6]] int32

[[1 4]

[2 5]

[3 6]]

6

2

(2, 3)

Getting the number of rows and columns

numpy arrays can have many dimensions; a two-dimensional array has both rows and columns.

arr=np.array([[1,2,3],[4,5,6]])

print(arr)

print(arr.shape)

print(arr.shape[0])

print(arr.shape[1])

[[1 2 3]

[4 5 6]]

(2, 3)

2

3

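Slicing numpy arrays

Slicing a numpy array is similar to slicing a list. The difference is that an array slice can address rows and columns at the same time. In both cases indexing starts at 0 and the start of a slice is included while the end is excluded.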

arr=np.array([[1,2,3],[4,5,6]])

print(arr)

print(arr[:,:])

print(arr[:1,1:])

print(arr[0,:])

print(arr[:,0])

print(arr[arr>5])

[[1 2 3]

[4 5 6]]

[[1 2 3]

[4 5 6]]

[[2 3]]

[1 2 3]

[1 4]

[6]

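Replacing elements of a numpy array

Replacing elements of a numpy array works much like replacing elements of a list. A numpy array is also a mutable type, so an assignment modifies the elements of the original array. For that reason the examples below use the .copy() method to demonstrate replacement.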

arr=np.array([[1,2,3],[4,5,6]])

arr1=arr.copy()

arr1[:1,:]=0

print(arr1)

arr2=arr.copy()

arr2[arr2>5]=0

print(arr2)

arr[:,:]=0

print(arr)

[[0 0 0]

[4 5 6]]

[[1 2 3]

[4 5 0]]

[[0 0 0]

[0 0 0]]
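A related detail explains why .copy() matters here. The sketch below (variable names are illustrative) shows that a basic slice is a view onto the original array's memory, whereas .copy() produces independent data.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# A basic slice is a view: writing through it changes arr as well
first_row = arr[0, :]
first_row[0] = 99

# A copy owns its data: writing to it leaves arr untouched
independent = arr.copy()
independent[1, 1] = -1

print(arr)
print(np.shares_memory(arr, first_row))    # True
print(np.shares_memory(arr, independent))  # False
```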

Merging numpy arrays

arr1 = np.array([[1, 2], [3, 4], [5, 6]])

print(arr1)

[[1 2]

 [3 4]

 [5 6]]

arr2 = np.array([[7, 8], [9, 10], [11, 12]])

print(arr2)

[[ 7  8]

 [ 9 10]

 [11 12]]

# Stack the two arrays horizontally (each row gets longer). The arrays passed to hstack() must have the same number of rows; the h stands for horizontal

print(np.hstack((arr1, arr2)))

[[ 1  2  7  8]

 [ 3  4  9 10]

 [ 5  6 11 12]]

# Concatenate the two arrays; axis=1 joins them along the column axis, giving the same result as hstack()

print(np.concatenate((arr1, arr2), axis=1))

[[ 1  2  7  8]

 [ 3  4  9 10]

 [ 5  6 11 12]]

# Stack the two arrays vertically (rows are appended). The arrays passed to vstack() must have the same number of columns; the v stands for vertical

print(np.vstack((arr1, arr2)))

[[ 1  2]

 [ 3  4]

 [ 5  6]

 [ 7  8]

 [ 9 10]

 [11 12]]

# Concatenate the two arrays; axis=0 joins them along the row axis, giving the same result as vstack()

print(np.concatenate((arr1, arr2), axis=0))

[[ 1  2]

 [ 3  4]

 [ 5  6]

 [ 7  8]

 [ 9 10]

 [11 12]]

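Function	Description
array()	Convert a list to an array; dtype may be given explicitly
arange()	numpy version of range; also supports floating-point values
linspace()	Similar to arange(), but the third argument is the number of elements
zeros()	Create an all-zeros array with the given shape and dtype
ones()	Create an all-ones array with the given shape and dtype
eye()	Create an identity matrix
empty()	Create an uninitialized array; its contents are whatever was left in memory, not generated random numbers
reshape()	Change the shape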

array

arr = np.array([1, 2, 3])

print(arr)

[1 2 3]

arange

# Build an ndarray holding 0 through 9

print(np.arange(10))

[0 1 2 3 4 5 6 7 8 9]

# Build an ndarray holding 1 through 4

print(np.arange(1, 5))

[1 2 3 4]

# Build an ndarray holding 1 through 19 with a step of 2

print(np.arange(1, 20, 2))

[ 1  3  5  7  9 11 13 15 17 19]

linspace/logspace

# Build an arithmetic sequence of 5 numbers from 0 to 20; both endpoints are included

print(np.linspace(0, 20, 5))

[ 0.  5. 10. 15. 20.]

# Build a geometric sequence of 5 numbers from 10**0 to 10**20

print(np.logspace(0, 20, 5))

[1.e+00 1.e+05 1.e+10 1.e+15 1.e+20]

zeros/ones/eye/empty

# Build a 3*4 array of zeros

print(np.zeros((3, 4)))

[[0. 0. 0. 0.]

 [0. 0. 0. 0.]

 [0. 0. 0. 0.]]

# Build a 3*4 array of ones

print(np.ones((3, 4)))

[[1. 1. 1. 1.]

 [1. 1. 1. 1.]

 [1. 1. 1. 1.]]

# Build a 3*3 identity matrix

print(np.eye(3))

[[1. 0. 0.]

 [0. 1. 0.]

 [0. 0. 1.]]

# Build a 4*4 uninitialized array; the values are whatever happened to be in memory, not generated random numbers

print(np.empty((4, 4)))

[[ 2.31584178e+077 -1.49457545e-154  3.95252517e-323  0.00000000e+000]

 [ 0.00000000e+000  0.00000000e+000  0.00000000e+000  0.00000000e+000]

 [ 0.00000000e+000  0.00000000e+000  0.00000000e+000  0.00000000e+000]

 [ 0.00000000e+000  0.00000000e+000  1.29074055e-231  1.11687366e-308]]

reshape

arr = np.ones([2, 2], dtype=int)

print(arr.reshape(4, 1))

[[1]

 [1]

 [1]

 [1]]
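As a further sketch of reshape, one dimension can be given as -1 and numpy infers it from the total number of elements. The array below is illustrative.

```python
import numpy as np

arr = np.arange(12)  # [0, 1, ..., 11]

# Ask for 3 rows and let numpy work out that there must be 4 columns
grid = arr.reshape(3, -1)
print(grid.shape)  # (3, 4)

# The total number of elements must match, otherwise reshape raises ValueError
try:
    arr.reshape(5, -1)
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)  # True
```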

fromstring/fromfunction (for reference)

# fromstring built an ndarray from the ASCII codes of a string's characters.
# Its binary mode is deprecated, so np.frombuffer is applied to the encoded bytes instead.

s = 'abcdef'

# np.int8 means each element occupies 8 bits, i.e. one byte per ASCII character

print(np.frombuffer(s.encode('ascii'), dtype=np.int8))

[ 97  98  99 100 101 102]

def func(i, j):

    """i is the row index of the numpy array and j is the column index"""

    return i * j

# Build a 3*4 numpy array whose elements are computed by the function from their row and column indices (indices start at 0)

print(np.fromfunction(func, (3, 4)))

[[0. 0. 0. 0.]

 [0. 1. 2. 3.]

 [0. 2. 4. 6.]]

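Arithmetic on numpy arrays

Operator	Description
+	Element-wise addition of two numpy arrays
-	Element-wise subtraction of two numpy arrays
*	Element-wise multiplication of two numpy arrays
/	Element-wise true division of two numpy arrays; the result is floating point even for integer inputs (use // for the integer quotient)
%	Element-wise remainder after division
**n	Raise every element of a single numpy array to the n-th power, e.g. **2 squares every element

arr1 = np.array([[1, 2], [3, 4], [5, 6]])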

print(arr1)

[[1 2]

 [3 4]

 [5 6]]

arr2 = np.array([[7, 8], [9, 10], [11, 12]])

print(arr2)

[[ 7  8]

 [ 9 10]

 [11 12]]

print(arr1 + arr2)

[[ 8 10]

 [12 14]

 [16 18]]

print(arr1**2)

[[ 1  4]

 [ 9 16]

 [25 36]]
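The division operators are easy to mix up, so here is a short sketch with illustrative arrays. In Python 3, / on integer arrays returns floats, while // returns the integer quotient.

```python
import numpy as np

a = np.array([7, 8, 9])
b = np.array([2, 2, 2])

true_div = a / b    # floating-point result
floor_div = a // b  # integer quotient
remainder = a % b   # remainder

print(true_div)   # [3.5 4.  4.5]
print(floor_div)  # [3 4 4]
print(remainder)  # [1 0 1]
```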

Functions for numpy array operations

Function	Description
np.sin(arr)	Sine of each element of arr, sin(x)
np.cos(arr)	Cosine of each element of arr, cos(x)
np.tan(arr)	Tangent of each element of arr, tan(x)
np.arcsin(arr)	Inverse sine of each element of arr, arcsin(x)
np.arccos(arr)	Inverse cosine of each element of arr, arccos(x)
np.arctan(arr)	Inverse tangent of each element of arr, arctan(x)
np.exp(arr)	Exponential of each element of arr, e^x
np.sqrt(arr)	Square root of each element of arr, √x

Unary functions: abs, sqrt, exp, log, ceil, floor, rint, trunc, modf, isnan, isinf, cos, sin, tan

Binary functions: add, subtract, multiply, divide, power, mod, maximum, minimum

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(arr)

[[ 1  2  3  4]

 [ 5  6  7  8]

 [ 9 10 11 12]]

# Take the sine of every element of the numpy array

print(np.sin(arr))

[[ 0.84147098  0.90929743  0.14112001 -0.7568025 ]

 [-0.95892427 -0.2794155   0.6569866   0.98935825]

 [ 0.41211849 -0.54402111 -0.99999021 -0.53657292]]

# Take the square root of every element of the numpy array

print(np.sqrt(arr))

[[1.         1.41421356 1.73205081 2.        ]

 [2.23606798 2.44948974 2.64575131 2.82842712]

 [3.         3.16227766 3.31662479 3.46410162]]

# Take the inverse sine of every element; elements outside the domain [-1, 1] become nan

print(np.arcsin(arr * 0.1))

[[0.10016742 0.20135792 0.30469265 0.41151685]

 [0.52359878 0.64350111 0.7753975  0.92729522]

 [1.11976951 1.57079633        nan        nan]]

/Applications/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in arcsin

# Check whether any element of the array is np.nan

print(np.isnan(arr))

[[False False False False]

 [False False False False]

 [False False False False]]

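Matrix operations on numpy arrays

Dot product of numpy arrays

The dot product (matrix product) of two numpy arrays requires the number of columns of the first array to equal the number of rows of the second, for example (m×n)·(n×m) = (m×m).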

arr1 = np.array([[1, 2, 3], [4, 5, 6]])

print(arr1.shape)

(2, 3)

arr2 = np.array([[7, 8], [9, 10], [11, 12]])

print(arr2.shape)

(3, 2)

assert arr1.shape[1] == arr2.shape[0]

# 2*3·3*2 = 2*2

print(arr1.dot(arr2))

[[ 58  64]

 [139 154]]

Transposing numpy arrays

Transposing a numpy array swaps its rows and columns.

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr)

[[1 2 3]

 [4 5 6]]

print(arr.transpose())

[[1 4]

 [2 5]

 [3 6]]

print(arr.T)

[[1 4]

 [2 5]

 [3 6]]

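Inverse of a numpy array

A numpy array can be inverted only if it is square (same number of rows and columns) and non-singular (its determinant is non-zero).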

arr = np.array([[1, 2, 3], [4, 5, 6], [9, 8, 9]])

print(arr)

[[1 2 3]

 [4 5 6]

 [9 8 9]]

print(np.linalg.inv(arr))

[[ 0.5        -1.          0.5       ]

 [-3.          3.         -1.        ]

 [ 2.16666667 -1.66666667  0.5       ]]

# The inverse of the identity matrix is the identity matrix itself

arr = np.eye(3)

print(arr)

[[1. 0. 0.]

 [0. 1. 0.]

 [0. 0. 1.]]

print(np.linalg.inv(arr))

[[1. 0. 0.]

 [0. 1. 0.]

 [0. 0. 1.]]
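As a sanity check on the inverse, the sketch below multiplies the matrix from the example by its inverse and compares the result with the identity matrix. It also shows what happens with a square but singular matrix; the matrix named singular is an illustrative assumption.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [9, 8, 9]], dtype=float)
inv = np.linalg.inv(arr)

# A matrix times its inverse should be the identity, up to rounding error
is_identity = np.allclose(arr @ inv, np.eye(3))
print(is_identity)  # True

# Square is not enough: the second row here is twice the first, so det = 0
singular = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    singular_inverted = True
except np.linalg.LinAlgError:
    singular_inverted = False
print(singular_inverted)  # False
```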

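Mathematical and statistical methods of numpy arrays

Method	Description
sum	Sum
cumsum	Cumulative sum
mean	Mean
std	Standard deviation
var	Variance
min	Minimum
max	Maximum
argmin	Index of the minimum
argmax	Index of the maximum
sort	Sort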

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(arr)

[[1 2 3]

 [4 5 6]

 [7 8 9]]

# Maximum over all elements of the numpy array

print(arr.max())

9

# Minimum over all elements of the numpy array

print(arr.min())

1

# Maximum of each column (axis=0 reduces along the rows)

print(arr.max(axis=0))

[7 8 9]

# Maximum of each row (axis=1 reduces along the columns)

print(arr.max(axis=1))

[3 6 9]

# Index of the maximum element within each row

print(arr.argmax(axis=1))

[2 2 2]
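The axis argument works the same way for the other methods in the table. A short sketch on the same 3*3 array (the result names are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_sums = arr.sum(axis=0)    # one value per column
row_means = arr.mean(axis=1)  # one value per row
running = arr.cumsum()        # without axis, the array is flattened first

print(col_sums)   # [12 15 18]
print(row_means)  # [2. 5. 8.]
print(running)    # [ 1  3  6 10 15 21 28 36 45]
```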

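Generating random numbers with numpy.random

Function	Purpose	Parameters
rand(d0, d1, …, dn)	Uniformly distributed random numbers	dn is the size of the n-th dimension
randn(d0, d1, …, dn)	Random numbers from the standard normal distribution	dn is the size of the n-th dimension
randint(low[, high, size, dtype])	Random integers	low: minimum; high: maximum (exclusive); size: number of values
random_sample([size])	Random numbers in [0, 1)	size is the shape of the output; it may be a tuple or a list
choice(a[, size])	Randomly select values from a	a is a 1-D array; size is the output shape
uniform(low, high[, size])	Uniformly distributed random array of the given shape	low: minimum; high: maximum; size: output shape
shuffle(a)	Same as random.shuffle	a is the array to shuffle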
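The table has no accompanying example, so here is a hedged sketch that calls each listed function once. The seed value and the shapes are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(1)  # fix the seed so repeated runs produce the same numbers

uniform_grid = np.random.rand(2, 3)              # shape (2, 3), values in [0, 1)
normal_grid = np.random.randn(2, 3)              # standard normal, shape (2, 3)
dice = np.random.randint(1, 7, size=10)          # integers 1..6, high is exclusive
picked = np.random.choice(np.arange(5), size=3)  # sampled from [0, 1, 2, 3, 4]
scaled = np.random.uniform(-1, 1, size=4)        # values in [-1, 1)

deck = np.arange(5)
np.random.shuffle(deck)                          # shuffles in place and returns None

print(uniform_grid.shape, dice, deck)
```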

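The pandas module

Official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750

pandas is built on NumPy and can be thought of as a tool for processing text or tabular data. It has two main data structures: Series, which resembles a one-dimensional NumPy array, and DataFrame, which resembles a two-dimensional table.

pandas is the core module for data analysis in Python. It provides five main capabilities:

File input and output, including databases (sql), html, json, pickle, csv (txt, excel), sas, stata, hdf, and more.
Single-table operations such as insert, delete, update, lookup, slicing, higher-order functions, and group-by aggregation, plus conversion to and from dict and list.
Joining and merging multiple tables.
Simple plotting.
Simple statistical analysis.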

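1. The Series data structure

A Series is an object similar to a one-dimensional array. It consists of a set of data and an associated set of data labels (the index).

A Series behaves like a cross between a list (array) and a dictionary.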

import numpy as np

import pandas as pd

df = pd.Series(0, index=['a', 'b', 'c', 'd'])

print(df)

a    0

b    0

c    0

d    0

dtype: int64

print(df.values)

[0 0 0 0]

print(df.index)

Index(['a', 'b', 'c', 'd'], dtype='object')

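1.1 Series supports NumPy-style features (positional indexing)

Description	Usage
Create a Series from an ndarray	Series(arr)
Arithmetic with a scalar	df*2
Arithmetic between two Series	df1+df2
Indexing	df[0], df[[1,2,4]]
Slicing	df[0:2]
Universal functions	np.abs(df)
Boolean filtering	df[df>0]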

arr = np.array([1, 2, 3, 4, np.nan])

print(arr)

[ 1.  2.  3.  4. nan]

df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])

print(df)

a    1.0

b    2.0

c    3.0

d    4.0

e    NaN

dtype: float64

print(df**2)

a     1.0

b     4.0

c     9.0

d    16.0

e     NaN

dtype: float64

print(df[0])

1.0

print(df['a'])

1.0

print(df[[0, 1, 2]])

a    1.0

b    2.0

c    3.0

dtype: float64

print(df[0:2])

a    1.0

b    2.0

dtype: float64

np.sin(df)

a    0.841471

b    0.909297

c    0.141120

d   -0.756802

e         NaN

dtype: float64

df[df > 1]

b    2.0

c    3.0

d    4.0

dtype: float64
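The table mentions arithmetic between two Series, which the examples above do not cover. As a sketch with made-up data: values are matched by index label rather than by position, and labels present in only one Series produce NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# 'b' and 'c' exist in both; 'a' and 'd' have no partner and become NaN
total = s1 + s2
print(total)
```

The result has the union of both indexes (a, b, c, d), with 12.0 at 'b' and 23.0 at 'c'.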

1.2 Series supports dictionary-style features (labels)

Description	Usage
Create a Series from a dictionary	Series(dic)
in operator	'a' in sr
Key-based indexing	sr['a'], sr[['a', 'b', 'd']]

df = pd.Series({'a': 1, 'b': 2})

print(df)

a    1

b    2

dtype: int64

print('a' in df)

True

print(df['a'])

1

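1.3 Handling missing data in a Series

Method	Description
dropna()	Drop the entries whose value is NaN
fillna()	Fill in missing data
isnull()	Return a boolean array that is True where the value is missing
notnull()	Return a boolean array that is False where the value is missing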

df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])

print(df)

a    1.0

b    2.0

c    3.0

d    4.0

e    NaN

dtype: float64

print(df.dropna())

a    1.0

b    2.0

c    3.0

d    4.0

dtype: float64

print(df.fillna(5))

a    1.0

b    2.0

c    3.0

d    4.0

e    5.0

dtype: float64

print(df.isnull())

a    False

b    False

c    False

d    False

e     True

dtype: bool

print(df.notnull())

a     True

b     True

c     True

d     True

e    False

dtype: bool

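2. The DataFrame data structure

A DataFrame is a tabular data structure that contains an ordered set of columns.

A DataFrame can be viewed as a dictionary of Series that all share one index.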
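The "dictionary of Series" view can be sketched directly. The column names and values below are made up for illustration; the shorter Series is padded with NaN so that both columns share one index.

```python
import pandas as pd

c1 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
c2 = pd.Series([10.0, 20.0], index=['x', 'y'])

# Each dictionary key becomes a column; rows are aligned on the shared index
df = pd.DataFrame({'c1': c1, 'c2': c2})
print(df)

# Selecting one column gives back a Series
column_is_series = isinstance(df['c1'], pd.Series)
print(column_is_series)  # True
```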

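2.1 Creating an array of datetime objects: date_range

Parameters of date_range:

Parameter	Description
start	Start time
end	End time
periods	Number of periods to generate
freq	Frequency, default 'D' (day); other options include H (hour), W (week), B (business day), SM (semi-month), T (minute), S (second), A (year), …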

dates = pd.date_range('20190101', periods=6, freq='M')

print(dates)

DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',

               '2019-05-31', '2019-06-30'],

              dtype='datetime64[ns]', freq='M')

np.random.seed(1)

arr = 10 * np.random.randn(6, 4)

print(arr)

[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]

 [  8.65407629 -23.01538697  17.44811764  -7.61206901]

 [  3.19039096  -2.49370375  14.62107937 -20.60140709]

 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]

 [ -1.72428208  -8.77858418   0.42213747   5.82815214]

 [-11.00619177  11.4472371    9.01590721   5.02494339]]

df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

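3. DataFrame attributes

Attribute	Description
dtypes	Data type of each column
index	Row index
columns	Column labels
values	The data in the frame, without the header or index
describe	Summary statistics of each column (extremes, mean, median); numeric data only
transpose	Transpose; T does the same
sort_index	Sort by the row or column index
sort_values	Sort by the data values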

# Check the data type of each column

print(df.dtypes)

c1    float64

c2    float64

c3    float64

c4    float64

dtype: object

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

print(df.index)

DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',

               '2019-05-31', '2019-06-30'],

              dtype='datetime64[ns]', freq='M')

print(df.columns)

Index(['c1', 'c2', 'c3', 'c4'], dtype='object')

print(df.values)

[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]

 [  8.65407629 -23.01538697  17.44811764  -7.61206901]

 [  3.19039096  -2.49370375  14.62107937 -20.60140709]

 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]

 [ -1.72428208  -8.77858418   0.42213747   5.82815214]

 [-11.00619177  11.4472371    9.01590721   5.02494339]]

df.describe()

	c1	c2	c3	c4
count	6.000000	6.000000	6.000000	6.000000
mean	2.022213	-5.466424	7.927203	-6.514830
std	9.580084	11.107772	8.707171	10.227641
min	-11.006192	-23.015387	-5.281718	-20.601407
25%	-2.849200	-8.113329	2.570580	-10.931606
50%	0.733054	-4.979054	10.176801	-9.170878
75%	7.288155	-2.830414	13.800233	1.865690
max	16.243454	11.447237	17.448118	5.828152

df.T

	2019-01-31 00:00:00	2019-02-28 00:00:00	2019-03-31 00:00:00	2019-04-30 00:00:00	2019-05-31 00:00:00	2019-06-30 00:00:00
c1	16.243454	8.654076	3.190391	-3.224172	-1.724282	-11.006192
c2	-6.117564	-23.015387	-2.493704	-3.840544	-8.778584	11.447237
c3	-5.281718	17.448118	14.621079	11.337694	0.422137	9.015907
c4	-10.729686	-7.612069	-20.601407	-10.998913	5.828152	5.024943

# Sort by the row index (the dates) in ascending order

df.sort_index(axis=0)

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Sort by the column labels [c1, c2, c3, c4] in ascending order

df.sort_index(axis=1)

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Sort by the values in column c2 in ascending order (the default)

df.sort_values(by='c2')

	c1	c2	c3	c4
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-06-30	-11.006192	11.447237	9.015907	5.024943
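Both sort methods sort in ascending order unless told otherwise. A small sketch with made-up data shows how ascending=False reverses the order:

```python
import pandas as pd

df = pd.DataFrame({'c1': [3, 1, 2], 'c2': [-6.0, -23.0, 11.0]},
                  index=['r1', 'r2', 'r3'])

# Largest c2 value first
by_value_desc = df.sort_values(by='c2', ascending=False)

# Row labels in reverse order
by_index_desc = df.sort_index(axis=0, ascending=False)

print(by_value_desc.index.tolist())  # ['r3', 'r1', 'r2']
print(by_index_desc.index.tolist())  # ['r3', 'r2', 'r1']
```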

4. Selecting values from a DataFrame

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

4.1 Selecting by column

df['c2']

2019-01-31    -6.117564

2019-02-28   -23.015387

2019-03-31    -2.493704

2019-04-30    -3.840544

2019-05-31    -8.778584

2019-06-30    11.447237

Freq: M, Name: c2, dtype: float64

df[['c2', 'c3']]

	c2	c3
2019-01-31	-6.117564	-5.281718
2019-02-28	-23.015387	17.448118
2019-03-31	-2.493704	14.621079
2019-04-30	-3.840544	11.337694
2019-05-31	-8.778584	0.422137
2019-06-30	11.447237	9.015907

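4.2 loc (selecting by row label)

# Select rows by their index labels; both endpoints of a label slice are included

df.loc['2019-01-31':'2019-03-31']

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407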

df[0:3]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

4.3 iloc (similar to indexing a numpy array)

df.values

array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],

       [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],

       [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],

       [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],

       [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],

       [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])

# Select data by integer row and column positions

print(df.iloc[2, 1])

-2.493703754774101

df.iloc[1:4, 1:4]

	c2	c3	c4
2019-02-28	-23.015387	17.448118	-7.612069
2019-03-31	-2.493704	14.621079	-20.601407
2019-04-30	-3.840544	11.337694	-10.998913
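The difference between loc and iloc is easy to get wrong, so here is a sketch on a small made-up frame: a loc slice uses labels and includes its end label, while an iloc slice uses positions and excludes its end position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3', 'c4'])

by_label = df.loc['r1':'r2', ['c1', 'c3']]  # label slice: 'r2' is included
by_position = df.iloc[0:2, [0, 2]]          # position slice: row 2 is excluded

print(by_label)
print(by_label.equals(by_position))  # True
```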

4.4 Selecting with boolean conditions

df[df['c1'] > 0]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

df[(df['c1'] > 0) & (df['c2'] > -8)]

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-03-31	3.190391	-2.493704	14.621079	-20.601407

5. Replacing values in a DataFrame

df

	c1	c2	c3	c4
2019-01-31	16.243454	-6.117564	-5.281718	-10.729686
2019-02-28	8.654076	-23.015387	17.448118	-7.612069
2019-03-31	3.190391	-2.493704	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

df.iloc[0:3, 0:2] = 0

df

	c1	c2	c3	c4
2019-01-31	0.000000	0.000000	-5.281718	-10.729686
2019-02-28	0.000000	0.000000	17.448118	-7.612069
2019-03-31	0.000000	0.000000	14.621079	-20.601407
2019-04-30	-3.224172	-3.840544	11.337694	-10.998913
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

df['c3'] > 10

2019-01-31    False

2019-02-28     True

2019-03-31     True

2019-04-30     True

2019-05-31    False

2019-06-30    False

Freq: M, Name: c3, dtype: bool

# Assign to every row where the condition holds

df[df['c3'] > 10] = 100

df

	c1	c2	c3	c4
2019-01-31	0.000000	0.000000	-5.281718	-10.729686
2019-02-28	100.000000	100.000000	100.000000	100.000000
2019-03-31	100.000000	100.000000	100.000000	100.000000
2019-04-30	100.000000	100.000000	100.000000	100.000000
2019-05-31	-1.724282	-8.778584	0.422137	5.828152
2019-06-30	-11.006192	11.447237	9.015907	5.024943

# Assign to every row whose c3 value is in the given list

df = df.astype(np.int32)

df[df['c3'].isin([100])] = 1000

df

	c1	c2	c3	c4
2019-01-31	0	0	-5	-10
2019-02-28	1000	1000	1000	1000
2019-03-31	1000	1000	1000	1000
2019-04-30	1000	1000	1000	1000
2019-05-31	-1	-8	0	5
2019-06-30	-11	11	9	5

6. Reading CSV files

import pandas as pd

from io import StringIO

test_data = '''

5.1,,1.4,0.2

4.9,3.0,1.4,0.2

4.7,3.2,,0.2

7.0,3.2,4.7,1.4

6.4,3.2,4.5,1.5

6.9,3.1,4.9,

,,,

'''

test_data = StringIO(test_data)

df = pd.read_csv(test_data, header=None)

df.columns = ['c1', 'c2', 'c3', 'c4']

df

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN
6	NaN	NaN	NaN	NaN

7. Handling missing data

df.isnull()

	c1	c2	c3	c4
0	False	True	False	False
1	False	False	False	False
2	False	False	True	False
3	False	False	False	False
4	False	False	False	False
5	False	False	False	True
6	True	True	True	True

# Chaining sum() after isnull() gives the number of missing values in each column (feature) of the data set

print(df.isnull().sum())

c1    1

c2    2

c3    2

c4    2

dtype: int64

# axis=0 drops the rows that contain NaN

df.dropna(axis=0)

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5

# axis=1 drops the columns that contain NaN

df.dropna(axis=1)


0
1
2
3
4
5
6

# Drop the rows in which every value is NaN

df.dropna(how='all')

	c1	c2	c3	c4
0	5.1	NaN	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

# Keep only the rows that have at least 4 non-NaN values

df.dropna(thresh=4)

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5

# Drop the rows that have NaN in column c2

df.dropna(subset=['c2'])

	c1	c2	c3	c4
1	4.9	3.0	1.4	0.2
2	4.7	3.2	NaN	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	NaN

# Fill in the NaN values

df.fillna(value=10)

	c1	c2	c3	c4
0	5.1	10.0	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	10.0	0.2
3	7.0	3.2	4.7	1.4
4	6.4	3.2	4.5	1.5
5	6.9	3.1	4.9	10.0
6	10.0	10.0	10.0	10.0
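Filling with a constant is the simplest option. As a hedged alternative sketch on made-up data, passing df.mean() to fillna fills each column's gaps with that column's own mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [5.1, 4.9, np.nan],
                   'c2': [np.nan, 3.0, 3.2]})

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaN values are skipped when averaging)
filled = df.fillna(df.mean())
print(filled)
```

Here the missing c1 value becomes 5.0 and the missing c2 value becomes 3.1.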

8. Merging data

df1 = pd.DataFrame(np.zeros((3, 4)))

df1

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0

df2 = pd.DataFrame(np.ones((3, 4)))

df2

	0	1	2	3
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

# axis=0 stacks the frames vertically (appends rows)

pd.concat((df1, df2), axis=0)

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0

# axis=1 places the frames side by side (adds columns)

pd.concat((df1, df2), axis=1)

	0	1	2	3	0	1	2	3
0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0
1	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0
2	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0

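# append can only stack vertically (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the replacement)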

df1.append(df2)

	0	1	2	3
0	0.0	0.0	0.0	0.0
1	0.0	0.0	0.0	0.0
2	0.0	0.0	0.0	0.0
0	1.0	1.0	1.0	1.0
1	1.0	1.0	1.0	1.0
2	1.0	1.0	1.0	1.0
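The outputs above keep the original row labels, so 0, 1, 2 appear twice after stacking. A sketch of the same merge with pd.concat and ignore_index=True, which renumbers the rows:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)))
df2 = pd.DataFrame(np.ones((3, 4)))

# Stack vertically and renumber the rows 0..5
stacked = pd.concat([df1, df2], ignore_index=True)

# Place side by side: 3 rows, 8 columns
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(side_by_side.shape)      # (3, 8)
```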

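9. Importing and exporting data

Use df = pd.read_excel(filename) to read a file and df.to_excel(filename) to save one.

9.1 Reading files to import data

Main parameters of the file-reading functions:

Parameter	Description
sep	Delimiter; a regular expression such as '\s+' is allowed
header=None	The file has no header row of column names
names	Column names to use
index_col	Column to use as the index
skiprows	Rows to skip
na_values	Strings to treat as missing values
parse_dates	Whether to parse columns as dates; a boolean or a list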

df = pd.read_excel(filename)

df = pd.read_csv(filename)

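9.2 Writing files to export data

Main parameters of the file-writing functions:

Parameter	Description
sep	Delimiter
na_rep	String to write for missing values; the default is an empty string
header=False	Do not write the column names
index=False	Do not write the row index
columns	Columns to write, given as a list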

df.to_excel(filename)
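To see several of these parameters together, the sketch below round-trips a small made-up frame through CSV text held in memory. It uses to_csv and read_csv with StringIO instead of Excel so that no extra Excel engine is needed; the separator and the 'NA' marker are arbitrary choices.

```python
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1.0, np.nan], 'c2': ['x', 'y']})

# Write: semicolon-separated, missing values written as 'NA', no row index
buffer = StringIO()
df.to_csv(buffer, sep=';', na_rep='NA', index=False)
text = buffer.getvalue()
print(text)

# Read back: same separator, and 'NA' is treated as a missing value
restored = pd.read_csv(StringIO(text), sep=';', na_values=['NA'])
print(restored)
```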

10. Reading JSON with pandas

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\

{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\

{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\

{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\

{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

df = pd.read_json(strtext, orient='records')

df

	code	code1	code2	issue	time	ttery
0	8,4,5,2,9	297734529	NaN	20130801-3391	1013395466000	min
1	7,8,2,1,2	298058212	NaN	20130801-3390	1013395406000	min
2	5,9,1,2,9	298329129	NaN	20130801-3389	1013395346000	min
3	3,8,7,3,3	298588733	NaN	20130801-3388	1013395286000	min
4	0,8,5,2,7	298818527	NaN	20130801-3387	1013395226000	min

df.to_excel('pandas处理json.xlsx',

            index=False,

            columns=["ttery", "issue", "code", "code1", "code2", "time"])

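10.1 The five forms of the orient parameter

orient describes the expected format of the JSON string. It accepts the following five values:

1.'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

This JSON format consists of an index, the column names, and a matrix of data. The keys must be named index, columns, and data.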

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'

df = pd.read_json(s, orient='split')

df

	a	b
1	1	3
2	2	8
3	3	9

2.'records' : list like [{column -> value}, ... , {column -> value}]

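This is a list whose members are dictionaries, like the JSON data in the example above. In each dictionary the keys are column names and the values are the cell values, so every dictionary becomes one row of the DataFrame.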

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\

{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]'

df = pd.read_json(strtext, orient='records')

df

	code	code1	code2	issue	time	ttery
0	8,4,5,2,9	297734529	NaN	20130801-3391	1013395466000	min
1	7,8,2,1,2	298058212	NaN	20130801-3390	1013395406000	min

3.'index' : dict like {index -> {column -> value}}

The index label is the key, and its value is a dictionary mapping column names to values. For example:

s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'

df = pd.read_json(s, orient='index')

df

	a	b
0	1	2
1	9	11

4.'columns' : dict like {column -> {index -> value}}

Here each column name is a key whose value is a dictionary, and that dictionary maps index labels to values. For example:

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'

df = pd.read_json(s, orient='columns')

df

	a	b
0	1	2
1	9	11

5.'values' : just the values array.

This form is the most familiar one: a nested list, two levels deep, whose members are themselves lists.

s = '[["a",1],["b",2]]'

df = pd.read_json(s, orient='values')

df

	0	1
0	a	1
1	b	2
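The same orient values apply when writing. As a sketch on a made-up frame, to_json shows what two of the layouts look like, and read_json reads one of them back (the string is wrapped in StringIO, which newer pandas versions prefer over a raw string).

```python
import json
from io import StringIO

import pandas as pd

df = pd.DataFrame({'a': [1, 9], 'b': [2, 11]})

as_records = df.to_json(orient='records')  # list of {column: value} rows
as_split = df.to_json(orient='split')      # separate index, columns, data

print(as_records)
print(as_split)

# Reading the 'records' text back reproduces the original values
restored = pd.read_json(StringIO(as_records), orient='records')
print(restored)
```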

11. Running SQL queries with pandas

import numpy as np

import pandas as pd

import pymysql

def conn(sql):

    # Connect to the MySQL database

    conn = pymysql.connect(

        host="localhost",

        port=3306,

        user="root",

        passwd="123",

        db="db1",

    )

    try:

        data = pd.read_sql(sql, con=conn)

        return data

    except Exception as e:

        print("SQL is not correct!")

    finally:

        conn.close()

sql = "select * from test1 limit 0, 10"  # the SQL statement

data = conn(sql)

print(data.columns.tolist())  # inspect the column names

print(data)  # inspect the data
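The example above needs a running MySQL server. For trying the same idea without one, here is a sketch that swaps in Python's built-in sqlite3 with an in-memory database; the table contents are made up, and only the table name test1 is borrowed from the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL server
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE test1 (id INTEGER, name TEXT)')
connection.executemany('INSERT INTO test1 VALUES (?, ?)',
                       [(1, 'a'), (2, 'b'), (3, 'c')])
connection.commit()

try:
    data = pd.read_sql('SELECT * FROM test1 LIMIT 2', con=connection)
finally:
    connection.close()  # always release the connection

print(data.columns.tolist())  # ['id', 'name']
print(data)
```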

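The matplotlib module

Official matplotlib documentation: https://matplotlib.org/contents.html?v=20190307135750

matplotlib is a plotting library that can create common statistical charts, including bar charts, box plots, line charts, scatter plots, pie charts, and histograms.

1. Bar charts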

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

%matplotlib inline

font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with grid lines)

plt.style.use('ggplot')

classes = ['3班', '4班', '5班', '6班']

classes_index = range(len(classes))

print(list(classes_index))

[0, 1, 2, 3]

student_amounts = [66, 55, 45, 70]

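# Set up the canvas

fig = plt.figure()

# 1,1,1 splits the canvas into 1 row and 1 column (one plot) and selects the 1st; 2,2,1 would split it into 2 rows and 2 columns (four plots) and select the first (top left)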

ax1 = fig.add_subplot(1, 1, 1)

ax1.bar(classes_index, student_amounts, align='center', color='darkblue')

ax1.xaxis.set_ticks_position('bottom')

ax1.yaxis.set_ticks_position('left')

plt.xticks(classes_index,

           classes,

           rotation=0,

           fontsize=13,

           fontproperties=font)

plt.xlabel('班级', fontproperties=font, fontsize=15)

plt.ylabel('学生人数', fontproperties=font, fontsize=15)

plt.title('班级-学生人数', fontproperties=font, fontsize=20)

# Save the figure; bbox_inches='tight' trims the blank margin around the plot

# plt.savefig('classes_students.png', dpi=400, bbox_inches='tight')

plt.show()

2. Histograms

import numpy as np

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

%matplotlib inline

font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with grid lines)

plt.style.use('ggplot')

mu1, mu2, sigma = 50, 100, 10

# Generate normally distributed data with mean 50

x1 = mu1 + sigma * np.random.randn(10000)

print(x1)

[59.00855949 43.16272141 48.77109774 ... 57.94645859 54.70312714

 58.94125528]

# Generate normally distributed data with mean 100

x2 = mu2 + sigma * np.random.randn(10000)

print(x2)

[115.19915511  82.09208214 110.88092454 ...  95.0872103  104.21549068

 133.36025251]

fig = plt.figure()

ax1 = fig.add_subplot(121)

# bins=50 divides the value range into 50 intervals, so there are 50 bars

ax1.hist(x1, bins=50, color='darkgreen')

ax2 = fig.add_subplot(122)

ax2.hist(x2, bins=50, color='orange')

fig.suptitle('两个正态分布', fontproperties=font, fontweight='bold', fontsize=15)

ax1.set_title('绿色的正态分布', fontproperties=font)

ax2.set_title('橙色的正态分布', fontproperties=font)

plt.show()

3. Line charts

import numpy as np

from numpy.random import randn

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

%matplotlib inline

font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with grid lines)

plt.style.use('ggplot')

np.random.seed(1)

# Take numpy's cumulative sum so the values drift over a wider range instead of just fluctuating around 0

plot_data1 = randn(40).cumsum()

print(plot_data1)

[ 1.62434536  1.01258895  0.4844172  -0.58855142  0.2768562  -2.02468249

 -0.27987073 -1.04107763 -0.72203853 -0.97140891  0.49069903 -1.56944168

 -1.89185888 -2.27591324 -1.1421438  -2.24203506 -2.41446327 -3.29232169

 -3.25010794 -2.66729273 -3.76791191 -2.6231882  -1.72159748 -1.21910314

 -0.31824719 -1.00197505 -1.12486527 -2.06063471 -2.32852279 -1.79816732

 -2.48982807 -2.8865816  -3.5737543  -4.41895994 -5.09020607 -5.10287067

 -6.22018102 -5.98576532 -4.32596314 -3.58391898]

plot_data2 = randn(40).cumsum()

plot_data3 = randn(40).cumsum()

plot_data4 = randn(40).cumsum()

plt.plot(plot_data1, marker='o', color='red', linestyle='-', label='红实线')

plt.plot(plot_data2, marker='x', color='orange', linestyle='--', label='橙虚线')

plt.plot(plot_data3, marker='*', color='yellow', linestyle='-.', label='黄点线')

plt.plot(plot_data4, marker='s', color='green', linestyle=':', label='绿点图')

# loc='best' lets matplotlib choose the best position for the legend

plt.legend(loc='best', prop=font)

plt.show()

4. Scatter plot + line plot

import numpy as np

from numpy.random import randn

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

%matplotlib inline

font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Use the ggplot style (grey background with grid lines)

plt.style.use('ggplot')

x = np.arange(1, 20, 1)

print(x)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

# Generate scattered points around the straight line y = x

np.random.seed(1)

y_linear = x + 10 * np.random.randn(19)

print(y_linear)

[ 17.24345364  -4.11756414  -2.28171752  -6.72968622  13.65407629

 -17.01538697  24.44811764   0.38793099  12.19039096   7.50629625

  25.62107937  -8.60140709   9.77582796  10.15945645  26.33769442

   5.00108733  15.27571792   9.22141582  19.42213747]

# Generate scattered points around the curve y = x²

y_quad = x**2 + 10 * np.random.randn(19)

print(y_quad)

[  6.82815214  -7.00619177  20.4472371   25.01590721  30.02494339

  45.00855949  42.16272141  62.77109774  71.64230566  97.3211192

 126.30355467 137.08339248 165.03246473 189.128273   216.54794359

 249.28753869 288.87335401 312.82689651 363.34415698]

# s is the marker size; the labels now match the colors (r is red, b is blue)

fig = plt.figure()

ax1 = fig.add_subplot(121)

plt.scatter(x, y_linear, s=30, color='r', label='红点')

plt.scatter(x, y_quad, s=100, color='b', label='蓝点')

ax2 = fig.add_subplot(122)

plt.plot(x, y_linear, color='r')

plt.plot(x, y_quad, color='b')

# Limit the ranges of the x and y axes

plt.xlim(min(x) - 1, max(x) + 1)

plt.ylim(min(y_quad) - 10, max(y_quad) + 10)

fig.suptitle('散点图+直线图', fontproperties=font, fontsize=20)

ax1.set_title('散点图', fontproperties=font)

ax1.legend(prop=font)

ax2.set_title('直线图', fontproperties=font)

plt.show()

5. Pie charts

import numpy as np

import matplotlib.pyplot as plt

from pylab import mpl

mpl.rcParams['font.sans-serif'] = ['SimHei']

fig, ax = plt.subplots(subplot_kw=dict(aspect="equal"))

recipe = ['优', '良', '轻度污染', '中度污染', '重度污染', '严重污染', '缺']

data = [2, 49, 21, 9, 11, 6, 2]

colors = ['lime', 'yellow', 'darkorange', 'red', 'purple', 'maroon', 'grey']

wedges, texts, texts2 = ax.pie(data,

                               wedgeprops=dict(width=0.5),

                               startangle=40,

                               colors=colors,

                               autopct='%1.0f%%',

                               pctdistance=0.8)

plt.setp(texts2, size=14, weight="bold")

bbox_props = dict(boxstyle="square,pad=0.3", fc="w", ec="k", lw=0.72)

kw = dict(xycoords='data',

          textcoords='data',

          arrowprops=dict(arrowstyle="->"),

          bbox=None,

          zorder=0,

          va="center")

for i, p in enumerate(wedges):

    ang = (p.theta2 - p.theta1) / 2. + p.theta1

    y = np.sin(np.deg2rad(ang))

    x = np.cos(np.deg2rad(ang))

    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]

    connectionstyle = "angle,angleA=0,angleB={}".format(ang)

    kw["arrowprops"].update({"connectionstyle": connectionstyle})

    ax.annotate(recipe[i],

                xy=(x, y),

                xytext=(1.25 * np.sign(x), 1.3 * y),

                size=16,

                horizontalalignment=horizontalalignment,

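                **kw)

# The SimHei font set through rcParams above already handles the Chinese text,
# so no fontproperties argument is needed (font is not defined in this snippet)

ax.set_title("饼图示例")

# To save the figure, call savefig before plt.show():
# plt.savefig('jiaopie2.png')

plt.show()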

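6. Box plots

A box plot (also called a box-and-whisker plot) is a statistical chart that shows how a set of data is spread out; in data analysis it is often used for outlier detection.

It summarizes a data set by its maximum, minimum, median, upper quartile (Q3), lower quartile (Q1), and outliers.

Median → the middle value when the sorted data is split into two equal halves
Lower quartile Q1 → the sorted sequence is split into four equal parts; the position is computed with either (n+1)/4 or (n-1)/4, and (n+1)/4 is the usual choice
Upper quartile Q3 → the sorted sequence is split into four equal parts; the position is (1+n)/4*3 (for example, 6.75 when n=8)
Inner fences → the T-shaped whiskers mark the inner fences: upper bound Q3+1.5IQR, lower bound Q1-1.5IQR (IQR=Q3-Q1)
Outer fences → upper bound Q3+3IQR, lower bound Q1-3IQR (IQR=Q3-Q1)
Outliers → beyond the inner fences: mild outliers; beyond the outer fences: extreme outliers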
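The fence rules can be checked numerically. This is a sketch on made-up data; note that np.percentile by default interpolates at position (n-1)*p, which is one of several quartile conventions and can differ slightly from the (n+1)/4 rule mentioned above.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Inner fences: points outside them are treated as outliers
lower_inner = q1 - 1.5 * iqr
upper_inner = q3 + 1.5 * iqr

# Outer fences: points outside them are extreme outliers
lower_outer = q1 - 3 * iqr
upper_outer = q3 + 3 * iqr

outliers = data[(data < lower_inner) | (data > upper_inner)]
extreme = data[(data < lower_outer) | (data > upper_outer)]

print(q1, q3, iqr)               # 3.0 7.0 4.0
print(lower_inner, upper_inner)  # -3.0 13.0
print(outliers, extreme)         # [100] [100]
```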

import numpy as np

import pandas as pd

from numpy.random import randn

import matplotlib.pyplot as plt

from matplotlib.font_manager import FontProperties

%matplotlib inline

font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

plt.figure(figsize=(10, 4))

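# Draw the box plot from the data

f = df.boxplot(

    sym='o',  # marker shape for outliers, see marker

    vert=True,  # whether the boxes are vertical

    whis=1.5,  # whisker length as a multiple of the IQR (default 1.5); a pair such as [5, 95] puts the whiskers at the 5th and 95th percentiles

    patch_artist=True,  # whether to fill the box between the quartiles; True fills it

    meanline=False,

    showmeans=True,  # whether to show the mean and in what form

    showbox=True,  # whether to show the box

    showcaps=True,  # whether to show the caps at the whisker ends

    showfliers=True,  # whether to show the outliers

    notch=False,  # whether the box has a notch in the middle

    return_type='dict'  # return a dictionary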

)

plt.title('boxplot')

for box in f['boxes']:

    box.set(color='b', linewidth=1)  # 箱体边框颜色

    box.set(facecolor='b', alpha=0.5)  # 箱体内部填充颜色

for whisker in f['whiskers']:

    whisker.set(color='k', linewidth=0.5, linestyle='-')

for cap in f['caps']:

    cap.set(color='gray', linewidth=2)

for median in f['medians']:

    median.set(color='DarkBlue', linewidth=2)

for flier in f['fliers']:

    flier.set(marker='o', color='y', alpha=0.5)

# boxes, 箱线

# medians, 中位值的横线,

# whiskers, 从box到error bar之间的竖线.

# fliers, 异常值

# caps, error bar横线

# means, 均值的横线
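
The artists in the returned dict can also be read back, for example to list which points were drawn as outliers. The sketch below is an illustration under stated assumptions: it calls Matplotlib's `ax.boxplot` directly rather than `df.boxplot`, the data is made up, and the `Agg` backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 16, 30])  # made-up data

fig, ax = plt.subplots(figsize=(4, 4))
parts = ax.boxplot(data, whis=1.5)  # returns a dict of lists of artists

# One box was drawn, so there is one flier artist and two whisker lines
flier_values = parts["fliers"][0].get_ydata()
print(sorted(parts.keys()))
print("points drawn as outliers:", flier_values)
```

With `whis=1.5` the whiskers stop at the last data point inside Q1 - 1.5 IQR and Q3 + 1.5 IQR, and everything beyond them ends up in `parts["fliers"]`.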

7. plot Function Parameters

Line style, linestyle (-, -., --, :)
Marker style, marker (v, ^, s, *, H, +, x, D, o, …)
Color, color (b, g, r, y, k, w, …)
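
A short sketch of how these three parameters are passed, either as keyword arguments or packed into a format string. The data and labels are made up, and the `Agg` backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
fig, ax = plt.subplots()

# Keyword form: red dashed line with circle markers
(line_a,) = ax.plot(x, x, linestyle="--", marker="o", color="r", label="y = x")

# Format-string form: green ("g") dotted (":") line with triangle ("^") markers
(line_b,) = ax.plot(x, x ** 2, "g:^", label="y = x^2")

ax.legend()
```

Call `plt.show()` (or display the figure in a notebook) to see the two styles side by side.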

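8. Figure Annotation Functions

Set the figure title: plt.title()
Set the x-axis label: plt.xlabel()
Set the y-axis label: plt.ylabel()
Set the x-axis range: plt.xlim()
Set the y-axis range: plt.ylim()
Set the x-axis ticks: plt.xticks()
Set the y-axis ticks: plt.yticks()
Set the legend for the curves: plt.legend()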
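
The functions listed above can be combined on a single figure, as in this minimal sketch. The sine curve, label text, and tick positions are arbitrary choices for illustration, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

plt.figure(figsize=(6, 3))
plt.plot(x, np.sin(x), label="sin(x)")  # label is what plt.legend() displays

plt.title("Sine curve")               # figure title
plt.xlabel("x (radians)")             # x-axis label
plt.ylabel("sin(x)")                  # y-axis label
plt.xlim(0, 2 * np.pi)                # x-axis range
plt.ylim(-1.5, 1.5)                   # y-axis range
plt.xticks([0, np.pi, 2 * np.pi], ["0", "pi", "2pi"])  # tick positions and labels
plt.yticks([-1, 0, 1])                # tick positions only
plt.legend(loc="upper right")         # legend built from the plot labels

ax = plt.gca()  # the Axes that the plt.* calls above acted on
```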

9. Matplotlib in Practice

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
# Jupyter notebook magic; omit when running as a plain script
%matplotlib inline

# Find the path of a Chinese-capable font on your own machine and update this path
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

# Survey subjects: systems of equations, functions, derivatives, calculus,
# linear algebra, probability theory, statistics
header_list = ['方程组', '函数', '导数', '微积分', '线性代数', '概率论', '统计学']
py3_df = pd.read_excel('py3.xlsx', header=None,
                       skiprows=[0, 1], names=header_list)

# Drop rows that contain NaN
py3_df = py3_df.dropna(axis=0)
print(py3_df)

# Custom mapping from each answer to a numeric level
map_dict = {
    '不会': 0,  # cannot do it
    '了解': 1,  # aware of it
    '熟悉': 2,  # familiar with it
    '使用过': 3,  # have used it
}
for header in header_list:
    py3_df[header] = py3_df[header].map(map_dict)

# Count, for each subject, how many respondents gave each answer
unable_series = (py3_df == 0).sum(axis=0)
know_series = (py3_df == 1).sum(axis=0)
familiar_series = (py3_df == 2).sum(axis=0)
use_series = (py3_df == 3).sum(axis=0)

# (counts, bar color, legend label) for each level, stacked from bottom to top
levels = [
    (unable_series, 'r', '不会'),
    (know_series, 'y', '了解'),
    (familiar_series, 'g', '熟悉'),
    (use_series, 'b', '使用过'),
]

for i, header in enumerate(header_list):
    bottom = 0
    for series, color, label in levels:
        # .iloc gives a positional lookup; integer keys on a string-labelled
        # Series are deprecated in recent pandas versions
        count = series.iloc[i]
        # Label only the first subject's bars so each level appears once in the legend
        plt.bar(x=header, height=count, width=0.60, color=color,
                bottom=bottom, label=label if i == 0 else '')
        if count != 0:
            plt.text(header, bottom, s=count,
                     ha='center', va='bottom', fontsize=15, color='white')
        bottom += count

plt.xticks(header_list, fontproperties=font)
plt.ylabel('人数', fontproperties=font)  # "number of people"
plt.title('Python3期数学摸底可视化', fontproperties=font)  # "Python class 3 math survey"
plt.legend(prop=font, loc='upper left')
plt.show()

    方程组   函数   导数        微积分       线性代数  概率论  统计学

0   使用过  使用过   不会         不会         不会   不会   不会

1   使用过  使用过   了解         不会         不会   不会   不会

2   使用过  使用过   熟悉         不会         不会   不会   不会

3    熟悉   熟悉   熟悉         了解         了解   了解   了解

4   使用过  使用过  使用过        使用过        使用过  使用过  使用过

5   使用过  使用过  使用过         不会         不会   不会   了解

6    熟悉   熟悉   熟悉         熟悉         熟悉   熟悉   不会

7   使用过  使用过  使用过        使用过        使用过  使用过  使用过

8    熟悉   熟悉   熟悉         熟悉         熟悉  使用过  使用过

9    熟悉   熟悉  使用过         不会        使用过  使用过   不会

10  使用过  使用过   熟悉         熟悉         熟悉   熟悉   熟悉

11  使用过  使用过  使用过        使用过        使用过   不会   不会

12  使用过  使用过  使用过        使用过        使用过  使用过  使用过

13  使用过  使用过   了解         不会         不会   不会   不会

14  使用过  使用过  使用过        使用过        使用过   不会   不会

15  使用过  使用过   熟悉         不会         不会   不会   不会

16   熟悉   熟悉  使用过        使用过        使用过   不会   不会

17  使用过  使用过  使用过         了解         不会   不会   不会

18  使用过  使用过  使用过        使用过         熟悉   熟悉   熟悉

19  使用过  使用过  使用过         了解         不会   不会   不会

20  使用过  使用过  使用过        使用过        使用过  使用过  使用过

21  使用过  使用过  使用过        使用过        使用过  使用过  使用过

22  使用过  很了解   熟悉  了解一点，不会运用  了解一点，不会运用   了解   不会

23  使用过  使用过  使用过        使用过         熟悉  使用过   熟悉

24   熟悉   熟悉   熟悉        使用过         不会   不会   不会

25  使用过  使用过  使用过        使用过        使用过  使用过  使用过

26  使用过  使用过  使用过        使用过        使用过   不会   不会

27  使用过  使用过   不会         不会         不会   不会   不会

28  使用过  使用过  使用过        使用过        使用过  使用过   了解

29  使用过  使用过  使用过        使用过        使用过   了解   不会

30  使用过  使用过  使用过        使用过        使用过   不会   不会

31  使用过  使用过  使用过        使用过         不会  使用过  使用过

32   熟悉   熟悉  使用过        使用过        使用过   不会   不会

33  使用过  使用过  使用过        使用过         熟悉  使用过   熟悉

34   熟悉   熟悉   熟悉        使用过        使用过   熟悉   不会

35  使用过  使用过  使用过        使用过        使用过  使用过  使用过

36  使用过  使用过  使用过        使用过        使用过  使用过   了解

37  使用过  使用过  使用过        使用过        使用过   不会   不会

38  使用过  使用过  使用过         不会         不会   不会   不会

39  使用过  使用过   不会         不会         不会   不会   不会

40  使用过  使用过  使用过        使用过        使用过   不会   不会

41  使用过  使用过   熟悉         了解         了解   了解   不会

42  使用过  使用过  使用过         不会         不会   不会   不会

43   熟悉  使用过   了解         了解         不会   不会   不会

...
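
Since `py3.xlsx` is not included here, the mapping and counting steps can be tried on a small in-memory table. This is a sketch with made-up rows; the `survey`, `coded`, and `unmapped_counts` names are hypothetical, while `map_dict` is the one from the code above. It also shows what happens to a free-text answer such as `'很了解'` (seen in row 22 of the printed data): `Series.map` turns any value missing from the dict into NaN, so that answer is not counted under any level.

```python
import pandas as pd

# Made-up survey rows; the last answer for '函数' is not a key of map_dict
survey = pd.DataFrame({
    '函数': ['使用过', '熟悉', '不会', '很了解'],
    '导数': ['不会', '不会', '了解', '熟悉'],
})

map_dict = {'不会': 0, '了解': 1, '熟悉': 2, '使用过': 3}

# Map every column; answers that are not in map_dict become NaN
coded = survey.apply(lambda col: col.map(map_dict))

# Comparing to a level gives a boolean table; summing down each column counts matches
unable_counts = (coded == 0).sum(axis=0)
unmapped_counts = coded.isna().sum(axis=0)

print(coded)
print(unable_counts)
print(unmapped_counts)
```

Checking `unmapped_counts` before plotting is one way to notice answers that silently dropped out of the totals.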
