Handling Missing Data in pandas

Handling missing values

pandas deals with two kinds of missing values:

None
np.nan (NaN)

import numpy as np
import pandas
from pandas import DataFrame

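1. None

None is built into Python; its type is NoneType, and pandas stores it as a generic Python object. As a result, None cannot take part in arithmetic: an expression such as None + 1 raises a TypeError.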

# Check the type of None
type(None)

NoneType
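To make the point about arithmetic concrete, here is a minimal sketch in plain Python (no pandas involved). The variable names are only illustrative.

```python
# None defines no arithmetic operators, so adding a number to it fails.
try:
    result = None + 1
except TypeError as exc:
    result = None
    error_message = str(exc)

# The message names the offending operand types.
print(error_message)
```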

2. np.nan (NaN)

np.nan is a float, so it can take part in arithmetic, but the result of any arithmetic operation involving it is NaN.

# Check the type of np.nan
type(np.nan)

float
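A short sketch of how NaN propagates through arithmetic, and of NumPy's NaN-aware reductions that skip it. The array contents are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan])

nan_plus_one = np.nan + 1          # arithmetic with NaN yields NaN
plain_sum = values.sum()           # the ordinary sum is NaN as well
nan_aware_sum = np.nansum(values)  # nansum ignores NaN entries

print(nan_plus_one, plain_sum, nan_aware_sum)

# NaN never compares equal to itself, so use np.isnan() to test for it.
print(np.nan == np.nan, np.isnan(np.nan))
```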

3. None and NaN in pandas

Create a DataFrame

df = DataFrame(data=np.random.randint(0,100,size=(10,8)))
df

	0	1	2	3	4	5	6	7
0	22	13	16	41	81	7	25	86
1	23	3	57	20	4	58	69	40
2	35	81	80	63	53	43	20	35
3	40	14	48	89	34	4	64	46
4	36	14	62	30	80	99	88	59
5	9	98	83	81	69	46	39	7
6	55	88	81	75	35	44	27	64
7	14	74	24	3	54	99	75	53
8	24	22	41	68	1	87	46	19
9	82	10	36	99	85	36	12	83

# Assign missing values to some elements; pandas displays both None and np.nan as NaN
df.iloc[1,4] = None
df.iloc[3,6] = None
df.iloc[7,7] = None
df.iloc[3,1] = None
df.iloc[5,5] = np.nan
df

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
1	23	3.0	57	20	NaN	58.0	69.0	40.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
3	40	NaN	48	89	34.0	4.0	NaN	46.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
5	9	98.0	83	81	69.0	NaN	39.0	7.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
7	14	74.0	24	3	54.0	99.0	75.0	NaN
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0
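The table above shows the affected columns switching from integers to floats. The sketch below reproduces that effect on small, made-up Series built directly from lists (rather than by in-place assignment): in numeric data pandas stores None as NaN, and because NaN is a float the column takes a float dtype.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])            # integer dtype, no missing values
with_none = pd.Series([1, None, 3])    # None is stored as NaN
with_nan = pd.Series([1, np.nan, 3])

print(ints.dtype)
print(with_none.dtype)
print(with_none.isnull().tolist())

# Both spellings of "missing" end up as the same float64 Series.
print(with_none.equals(with_nan))
```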

Operations on missing values in pandas

Detection functions

isnull()
notnull()

df.isnull()   # True where the value is missing
df.notnull()  # True where the value is present (this is the output shown below)

	0	1	2	3	4	5	6	7
0	True	True	True	True	True	True	True	True
1	True	True	True	True	False	True	True	True
2	True	True	True	True	True	True	True	True
3	True	False	True	True	True	True	False	True
4	True	True	True	True	True	True	True	True
5	True	True	True	True	True	False	True	True
6	True	True	True	True	True	True	True	True
7	True	True	True	True	True	True	True	False
8	True	True	True	True	True	True	True	True
9	True	True	True	True	True	True	True	True

Combining notnull()/isnull() with any()/all()

df.isnull()

	0	1	2	3	4	5	6	7
0	False	False	False	False	False	False	False	False
1	False	False	False	False	True	False	False	False
2	False	False	False	False	False	False	False	False
3	False	True	False	False	False	False	True	False
4	False	False	False	False	False	False	False	False
5	False	False	False	False	False	True	False	False
6	False	False	False	False	False	False	False	False
7	False	False	False	False	False	False	False	True
8	False	False	False	False	False	False	False	False
9	False	False	False	False	False	False	False	False

df.isnull().any(axis=1)  # any() is a logical OR; axis=1 evaluates each row, so a row is True if it contains at least one True

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
9    False
dtype: bool

df.notnull().all(axis=1) # all() is a logical AND; axis=1 evaluates each row, so a row is True only if every value in it is True

0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
8     True
9     True
dtype: bool

df.loc[~df.isnull().any(axis=1)] # ~ negates the mask, keeping only the rows with no missing values

The usual pairings are:

isnull() -> any()
notnull() -> all()
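The two pairings are complements of each other: a row has no missing value exactly when all of its values are present. A minimal sketch on a small, hypothetical DataFrame named demo:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

rows_with_missing = demo.isnull().any(axis=1)   # at least one NaN in the row
complete_rows = demo.notnull().all(axis=1)      # every value in the row present

# Negating one mask gives the other.
print((~rows_with_missing).equals(complete_rows))

# Either mask can be used to keep only the complete rows.
filtered = demo.loc[complete_rows]
print(filtered)
```

Filtering with complete_rows selects the same rows that demo.dropna() would keep.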

df.dropna() drops the rows or columns that contain missing values (rows by default): axis=0 drops rows, axis=1 drops columns.

df.dropna(axis=0)

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0
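Besides axis, dropna() accepts how and thresh, which control how strict the filter is. The sketch below uses a small, hypothetical DataFrame named demo in which one column is entirely missing, a case where the default settings would discard every row.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0],
})

rows_any = demo.dropna()                     # every row has a NaN in "b", so nothing is left
cols_any = demo.dropna(axis=1)               # drop columns containing any NaN
cols_all = demo.dropna(axis=1, how="all")    # drop only columns that are entirely NaN
rows_thresh = demo.dropna(thresh=2)          # keep rows with at least 2 non-missing values

print(len(rows_any))
print(cols_any.columns.tolist())
print(cols_all.columns.tolist())
print(rows_thresh.index.tolist())
```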

Fill functions (Series/DataFrame)

fillna(): the value and method parameters

df

	0	1	2	3	4	5	6	7
0	22	13.0	16	41	81.0	7.0	25.0	86.0
1	23	3.0	57	20	NaN	58.0	69.0	40.0
2	35	81.0	80	63	53.0	43.0	20.0	35.0
3	40	NaN	48	89	34.0	4.0	NaN	46.0
4	36	14.0	62	30	80.0	99.0	88.0	59.0
5	9	98.0	83	81	69.0	NaN	39.0	7.0
6	55	88.0	81	75	35.0	44.0	27.0	64.0
7	14	74.0	24	3	54.0	99.0	75.0	NaN
8	24	22.0	41	68	1.0	87.0	46.0	19.0
9	82	10.0	36	99	85.0	36.0	12.0	83.0

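# bfill fills a gap with the next valid value; ffill fills it with the previous valid value
# axis sets the fill direction: 0 fills along columns (up/down), 1 fills along rows (left/right)
# DataFrame.bfill()/ffill() do the same as fillna(method=...), which is deprecated in recent pandas releases
df_test = df.bfill(axis=1).ffill(axis=1)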
df_test

	0	1	2	3	4	5	6	7
0	22.0	13.0	16.0	41.0	81.0	7.0	25.0	86.0
1	23.0	3.0	57.0	20.0	58.0	58.0	69.0	40.0
2	35.0	81.0	80.0	63.0	53.0	43.0	20.0	35.0
3	40.0	48.0	48.0	89.0	34.0	4.0	46.0	46.0
4	36.0	14.0	62.0	30.0	80.0	99.0	88.0	59.0
5	9.0	98.0	83.0	81.0	69.0	39.0	39.0	7.0
6	55.0	88.0	81.0	75.0	35.0	44.0	27.0	64.0
7	14.0	74.0	24.0	3.0	54.0	99.0	75.0	75.0
8	24.0	22.0	41.0	68.0	1.0	87.0	46.0	19.0
9	82.0	10.0	36.0	99.0	85.0	36.0	12.0	83.0
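Why chain a backward fill and a forward fill? A backward fill alone cannot fill a gap in the last column, since nothing follows it, and a forward fill alone cannot fill a gap in the first column. The sketch below shows this on a single made-up row.

```python
import numpy as np
import pandas as pd

row = pd.DataFrame([[np.nan, 1.0, np.nan, 3.0, np.nan]])

back = row.bfill(axis=1)                  # the trailing gap stays NaN
forward = row.ffill(axis=1)               # the leading gap stays NaN
both = row.bfill(axis=1).ffill(axis=1)    # chaining the two fills every gap

print(back.iloc[0].tolist())
print(forward.iloc[0].tolist())
print(both.iloc[0].tolist())
```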

# Check which columns of df_test still contain missing values
df_test.isnull().any(axis=0)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
dtype: bool
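The value parameter of fillna() is mentioned above but not demonstrated. As a rough sketch on a small, hypothetical DataFrame named demo, it accepts a scalar, a per-column mapping, or a Series such as the column means:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 4.0, 8.0],
})

filled_zero = demo.fillna(0)                        # one scalar for every gap
filled_dict = demo.fillna({"a": -1.0, "b": -2.0})   # a different value per column
filled_mean = demo.fillna(demo.mean())              # each column's mean (NaN is skipped)

print(filled_zero)
print(filled_dict)
print(filled_mean)
```

Which fill value is appropriate depends on the data; the mean is only one common choice.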