Data Analysis 06: Five pandas Visualization Projects

pandas Visualization

Basic Plotting

Visualizing a Series

A Series provides a plot method that uses the index as x and the values as y:

ts = pd.Series(np.random.randn(1000),

               index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot()
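As a minimal sketch of what the call above does, the snippet below plots a small seeded Series and reads the drawn line back from the returned Axes. The fixed seed, the shorter length and the non-interactive Agg backend are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
ts = pd.Series(rng.standard_normal(100),
               index=pd.date_range("2000-01-01", periods=100)).cumsum()

ax = ts.plot()                 # index goes on the x axis, values on the y axis
line = ax.get_lines()[0]       # the single line drawn for the Series
y_plotted = np.asarray(line.get_ydata(), dtype=float)
```

The y data of the plotted line are the values of the Series, which is a quick way to confirm what ended up on the chart.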

Visualizing a DataFrame

A DataFrame provides a plot method that lets you choose one column as x and another column (or several) as y:

df3 = pd.DataFrame(np.random.randn(1000, 2),

                   columns=['B', 'C']).cumsum()

df3['A'] = np.arange(len(df3))

df3.plot(x='A', y='B')
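The sketch below illustrates the same idea with a list passed to y, so that two columns share one x column. The seed and the Agg backend are assumptions made for a reproducible, display-free run.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility
df = pd.DataFrame(rng.standard_normal((50, 2)), columns=["B", "C"]).cumsum()
df["A"] = np.arange(len(df))

# One line is drawn per y column; column A supplies the x values for both
ax = df.plot(x="A", y=["B", "C"])
labels = [line.get_label() for line in ax.get_lines()]
```

Each line is labelled with its column name, which is what the legend displays.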

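Advanced Plotting

The plot() method accepts a kind keyword argument that selects the chart type, including:

Type	Description
`bar` or `barh`	Bar chart (vertical or horizontal)
`hist`	Histogram
`box`	Box plot
`scatter`	Scatter plot
`pie`	Pie chart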
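The kind argument and the plot accessor methods are two spellings of the same request. The sketch below draws the same bar chart both ways; the three sample values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([3, 1, 2], index=["a", "b", "c"])  # illustrative values

ax1 = s.plot(kind="bar")          # chart type selected through the kind keyword
heights1 = [p.get_height() for p in ax1.patches]

fig, ax2 = plt.subplots()
s.plot.bar(ax=ax2)                # same chart through the accessor method
heights2 = [p.get_height() for p in ax2.patches]
```

Both calls draw one rectangle per value, so either form can be used according to taste.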

Code Summary

pandas Visualization

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

Basic plotting

# Visualizing a Series

s = pd.Series(np.random.normal(100, 10, 10),

              index=pd.date_range('2020-01-01', periods=10))

s.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf39235be0>

ts = pd.Series(np.random.randn(1000),

               index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf392a52b0>

# Visualizing a DataFrame

data = np.random.normal(0, 1, (10, 2))  # random values with shape (10, 2)

df = pd.DataFrame(data, columns=['A', 'B'])

df['C'] = np.arange(10)

df

	A	B	C
0	-1.715010	-1.105532	0
1	-0.059422	-0.444824	1
2	-0.621798	0.653777	2
3	-2.577156	-0.406837	3
4	-2.208147	-0.188947	4
5	-0.120376	-1.299448	5
6	-0.609514	0.611829	6
7	-0.509499	0.682336	7
8	0.873368	-1.808792	8
9	-0.598329	0.618860	9

df.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf3932b080>

df.plot(x='C', y=['A', 'B'])

<matplotlib.axes._subplots.AxesSubplot at 0x1bf393ca2b0>

Advanced plotting with pandas

# Visualizing a Series

s = pd.Series(np.random.normal(100, 10, 10),

              index=pd.date_range('2020-01-01', periods=10))

s.plot.barh(color='dodgerblue')

<matplotlib.axes._subplots.AxesSubplot at 0x1bf394417b8>

data = np.random.normal(80, 3, (10, 2))  # random values with shape (10, 2)

df = pd.DataFrame(data, columns=['A', 'B'])

df.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf396ddc50>

Histogram with pandas

s.plot.hist(bins=20)

<matplotlib.axes._subplots.AxesSubplot at 0x1bf397e5f98>

Scatter plot with pandas

df.plot.scatter(x='A', y='B', s=80, c='A', cmap='jet')

<matplotlib.axes._subplots.AxesSubplot at 0x1bf39ab08d0>

Pie chart with pandas

values = [15, 13.3, 8.5, 7.3, 4.62, 51.28]

labels = ['Java', 'C', 'Python', 'C++', 'VB', 'Other']

s = pd.Series(values, index=labels)

s

Java      15.00

C         13.30

Python     8.50

C++        7.30

VB         4.62

Other     51.28

dtype: float64

s.plot.pie(figsize=(6,6), startangle=90, shadow=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1bf3aefa898>

df = pd.DataFrame(s, columns=['A'])

df['B'] = [14.1, 3, 18.2, 8, 2, 30.2]

df.plot.pie(subplots=True, figsize=(8,4), layout=(1,2))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BF3CAEA390>,

        <matplotlib.axes._subplots.AxesSubplot object at 0x000001BF3CB6DF28>]],

      dtype=object)

Box plot

data = pd.read_csv('../data/学生考试表现数据/StudentsPerformance.csv')

ms = data['math score']

ms.plot.box()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf3cc080b8>

df = data[['math score', 'writing score', 'reading score']]

df.plot.box()

<matplotlib.axes._subplots.AxesSubplot at 0x1bf3cc6f5c0>

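Project resources:

Available for download from my resource files.
Download link: https://download.csdn.net/download/yegeli/12562286

Project 1: Analyzing the Factors That Affect Student Performance

Analysis of factors affecting student performance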

import numpy as np

import pandas as pd

data = pd.read_csv('StudentsPerformance.csv')

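# Sum only the numeric score columns; recent pandas versions no longer drop the string columns automatically

data['total score'] = data.sum(axis=1, numeric_only=True)

# include=['number', 'object'] summarizes both the numeric columns and the string columns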

data.describe(include=['number', 'object'])

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score	total score
count	1000	1000	1000	1000	1000	1000.00000	1000.000000	1000.000000	1000.000000
unique	2	5	6	2	2	NaN	NaN	NaN	NaN
top	female	group C	some college	standard	none	NaN	NaN	NaN	NaN
freq	518	319	226	645	642	NaN	NaN	NaN	NaN
mean	NaN	NaN	NaN	NaN	NaN	66.08900	69.169000	68.054000	203.312000
std	NaN	NaN	NaN	NaN	NaN	15.16308	14.600192	15.195657	42.771978
min	NaN	NaN	NaN	NaN	NaN	0.00000	17.000000	10.000000	27.000000
25%	NaN	NaN	NaN	NaN	NaN	57.00000	59.000000	57.750000	175.000000
50%	NaN	NaN	NaN	NaN	NaN	66.00000	70.000000	69.000000	205.000000
75%	NaN	NaN	NaN	NaN	NaN	77.00000	79.000000	79.000000	233.000000
max	NaN	NaN	NaN	NaN	NaN	100.00000	100.000000	100.000000	300.000000
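A minimal sketch of the row-wise total on a three-row stand-in for the score table follows. The rows are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Tiny stand-in for StudentsPerformance.csv (values are illustrative)
scores = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# numeric_only=True restricts the row sum to the three numeric score columns
scores["total score"] = scores.sum(axis=1, numeric_only=True)
```

The string column is ignored, so each total is simply the sum of the three scores in that row.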

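# Effect of gender on scores (grouped by gender).
# On recent pandas versions the string columns may need to be excluded, for example by passing values=[...] with the score columns.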

r = data.pivot_table(index='gender')

r

	math score	reading score	total score	writing score
gender
female	63.633205	72.608108	208.708494	72.467181
male	68.728216	65.473029	197.512448	63.311203
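With its default aggregation, pivot_table computes a per-group mean, which is the same result as a groupby followed by mean. The sketch below checks this on four made-up rows; passing values explicitly is a choice made here so that only numeric columns are averaged.

```python
import pandas as pd

# Illustrative rows; only the column names follow the dataset
scores = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [60, 70, 80, 90],
    "reading score": [75, 65, 85, 55],
})

cols = ["math score", "reading score"]
by_pivot = scores.pivot_table(index="gender", values=cols, aggfunc="mean")
by_group = scores.groupby("gender")[cols].mean()  # equivalent formulation
```

Either form yields one row per gender and one mean per score column.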

# Visualization

r.T.plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x24839422d00>

r.T.plot.pie(subplots=True,figsize=(12,3))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B50FFA0>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B56EE80>],

      dtype=object)

Overall, female students score higher on average, but male students do better in math.

# Effect of race/ethnicity on scores

r = data.pivot_table(index='race/ethnicity')

r

	math score	reading score	total score	writing score
race/ethnicity
group A	61.629213	64.674157	188.977528	62.674157
group B	63.452632	67.352632	196.405263	65.600000
group C	64.463950	69.103448	201.394984	67.827586
group D	67.362595	70.030534	207.538168	70.145038
group E	73.821429	73.028571	218.257143	71.407143

Race/ethnicity groups ranked by score (highest to lowest): E - D - C - B - A

r.T.plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x2483b5f6370>

# Effect of parental level of education on scores

r = data.pivot_table(index='parental level of education')

r.sort_values(by='total score')

	math score	reading score	total score	writing score
parental level of education
high school	62.137755	64.704082	189.290816	62.448980
some high school	63.497207	66.938547	195.324022	64.888268
some college	67.128319	69.460177	205.429204	68.840708
associate's degree	67.882883	70.927928	208.707207	69.896396
bachelor's degree	69.389831	73.000000	215.771186	73.381356
master's degree	69.745763	75.372881	220.796610	75.677966

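# Visualization

r.T.plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x2483b672c70>

In general, the higher the parents' level of education, the better the scores (the one exception in the table is that "some high school" averages slightly above "high school").

# Effect of lunch type on scores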

r = data.pivot_table(index='lunch')

r.sort_values(by='total score', ascending=False)

	math score	reading score	total score	writing score
lunch
standard	70.034109	71.654264	212.511628	70.823256
free/reduced	58.921127	64.653521	186.597183	63.022535

# Visualization

r.plot.pie(subplots=True,figsize=(12,3))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B738310>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B75CFA0>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B7891C0>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B7B71F0>],

      dtype=object)

# Effect of the test preparation course on scores

r = data.pivot_table(index='test preparation course')

r.sort_values(by='total score', ascending=False)

	math score	reading score	total score	writing score
test preparation course
completed	69.695531	73.893855	218.008380	74.418994
none	64.077882	66.534268	195.116822	64.504673

# Visualization

r.plot.pie(subplots=True,figsize=(12,3))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B57B280>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B768DC0>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B67F9D0>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B8A43D0>],

      dtype=object)

r = data.pivot_table(index=['gender','test preparation course'])

r

		math score	reading score	total score	writing score
gender	test preparation course
female	completed	67.195652	77.375000	223.364130	78.793478
female	none	61.670659	69.982036	200.634731	68.982036
male	completed	72.339080	70.212644	212.344828	69.793103
male	none	66.688312	62.795455	189.133117	59.649351

Comparing the top 100 and the bottom 100 students

r = data.sort_values(by='total score', ascending=False)

top100 = r.head(100)

tail100 = r.tail(100)

r1 = pd.DataFrame({'top100':top100['gender'].value_counts(),

              'tail100':tail100['gender'].value_counts()})

r1

	top100	tail100
female	66	38
male	34	62
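Building a DataFrame from two value_counts results aligns them on the category labels. The sketch below shows what happens when one group lacks a category; the six rows and the group size of three are made up for illustration.

```python
import pandas as pd

# Illustrative data: the top three are all female
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "F"],
    "total score": [290, 280, 270, 150, 140, 130],
})

ranked = df.sort_values(by="total score", ascending=False)
top3, bottom3 = ranked.head(3), ranked.tail(3)

counts = pd.DataFrame({
    "top3": top3["gender"].value_counts(),
    "bottom3": bottom3["gender"].value_counts(),
})
# A label missing from one group becomes NaN after alignment; treat it as zero
counts = counts.fillna(0).astype(int)
```

With the real data both genders appear in both groups, so no gap arises, but the fillna step keeps the table usable when a category is absent.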

r1.plot.pie(subplots=True,figsize=(8,4))

array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B93EA30>,

       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B963BE0>],

      dtype=object)

# Use a new name so that the DataFrame data is not overwritten

edu_counts = data['parental level of education'].value_counts()

edu_counts.plot.pie(figsize=(6,6))

<matplotlib.axes._subplots.AxesSubplot at 0x2483c98cf70>

edu_counts.plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x2483c9daf70>

r2 = pd.DataFrame({'top100':top100['parental level of education'].value_counts(),

              'tail100':tail100['parental level of education'].value_counts()})

r2

	top100	tail100
associate's degree	29	17
bachelor's degree	20	8
high school	6	32
master's degree	15	1
some college	21	14
some high school	9	28

r2.plot.barh()

<matplotlib.axes._subplots.AxesSubplot at 0x2483ca2e550>

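Summary:

  Overall, female students score higher on average, but male students do better in math.

  By race/ethnicity group, from highest to lowest scores: E - D - C - B - A

  In general, the higher the parents' level of education, the better the scores.

  Suggestions for improving student scores:

  Every student should have a proper lunch.

  Every student should complete the test preparation course if possible.

Project 2: Titanic Survival Data Analysis and Visualization

Kaggle case study: Titanic survival prediction

Viewing the data

Loading the data with pandas

import pandas as pd  # data analysis

import numpy as np  # numerical computing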

data_train=pd.read_csv('train.csv')

data_train.head()

data_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',

       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],

      dtype='object')

data_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

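The dataset has the following fields:

PassengerId => passenger ID

Survived => whether the passenger survived (1 = survived, 0 = did not survive)

Pclass => ticket class (1st/2nd/3rd)

Name => passenger name

Sex => sex

Age => age

SibSp => number of siblings/spouses aboard

Parch => number of parents/children aboard

Ticket => ticket number

Fare => fare

Cabin => cabin number

Embarked => port of embarkation

Basic descriptive analysis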

data_train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId    891 non-null int64

Survived       891 non-null int64

Pclass         891 non-null int64

Name           891 non-null object

Sex            891 non-null object

Age            714 non-null float64

SibSp          891 non-null int64

Parch          891 non-null int64

Ticket         891 non-null object

Fare           891 non-null float64

Cabin          204 non-null object

Embarked       889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.7+ KB

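The training data contains 891 passengers in total, but some attributes are incomplete. For example:

Age is recorded for only 714 passengers
Cabin is known for only 204 passengers

To look at the actual values, the following method summarizes the distribution of the numeric columns: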

data_train.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

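The mean row tells us that about 38.4% of the passengers survived (the mean of Survived is 0.383838) and that the average passenger age is about 29.7.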
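The mean of a 0/1 column equals the share of ones, which is why the mean of Survived can be read as a survival rate (0.383838 × 891 ≈ 342 passengers). A minimal sketch on eight made-up labels:

```python
import pandas as pd

# Illustrative labels: 3 survivors out of 8 passengers
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0])

rate = survived.mean()                           # share of ones
share = survived.value_counts(normalize=True)    # same information per class
```

value_counts with normalize=True gives the proportion for each class, so the entry for 1 matches the mean.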

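Exploring the data through visualization

Survival counts

import matplotlib as mpl

import matplotlib.pyplot as plt

# Configure a font that can render CJK characters (only needed when plot labels contain Chinese text)

mpl.rcParams['font.sans-serif']=['SimHei']

# Keep the minus sign rendering correctly with a non-default font

mpl.rcParams['axes.unicode_minus']= False

data_train.Survived.value_counts().plot(kind='bar')

plt.title('Survival (1 = survived)')

plt.ylabel('Number of passengers')

plt.legend()

plt.show()

Passenger class distribution

data_train.Pclass.value_counts().plot(kind='bar')

plt.ylabel('Number of passengers')

plt.xlabel('Passenger class')

plt.title('Passenger class distribution')

print(data_train.Pclass.value_counts())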

3    491

1    216

2    184

Name: Pclass, dtype: int64

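Survival by age

data_train['Age'].plot.kde()

<matplotlib.axes._subplots.AxesSubplot at 0x1da647b4358>

plt.scatter(data_train.Survived,data_train.Age)

plt.ylabel('Age')

plt.grid(axis='y')

plt.title('Survival by age (1 = survived)')

plt.show()

Age distribution by passenger class

# Kernel density estimate of age for each passenger class

data_train.Age[data_train.Pclass == 1].plot(kind='kde')

data_train.Age[data_train.Pclass == 2].plot(kind='kde')

data_train.Age[data_train.Pclass == 3].plot(kind='kde')

plt.xlabel('Age')

plt.ylabel('Density')

plt.title('Age distribution by passenger class')

plt.legend(('1st class','2nd class','3rd class'))

plt.show()

Passengers by port of embarkation

data_train.Embarked.value_counts().plot(kind='bar')

plt.title('Passengers by port of embarkation')

plt.ylabel('Number of passengers')

plt.show()

From these plots we can see that:

A little over 300 passengers survived, fewer than half;
3rd-class passengers are by far the most numerous; both victims and survivors span a wide range of ages;
The age distributions of the three classes have a broadly similar shape: in 2nd and 3rd class, passengers in their early twenties are the most common, while in 1st class passengers around 40 are the most common;
Boarding counts decrease in the order S, C, Q, and S far exceeds the other two ports.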
Visualizing each attribute against survival

Survival by passenger class

# Survival counts for each passenger class

Survived_1=data_train.Pclass[data_train.Survived==1].value_counts()

# Survived_1

Survived_0=data_train.Pclass[data_train.Survived==0].value_counts()

df=pd.DataFrame({'Survived':Survived_1,'Did not survive':Survived_0})

df.plot(kind='bar')

plt.title('Survival by passenger class')

plt.xlabel('Passenger class')

plt.ylabel('Number of passengers')

# plt.legend()

plt.show()

Survival by port of embarkation

# Survival counts for each port of embarkation

Survived_1=data_train.Embarked[data_train.Survived==1].value_counts()

# Survived_1

Survived_0=data_train.Embarked[data_train.Survived==0].value_counts()

df=pd.DataFrame({'Survived':Survived_1,'Did not survive':Survived_0})

df.plot(kind='bar')

plt.title('Survival by port of embarkation')

plt.xlabel('Port of embarkation')

plt.ylabel('Number of passengers')

# plt.legend()

plt.show()

Survival by sex

# Survival counts for each sex

Survived_m=data_train.Survived[data_train.Sex=='male'].value_counts()

# Survived_m

Survived_f=data_train.Survived[data_train.Sex=='female'].value_counts()

df=pd.DataFrame({'Male':Survived_m,'Female':Survived_f})

df.plot(kind='bar')

plt.title('Survival by sex')

plt.xlabel('Survived')

plt.ylabel('Number of passengers')

plt.show()

More women than men survived.

SibSp and Parch versus survival

# Number of siblings/spouses aboard (SibSp)

data_train.pivot_table(index=['SibSp', 'Survived'], values='PassengerId', aggfunc='count')

		PassengerId
SibSp	Survived
0	0	398
0	1	210
1	0	97
1	1	112
2	0	15
2	1	13
3	0	12
3	1	4
4	0	15
4	1	3
5	0	5
8	0	7

# Number of parents/children aboard (Parch)

data_train.pivot_table(index=['Parch', 'Survived'], values='PassengerId', aggfunc='count')

		PassengerId
Parch	Survived
0	0	445
0	1	233
1	0	53
1	1	65
2	0	40
2	1	40
3	0	2
3	1	3
4	0	4
5	0	4
5	1	1
6	0	1

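Ticket is the ticket number. It is essentially an identifier with little direct bearing on the outcome, so it is not used as a feature.

Cabin has a value for only 204 passengers, so let's first look at its distribution.

# Distribution of the recorded Cabin values

data_train.Cabin.value_counts()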

G6             4

C23 C25 C27    4

B96 B98        4

F2             3

E101           3

              ..

D30            1

A7             1

D47            1

E31            1

C99            1

Name: Cabin, Length: 147, dtype: int64

# Compare the survival distribution of passengers with and without a Cabin value

survival_cabin=data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()

survival_cabin

survival_nocabin=data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()

df=pd.DataFrame({'Cabin recorded':survival_cabin,'No Cabin':survival_nocabin})

df.plot(kind='bar')

plt.title('Survival by presence of a Cabin value')

plt.xlabel('Survived')

plt.ylabel('Number of passengers')

plt.show()

Passengers with a Cabin record appear to have a somewhat higher survival rate.
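Because the two groups differ greatly in size (204 passengers with a Cabin value versus 687 without), comparing rates is easier than comparing raw counts. The sketch below uses pd.crosstab with row normalization on eight made-up rows; it illustrates the technique and does not reproduce the Titanic figures.

```python
import pandas as pd

# Illustrative rows only; None marks a missing Cabin value
df = pd.DataFrame({
    "Cabin": ["C85", None, None, "C123", None, "E46", None, None],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 0],
})

has_cabin = df["Cabin"].notnull().map({True: "Yes", False: "No"})

# normalize="index" turns each row of counts into proportions that sum to 1
rates = pd.crosstab(has_cabin, df["Survived"], normalize="index")
```

The column for 1 in rates is then the survival rate within each group, which can be compared directly.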

Data preprocessing

# Print the first few rows

data_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Handling missing values

# Inspect the data

data_train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId    891 non-null int64

Survived       891 non-null int64

Pclass         891 non-null int64

Name           891 non-null object

Sex            891 non-null object

Age            714 non-null float64

SibSp          891 non-null int64

Parch          891 non-null int64

Ticket         891 non-null object

Fare           891 non-null float64

Cabin          204 non-null object

Embarked       889 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.7+ KB

# Fill missing Age values with the mean age

data_train['Age']=data_train['Age'].fillna(data_train['Age'].mean())

# Recode Cabin as 'Yes' or 'No' depending on whether a value is recorded

def set_cabin(df):

    df.loc[(df.Cabin.notnull()),'Cabin']='Yes'

    df.loc[(df.Cabin.isnull()),'Cabin']='No'

    return df

data_train=set_cabin(data_train)

# Fill the two missing Embarked values with 'S', the most common port

data_train['Embarked']=data_train['Embarked'].fillna('S')

# Inspect the data

data_train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	No	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	Yes	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	No	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	Yes	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	No	S
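The two fill strategies used above, the column mean for a numeric field and the most frequent value for a categorical one, can be sketched on a four-row frame. The rows are made up, and looking the most frequent port up with mode() is a variation on the hard-coded 'S' in the original code.

```python
import pandas as pd

# Illustrative rows with one missing Age and one missing Embarked value
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 36.0],
    "Embarked": ["S", None, "C", "S"],
})

# Numeric column: replace missing entries with the mean of the observed ones
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Categorical column: replace missing entries with the most frequent value
most_common_port = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port)
```

Computing the fill value from the data avoids hard-coding a constant that might not match another dataset.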

data_train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 12 columns):

PassengerId    891 non-null int64

Survived       891 non-null int64

Pclass         891 non-null int64

Name           891 non-null object

Sex            891 non-null object

Age            891 non-null float64

SibSp          891 non-null int64

Parch          891 non-null int64

Ticket         891 non-null object

Fare           891 non-null float64

Cabin          891 non-null object

Embarked       891 non-null object

dtypes: float64(2), int64(5), object(5)

memory usage: 83.7+ KB

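One-hot encoding

Logistic regression needs numeric input features, so categorical features are usually factorized (one-hot encoded) first.

What is factorization / one-hot encoding? Here is an example.

Take Embarked: it is a single attribute whose values can be ['S', 'C', 'Q']. One-hot encoding expands it into the three attributes 'Embarked_C', 'Embarked_S' and 'Embarked_Q'.

A passenger whose Embarked value is S gets 1 under 'Embarked_S' and 0 under 'Embarked_C' and 'Embarked_Q'.
A passenger whose Embarked value is C gets 1 under 'Embarked_C' and 0 under 'Embarked_S' and 'Embarked_Q'.
A passenger whose Embarked value is Q gets 1 under 'Embarked_Q' and 0 under 'Embarked_C' and 'Embarked_S'.
We use pandas' get_dummies to do this and concatenate the result onto the original data_train.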
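The expansion described above can be sketched on four made-up values. Passing dtype=int is a choice made here to get 0/1 columns, since recent pandas versions return booleans by default.

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")  # illustrative values

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(embarked, prefix="Embarked", dtype=int)
```

Each row has exactly one 1, in the column that matches its original value.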

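# Logistic regression needs numeric input features,

# so the categorical features are one-hot encoded first.

# Take Cabin: it is a single attribute whose values can be ['Yes', 'No'], expanded into the two attributes 'Cabin_Yes' and 'Cabin_No'.

# A passenger whose Cabin value is 'Yes' gets 1 under 'Cabin_Yes' and 0 under 'Cabin_No'.

# A passenger whose Cabin value is 'No' gets 0 under 'Cabin_Yes' and 1 under 'Cabin_No'.

# pandas' get_dummies does this; the result is concatenated onto data_train below.

# Columns to encode: Cabin, Embarked, Pclass, Sex

dummies_Cabin=pd.get_dummies(data_train['Cabin'],prefix='Cabin')

# Use the 'Embarked' prefix here. The outputs shown below were produced with prefix='Cabin', which is why they list these columns as Cabin_C, Cabin_Q and Cabin_S.

dummies_Embarked=pd.get_dummies(data_train['Embarked'],prefix='Embarked')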

dummies_Pclass=pd.get_dummies(data_train['Pclass'],prefix='Pclass')

dummies_Sex=pd.get_dummies(data_train['Sex'],prefix='Sex')

df=pd.concat([data_train,dummies_Cabin,dummies_Embarked,dummies_Pclass,dummies_Sex],axis=1)

df.drop(['Pclass','Name','Sex','Ticket','Cabin','Embarked'],axis=1,inplace=True)

df.head()

	PassengerId	Survived	Age	SibSp	Fare	Cabin_No	Cabin_Yes	Cabin_C	Cabin_S	Pclass_1	Pclass_3	Sex_female	Sex_male
0	1	0	22.0	1	7.2500	1	0	0	1	0	1	0	1
1	2	1	38.0	1	71.2833	0	1	1	0	1	0	1	0
2	3	1	26.0	0	7.9250	1	0	0	1	0	1	1	0
3	4	1	35.0	1	53.1000	0	1	0	1	1	0	1	0
4	5	0	35.0	0	8.0500	1	0	0	1	0	1	0	1

df.describe()

	PassengerId	Survived	Age	SibSp	Parch	Fare	Cabin_No	Cabin_Yes	Cabin_C	Cabin_Q	Cabin_S	Pclass_1	Pclass_2	Pclass_3	Sex_female	Sex_male
count	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	29.699118	0.523008	0.381594	32.204208	0.771044	0.228956	0.188552	0.086420	0.725028	0.242424	0.206510	0.551066	0.352413	0.647587
std	257.353842	0.486592	13.002015	1.102743	0.806057	49.693429	0.420397	0.420397	0.391372	0.281141	0.446751	0.428790	0.405028	0.497665	0.477990	0.477990
min	1.000000	0.000000	0.420000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	22.000000	0.000000	0.000000	7.910400	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	446.000000	0.000000	29.699118	0.000000	0.000000	14.454200	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	0.000000	1.000000
75%	668.500000	1.000000	35.000000	1.000000	0.000000	31.000000	1.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	1.000000	1.000000	1.000000
max	891.000000	1.000000	80.000000	8.000000	6.000000	512.329200	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

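Standardization

Some further processing is needed: Age and Fare vary over a much wider range than the other features, so we standardize them by subtracting the mean and dividing by the standard deviation.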

a=df.Age

df['Age_scaled']=(a-a.mean())/(a.std())

df=df.drop('Age',axis=1)

b=df.Fare

df['Fare_scaled']=(b-b.mean())/(b.std())

df=df.drop('Fare',axis=1)

df.head()

	PassengerId	Survived	SibSp	Cabin_No	Cabin_Yes	Cabin_C	Cabin_S	Pclass_1	Pclass_3	Sex_female	Sex_male	Age_scaled	Fare_scaled
0	1	0	1	1	0	0	1	0	1	0	1	-0.592148	-0.502163
1	2	1	1	0	1	1	0	1	0	1	0	0.638430	0.786404
2	3	1	0	1	0	0	1	0	1	1	0	-0.284503	-0.488580
3	4	1	1	0	1	0	1	1	0	1	0	0.407697	0.420494
4	5	0	0	1	0	0	1	0	1	0	1	0.407697	-0.486064
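The standardization used above is the z-score, z = (x − mean) / std. A minimal sketch on three made-up ages; note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

age = pd.Series([20.0, 30.0, 40.0])  # illustrative values

# mean = 30, sample standard deviation = 10
age_scaled = (age - age.mean()) / age.std()
```

After scaling, the column has mean 0 and sample standard deviation 1, so Age and Fare end up on comparable scales.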

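Modeling: logistic regression

We select the feature columns we need, convert them to a NumPy array, and fit scikit-learn's LogisticRegression.

# Select the feature columns, convert them to a NumPy array, and fit scikit-learn's LogisticRegression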

from sklearn import linear_model

train_df=df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')

train_np=train_df.values

# y is the Survived label

y=train_np[:,0]

# X holds the feature values

X=train_np[:,1:]

# Fit a LogisticRegression model

clf=linear_model.LogisticRegression(penalty='l2')

clf.fit(X,y)

# Accuracy on the training set

print(clf.score(X,y))

clf

0.8125701459034792

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,

          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,

          verbose=0, warm_start=False)

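penalty: the penalty term, a str, either l1 or l2, default l2. It specifies the norm used in the penalty. The newton-cg, sag and lbfgs solvers support only the L2 norm. L1 regularization corresponds to assuming a Laplace prior on the model parameters and L2 to a Gaussian prior; the penalty constrains the parameters so that the model is less prone to overfitting.

tol: the stopping tolerance, a float, default 1e-4. The solver stops once it has converged to within this tolerance and treats the current solution as optimal.

C: the inverse of the regularization strength λ, a float, default 1.0. It must be a positive float. As in SVMs, smaller values mean stronger regularization.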
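As a reading aid, one common way to write the L2-penalized objective for binary logistic regression is shown below. The notation is an assumption and does not come from this article: labels y_i ∈ {−1, 1}, feature vectors x_i, weights w and intercept c.

```latex
\min_{w,\,c} \; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + c\right)\right)\right)
```

Because C multiplies the data-fit term, a smaller C gives the penalty term relatively more weight, which is why smaller values mean stronger regularization.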

Next, we apply the same preprocessing to the test set.

# Read the test set

data_test=pd.read_csv('test.csv')

data_test.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

# Describe the data

data_test.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 418 entries, 0 to 417

Data columns (total 11 columns):

PassengerId    418 non-null int64

Pclass         418 non-null int64

Name           418 non-null object

Sex            418 non-null object

Age            332 non-null float64

SibSp          418 non-null int64

Parch          418 non-null int64

Ticket         418 non-null object

Fare           417 non-null float64

Cabin          91 non-null object

Embarked       418 non-null object

dtypes: float64(2), int64(4), object(5)

memory usage: 36.0+ KB

# Apply the same feature transformations to data_test as were applied to data_train

# Fill the missing Fare value with 0

data_test.loc[(data_test.Fare.isnull()),'Fare']=0

# Fill missing Age values with the mean age

data_test['Age']=data_test['Age'].fillna(data_test['Age'].mean())

# Recode Cabin as 'Yes' or 'No' depending on whether a value is recorded

def set_cabin(df):

    df.loc[(df.Cabin.notnull()),'Cabin']='Yes'

    df.loc[(df.Cabin.isnull()),'Cabin']='No'

    return df

data_test=set_cabin(data_test)

# One-hot encoding

# Cabin,Embarked,Sex,Pclass

dummies_Cabin=pd.get_dummies(data_test['Cabin'],prefix='Cabin')

dummies_Embarked=pd.get_dummies(data_test['Embarked'],prefix='Embarked')

dummies_Pclass=pd.get_dummies(data_test['Pclass'],prefix='Pclass')

dummies_Sex=pd.get_dummies(data_test['Sex'],prefix='Sex')

df_test=pd.concat([data_test,dummies_Cabin,dummies_Embarked,dummies_Pclass,dummies_Sex],axis=1)

df_test.drop(['Pclass','Name','Sex','Ticket','Cabin','Embarked'],axis=1,inplace=True)

# Standardize Age and Fare

a=df_test.Age

df_test['Age_scaled']=(a-a.mean())/(a.std())

df_test=df_test.drop('Age',axis=1)

b=df_test.Fare

df_test['Fare_scaled']=(b-b.mean())/(b.std())

df_test=df_test.drop('Fare',axis=1)

df_test.head()

	PassengerId	SibSp	Parch	Cabin_No	Embarked_Q	Embarked_S	Pclass_2	Pclass_3	Sex_female	Sex_male	Age_scaled	Fare_scaled
0	892	0	0	1	1	0	0	1	0	1	0.334592	-0.496043
1	893	1	0	1	0	1	0	1	1	0	1.323944	-0.510885
2	894	0	0	1	1	0	1	0	0	1	2.511166	-0.462780
3	895	0	0	1	0	1	0	1	0	1	-0.259019	-0.481127
4	896	1	1	1	0	1	0	1	1	0	-0.654760	-0.416242
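The code above standardizes the test set with its own mean and standard deviation. A common alternative, sketched here as a suggestion rather than as the article's method, is to reuse the statistics computed on the training set so that both sets go through exactly the same transformation. All values and variable names are illustrative.

```python
import pandas as pd

train_age = pd.Series([20.0, 30.0, 40.0])  # illustrative training values
test_age = pd.Series([30.0, 50.0])         # illustrative test values

# Statistics are estimated on the training data only
mu, sigma = train_age.mean(), train_age.std()

train_scaled = (train_age - mu) / sigma
test_scaled = (test_age - mu) / sigma      # same shift and scale as for training
```

With this approach a raw value of 30 maps to the same scaled value in both sets, which is what a fitted model expects.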

test=df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')

predictions=clf.predict(test)

result=pd.DataFrame({'PassengerId':data_test['PassengerId'].values,

                     'Survived':predictions.astype(np.int32)})

result.to_csv('logistic_regression_predictions.csv',index=False)

# Read back logistic_regression_predictions.csv

pd.read_csv('logistic_regression_predictions.csv').head(10)

	PassengerId	Survived
0	892	0
1	893	0
2	894	0
3	895	0
4	896	1
5	897	0
6	898	1
7	899	0
8	900	1
9	901	0

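Project 3: MovieLens Movie Data Analysis and Visualization

MovieLens movie rating data analysis (part 1)

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

Reading the data

# Read user information from the users table

users = pd.read_table('users.dat', header=None, names=['UserID','Gender','Age','Occupation','Zip-code'], sep='::',engine='python')

# Print the number of rows; there are 6040 records

print(len(users))

# Show the first five records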

users.head(5)

	UserID	Gender	Age	Occupation	Zip-code
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455
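The .dat files use the two-character separator '::', which needs the Python parsing engine. The sketch below parses three of the rows shown above from an in-memory string; reading from io.StringIO and forcing Zip-code to str (to keep leading zeros such as 02460) are assumptions made for a self-contained example.

```python
import io
import pandas as pd

# Three rows copied from the users table above, in the raw '::' format
raw = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"

users = pd.read_csv(
    io.StringIO(raw),
    sep="::",            # multi-character separator
    engine="python",     # the C engine does not handle this separator
    header=None,
    names=["UserID", "Gender", "Age", "Occupation", "Zip-code"],
    dtype={"Zip-code": str},  # keep zip codes as text
)
```

Reading zip codes as text prevents a value like 02460 from being turned into the integer 2460.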

# Load the ratings table the same way

ratings = pd.read_table('ratings.dat', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], sep='::',engine='python')

# Print the number of rows

print(len(ratings))

print(ratings.head(5))

# Load the movies table the same way

movies = pd.read_table('movies.dat', header=None, names=['MovieID', 'Title', 'Genres'], sep='::',engine='python')

print(len(movies))

print(movies.head(5))

1000209

   UserID  MovieID  Rating  Timestamp

0       1     1193       5  978300760

1       1      661       3  978302109

2       1      914       3  978301968

3       1     3408       4  978300275

4       1     2355       5  978824291

3883

   MovieID                               Title                        Genres

0        1                    Toy Story (1995)   Animation|Children's|Comedy

1        2                      Jumanji (1995)  Adventure|Children's|Fantasy

2        3             Grumpier Old Men (1995)                Comedy|Romance

3        4            Waiting to Exhale (1995)                  Comedy|Drama

4        5  Father of the Bride Part II (1995)                        Comedy

Merging the tables

# After loading, the three tables resemble tables in a relational database.

# To analyze them they must be merged: first users with ratings, then the result with movies.

data = pd.merge(pd.merge(users, ratings), movies)

data.tail(5)

	UserID	Gender	Age	Occupation	Zip-code	MovieID	Rating	Timestamp	Title	Genres
1000204	5949	M	18	17	47901	2198	5	958846401	Modulations (1998)	Documentary
1000205	5675	M	35	14	30030	2703	3	976029116	Broken Vessels (1998)	Drama
1000206	5780	M	18	17	92886	2845	1	958153068	White Boys (1999)	Drama
1000207	5851	F	18	20	55410	3607	5	957756608	One Little Indian (1973)	Comedy|Drama|Western
1000208	5938	M	25	1	35401	2909	4	957273353	Five Wives, Three Secretaries and Me (1998)	Documentary
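When no key is given, pd.merge joins on the columns that the two frames have in common: UserID for users and ratings, then MovieID for the result and movies. The sketch below spells those keys out on tiny made-up tables to show the equivalence.

```python
import pandas as pd

# Tiny stand-ins for the three tables (values are illustrative)
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})
ratings = pd.DataFrame({"UserID": [1, 1, 2],
                        "MovieID": [10, 20, 10],
                        "Rating": [5, 3, 4]})
movies = pd.DataFrame({"MovieID": [10, 20], "Title": ["Movie A", "Movie B"]})

# Implicit keys, as in the code above
data = pd.merge(pd.merge(users, ratings), movies)

# The same joins with the keys written out
explicit = users.merge(ratings, on="UserID").merge(movies, on="MovieID")
```

Naming the keys explicitly makes the join easier to read and guards against an unintended match if two tables later share another column name.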

Initial descriptive analysis

data.describe()

	UserID	Age	Occupation	MovieID	Rating	Timestamp
count	1.000209e+06	1.000209e+06	1.000209e+06	1.000209e+06	1.000209e+06	1.000209e+06
mean	3.024512e+03	2.973831e+01	8.036138e+00	1.865540e+03	3.581564e+00	9.722437e+08
std	1.728413e+03	1.175198e+01	6.531336e+00	1.096041e+03	1.117102e+00	1.215256e+07
min	1.000000e+00	1.000000e+00	0.000000e+00	1.000000e+00	1.000000e+00	9.567039e+08
25%	1.506000e+03	2.500000e+01	2.000000e+00	1.030000e+03	3.000000e+00	9.653026e+08
50%	3.070000e+03	2.500000e+01	7.000000e+00	1.835000e+03	4.000000e+00	9.730180e+08
75%	4.476000e+03	3.500000e+01	1.400000e+01	2.770000e+03	4.000000e+00	9.752209e+08
max	6.040000e+03	5.600000e+01	2.000000e+01	3.952000e+03	5.000000e+00	1.046455e+09

data.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 1000209 entries, 0 to 1000208

Data columns (total 10 columns):

 #   Column      Non-Null Count    Dtype

---  ------      --------------    -----

 0   UserID      1000209 non-null  int64

 1   Gender      1000209 non-null  object

 2   Age         1000209 non-null  int64

 3   Occupation  1000209 non-null  int64

 4   Zip-code    1000209 non-null  object

 5   MovieID     1000209 non-null  int64

 6   Rating      1000209 non-null  int64

 7   Timestamp   1000209 non-null  int64

 8   Title       1000209 non-null  object

 9   Genres      1000209 non-null  object

dtypes: int64(6), object(4)

memory usage: 83.9+ MB

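Inspecting the data

# Each merged record carries the user's age, occupation, gender and zip code together with the movie ID, rating, timestamp, title and genres.

# For example, look at the records for the user with UserID 1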

data[data.UserID==1].head()

	UserID	Gender	Age	Occupation	Zip-code	MovieID	Rating	Timestamp	Title	Genres
0	1	F	1	10	48067	1193	5	978300760	One Flew Over the Cuckoo's Nest (1975)	Drama
1725	1	F	1	10	48067	661	3	978302109	James and the Giant Peach (1996)	Animation|Children's|Musical
2250	1	F	1	10	48067	914	3	978301968	My Fair Lady (1964)	Musical|Romance
2886	1	F	1	10	48067	3408	4	978300275	Erin Brockovich (2000)	Drama
4201	1	F	1	10	48067	2355	5	978824291	Bug's Life, A (1998)	Animation|Children's|Comedy

r = data['Zip-code'].value_counts()

r = r.sort_values(ascending=False).head(10)

r.plot(kind='bar')

plt.xticks(rotation=45)

plt.show()

# Count the number of ratings per movie, stored in data_rating_num

data_rating_num=data.groupby('Title').size()

data_rating_num.head(10)

Title

$1,000,000 Duck (1971)                37

'Night Mother (1986)                  70

'Til There Was You (1997)             52

'burbs, The (1989)                   303

...And Justice for All (1979)        199

1-900 (1994)                           2

10 Things I Hate About You (1999)    700

101 Dalmatians (1961)                565

101 Dalmatians (1996)                364

12 Angry Men (1957)                  616

dtype: int64

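# Sort by number of ratings, highest first

data_rating_num_sorted=data_rating_num.sort_values(ascending=False)

# Keep only movies with more than 300 and fewer than 400 ratings

data_rating_num_sorted = data_rating_num_sorted[(data_rating_num_sorted>300) & (data_rating_num_sorted<400)]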

data_rating_num_sorted

Title

Yellow Submarine (1968)          399

Anaconda (1997)                  399

Snow Falling on Cedars (1999)    398

His Girl Friday (1940)           397

First Blood (1982)               397

                                ...

Godzilla (Gojira) (1954)         301

Rambo III (1988)                 301

Zero Effect (1998)               301

Short Cuts (1993)                301

Old Yeller (1957)                301

Length: 256, dtype: int64
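The count, filter and sort steps above can be sketched on a handful of made-up titles. The thresholds here (more than 2 and fewer than 5 ratings) are scaled down for illustration.

```python
import pandas as pd

# Illustrative ratings table: A has 5 ratings, B has 2, C has 3
ratings = pd.DataFrame({"Title": ["A"] * 5 + ["B"] * 2 + ["C"] * 3})

counts = ratings.groupby("Title").size()   # number of rows per title

# Combine two boolean conditions with & (each side in parentheses)
popular = counts[(counts > 2) & (counts < 5)].sort_values(ascending=False)
```

Only titles whose count falls strictly between the two bounds remain, so A (too many) and B (too few) are dropped.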

Mean rating per movie by gender, the gap between the two, and sorting by that gap

# Mean rating per movie for each gender, stored in data_gender

data_gender=data.pivot_table(index='Title',columns='Gender',values='Rating',aggfunc='mean')

data_gender = data_gender.loc[data_rating_num_sorted.index]

data_gender.head()

Gender	F	M
Title
Yellow Submarine (1968)	3.714286	3.689286
Anaconda (1997)	2.000000	2.248447
Snow Falling on Cedars (1999)	3.482014	3.374517
His Girl Friday (1940)	4.312500	4.213439
First Blood (1982)	3.285714	3.599448

# Absolute difference between the female and male mean ratings, added as a new column

data_gender['diff']=np.fabs(data_gender.F-data_gender.M)

data_gender.head()

Gender	F	M	diff
Title
Yellow Submarine (1968)	3.714286	3.689286	0.025000
Anaconda (1997)	2.000000	2.248447	0.248447
Snow Falling on Cedars (1999)	3.482014	3.374517	0.107497
His Girl Friday (1940)	4.312500	4.213439	0.099061
First Blood (1982)	3.285714	3.599448	0.313733
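The per-gender pivot and the gap column can be reproduced on six made-up ratings. Using Series.abs() instead of np.fabs is a stylistic substitution that gives the same result here.

```python
import pandas as pd

# Illustrative ratings for two movies
data = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "B", "B"],
    "Gender": ["F", "M", "M", "F", "F", "M"],
    "Rating": [5, 3, 4, 2, 4, 3],
})

# Rows are titles, columns are genders, cells are mean ratings
by_gender = data.pivot_table(index="Title", columns="Gender",
                             values="Rating", aggfunc="mean")

# Absolute gap between the two means
by_gender["diff"] = (by_gender["F"] - by_gender["M"]).abs()
```

The column has to be read back as by_gender["diff"], because by_gender.diff refers to the DataFrame method of that name.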

# Sort by the gender gap, largest first, stored in data_gender_sorted

data_gender_sorted=data_gender.sort_values(by='diff',ascending=False)

data_gender_sorted_top10 = data_gender_sorted.head(10)

data_gender_sorted_top10

Gender	F	M	diff
Title
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359
Jumpin' Jack Flash (1986)	3.254717	2.578358	0.676359
Longest Day, The (1962)	3.411765	4.031447	0.619682
Cable Guy, The (1996)	2.250000	2.863787	0.613787
For a Few Dollars More (1965)	3.409091	3.953795	0.544704
Porky's (1981)	2.296875	2.836364	0.539489
Fright Night (1985)	2.973684	3.500000	0.526316
Anastasia (1997)	3.800000	3.281609	0.518391
French Kiss (1995)	3.535714	3.056962	0.478752
Little Shop of Horrors, The (1960)	3.650000	3.179688	0.470312
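The pivot-and-difference steps above can be tried on a small hand-made table. The sketch below is illustrative only: the `toy` frame and its values are made up, and only `pivot_table`, `abs` and `sort_values` mirror the calls used above.

```python
import pandas as pd

# Hypothetical ratings: two movies rated by female (F) and male (M) users
toy = pd.DataFrame({
    'Title':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'Gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'Rating': [5, 3, 4, 2, 4, 3],
})

# Mean rating per title and gender: one row per title, one column per gender
by_gender = toy.pivot_table(index='Title', columns='Gender',
                            values='Rating', aggfunc='mean')

# Absolute gap between the two group means, largest first
by_gender['diff'] = (by_gender['F'] - by_gender['M']).abs()
ranked = by_gender.sort_values(by='diff', ascending=False)
print(ranked)
```

Movie A ends up first because its two group means differ, while movie B's means are equal.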

genres = movies.set_index('Title').loc[data_gender_sorted_top10.index, 'Genres']

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning

data_gender_sorted_top10 = data_gender_sorted_top10.copy()

data_gender_sorted_top10['Genres'] = genres

data_gender_sorted_top10

Gender	F	M	diff	Genres
Title
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359	Comedy
Jumpin' Jack Flash (1986)	3.254717	2.578358	0.676359	Action\|Comedy\|Romance\|Thriller
Longest Day, The (1962)	3.411765	4.031447	0.619682	Action\|Drama\|War
Cable Guy, The (1996)	2.250000	2.863787	0.613787	Comedy
For a Few Dollars More (1965)	3.409091	3.953795	0.544704	Western
Porky's (1981)	2.296875	2.836364	0.539489	Comedy
Fright Night (1985)	2.973684	3.500000	0.526316	Comedy\|Horror
Anastasia (1997)	3.800000	3.281609	0.518391	Animation\|Children's\|Musical
French Kiss (1995)	3.535714	3.056962	0.478752	Comedy\|Romance
Little Shop of Horrors, The (1960)	3.650000	3.179688	0.470312	Comedy\|Horror

Compute each movie's mean rating and sort

# Mean rating per movie, restricted to movies with more than 100 ratings; stored in data_mean_rating

data_rating_num = data_rating_num[data_rating_num>100]

mask = data['Title'].isin(data_rating_num.index)

data_mean_rating = data[mask].pivot_table(index='Title', values=['Rating'])

data_mean_rating

	Rating
Title
'burbs, The (1989)	2.910891
...And Justice for All (1979)	3.713568
10 Things I Hate About You (1999)	3.422857
101 Dalmatians (1961)	3.596460
101 Dalmatians (1996)	3.046703
...	...
Young Guns II (1990)	2.907859
Young Sherlock Holmes (1985)	3.390501
Your Friends and Neighbors (1998)	3.376147
Zero Effect (1998)	3.750831
eXistenZ (1999)	3.256098

2006 rows × 1 columns

# Sort by mean rating, highest first

data_mean_rating_sorted=data_mean_rating.sort_values(by='Rating',ascending=False)

data_mean_rating_sorted.head()

	Rating
Title
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)	4.560510
Shawshank Redemption, The (1994)	4.554558
Godfather, The (1972)	4.524966
Close Shave, A (1995)	4.520548
Usual Suspects, The (1995)	4.517106

Take the 20 movies with the most ratings

# Sort by number of ratings and take the first 20

hot_movies_sorted=data_rating_num.sort_values(ascending=False)

hot_movies_sorted[:20]

Title

American Beauty (1999)                                   3428

Star Wars: Episode IV - A New Hope (1977)                2991

Star Wars: Episode V - The Empire Strikes Back (1980)    2990

Star Wars: Episode VI - Return of the Jedi (1983)        2883

Jurassic Park (1993)                                     2672

Saving Private Ryan (1998)                               2653

Terminator 2: Judgment Day (1991)                        2649

Matrix, The (1999)                                       2590

Back to the Future (1985)                                2583

Silence of the Lambs, The (1991)                         2578

Men in Black (1997)                                      2538

Raiders of the Lost Ark (1981)                           2514

Fargo (1996)                                             2513

Sixth Sense, The (1999)                                  2459

Braveheart (1995)                                        2443

Shakespeare in Love (1998)                               2369

Princess Bride, The (1987)                               2318

Schindler's List (1993)                                  2304

L.A. Confidential (1997)                                 2288

Groundhog Day (1993)                                     2278

dtype: int64

Look at the age distribution of the users and visualize it as a histogram

import matplotlib.pyplot as plt

users.Age.plot.hist(bins=10, edgecolor='white')

plt.title('users_ages')

plt.xlabel('age')

plt.ylabel('count of age')

xticks = np.linspace(np.min(users.Age), np.max(users.Age), 11)

plt.xticks(xticks)

plt.show()

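Age-group distribution of the users, intended as one bin per 10 years (note that bins=10 gives 10 equal-width bins over the observed age range, which are not necessarily 10 years wide)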

data['Age'].plot(kind='hist',bins=10)

plt.xticks(rotation=45)

plt.show()
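If the groups really should be 10 years wide, explicit bin edges can be passed to `pd.cut`. This is a minimal sketch with made-up ages; the `ages` Series and the edge list are assumptions, not values from the data set.

```python
import pandas as pd

ages = pd.Series([1, 18, 25, 25, 35, 45, 50, 56])  # made-up ages

# Bin edges every 10 years: [0, 10), [10, 20), ..., [50, 60)
edges = list(range(0, 61, 10))
groups = pd.cut(ages, bins=edges, right=False)

# Count users per decade, keeping the bins in age order
counts = groups.value_counts().sort_index()
print(counts)
```

Calling `counts.plot(kind='bar')` would then draw one bar per decade.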

Count how often each genre occurs in the data set

df = pd.DataFrame(movies.Genres.str.split('|').tolist())

df = df.stack().reset_index()

df = df.drop(['level_0', 'level_1'], axis=1)

genres = df.groupby(0).size()

genres.sort_values(ascending=False).plot(kind='bar')

plt.xticks(rotation=45)

plt.show()
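The same genre counts can be obtained more directly with `str.split` followed by `explode`. The sketch below uses a made-up `movies_toy` frame; only the `Genres` column name and the `|` separator are taken from the code above.

```python
import pandas as pd

# Hypothetical stand-in for the movies table
movies_toy = pd.DataFrame({
    'Title': ['m1', 'm2', 'm3'],
    'Genres': ['Comedy|Romance', 'Comedy', 'Action|Comedy|Thriller'],
})

genre_counts = (movies_toy['Genres']
                .str.split('|')    # one list of genres per movie
                .explode()         # one row per (movie, genre) pair
                .value_counts())   # frequency of each genre, descending
print(genre_counts)
```

`explode` replaces the `stack` / `reset_index` / `drop` sequence, and `value_counts` already sorts in descending order.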

Project 4: Second-hand housing listings data analysis and visualization

# Import modules

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

# Font settings so that Chinese labels and minus signs render correctly

plt.rcParams['font.sans-serif']=['SimHei']

plt.rcParams['axes.unicode_minus']=False

# All housing listings

house=pd.read_csv('house.csv')

house.head(1)

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo
0	0	宝星华庭一层带花园，客厅挑高，通透四居室。房主自荐	宝星国际三期	底层(共22层)2010年建板塔结合	4室1厅	298.79平米	底层(共22层)2010年建板塔结合	距离15号线望京东站680米房本满五年	2598	86951	53人关注 / 共44次带看 / 一年前发布

Descriptive overview of the data

house.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 16108 entries, 0 to 16107

Data columns (total 11 columns):

index         16108 non-null int64

title         16108 non-null object

community     16108 non-null object

years         16106 non-null object

housetype     16108 non-null object

square        16108 non-null object

floor         16106 non-null object

taxtype       15361 non-null object

totalPrice    16108 non-null int64

unitPrice     16108 non-null int64

followInfo    16108 non-null object

dtypes: int64(3), object(8)

memory usage: 1.4+ MB

# All residential-community information

community=pd.read_csv('community_describe.csv')

community.head()

	index	id	community	district	bizcircle	tagList	onsale
0	0	1111000004310	什坊院甲3号院	海淀	田村	NaN	0
1	1	1111027373682	大慧寺6号院	海淀	白石桥	NaN	2
2	2	1111027373683	东花市北里东区	东城	东花市	近地铁1号线王府井站	0
3	3	1111027373684	东花市北里西区	东城	东花市	近地铁7号线广渠门内站	7
4	4	1111027373685	东花市北里中区	东城	东花市	近地铁2号线朝阳门站	9

# Merge the community table into the listings to get each listing's district and business circle

house_detail=pd.merge(house,community,on='community')

# Show the first row

house_detail.head(1)

# len(house_detail)

	index_x	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	index_y	id	district	bizcircle	tagList	onsale
0	0	宝星华庭一层带花园，客厅挑高，通透四居室。房主自荐	宝星国际三期	底层(共22层)2010年建板塔结合	4室1厅	298.79平米	底层(共22层)2010年建板塔结合	距离15号线望京东站680米房本满五年	2598	86951	53人关注 / 共44次带看 / 一年前发布	1535	1111027376204	朝阳	望京	近地铁15号线望京东站	7
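`pd.merge` performs an inner join by default, so listings whose community is missing from the community table are silently dropped. The sketch below, on made-up frames (`house_toy` and `community_toy` are hypothetical), shows one way to check for that with `how='left'` and `indicator=True`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
house_toy = pd.DataFrame({'community': ['A', 'A', 'B', 'C'],
                          'totalPrice': [500, 620, 430, 900]})
community_toy = pd.DataFrame({'community': ['A', 'B'],
                              'district': ['Chaoyang', 'Haidian']})

# Default inner join: the listing in community 'C' disappears
inner = pd.merge(house_toy, community_toy, on='community')

# Left join with an indicator column keeps every listing and marks the unmatched ones
left = pd.merge(house_toy, community_toy, on='community',
                how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']
print(len(inner), len(left), unmatched['community'].tolist())
```

Comparing `len(house)` with `len(house_detail)` gives the same information for the real tables.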

Summary of the numeric columns

house.describe()

	index	totalPrice	unitPrice
count	16108.000000	16108.000000	16108.000000
mean	8053.500000	747.983735	77656.823814
std	4650.123403	536.202306	23616.114546
min	0.000000	15.000000	2539.000000
25%	4026.750000	439.000000	60449.500000
50%	8053.500000	600.000000	75094.000000
75%	12080.250000	870.000000	91474.250000
max	16107.000000	12500.000000	159991.000000

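Preprocessing 1: extract numbers from strings

# Return the number that precedes the marker `str` in a string, or None if the marker is absent
# (the parameter name shadows the built-in str; it is kept because the calls pass str=...)

def data_ad(select_data,str):

    if str in select_data:

        return float(select_data[0:select_data.find(str)])

    else:

        return None

# Convert the floor-area column to numbers

house['square']=house['square'].apply(data_ad,str='平米')

# Inspect the result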

house.head(1)

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo
0	0	宝星华庭一层带花园，客厅挑高，通透四居室。房主自荐	宝星国际三期	底层(共22层)2010年建板塔结合	4室1厅	298.79	底层(共22层)2010年建板塔结合	距离15号线望京东站680米房本满五年	2598	86951	53人关注 / 共44次带看 / 一年前发布
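A vectorized alternative to the row-wise `apply` is `Series.str.extract` with a regular expression. This is a sketch on made-up values; the `square_raw` Series is hypothetical and the pattern assumes the number is written immediately before `平米`.

```python
import pandas as pd

# Made-up raw values, including one without the unit and one missing value
square_raw = pd.Series(['298.79平米', '81.2平米', '车位', None])

# Capture the number that immediately precedes the unit;
# rows that do not match (or are missing) become NaN
square_num = (square_raw
              .str.extract(r'([\d.]+)平米', expand=False)
              .astype(float))
print(square_num)
```

Unlike the `in` test inside `data_ad`, this form also tolerates missing values in the column.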

house.describe()

	index	square	totalPrice	unitPrice	attention
count	16026.000000	16026.000000	16026.000000	16026.000000	16026.000000
mean	8061.290715	95.997246	743.136029	77796.268876	58.154936
std	4648.836720	57.606275	510.155956	23441.070459	68.642351
min	0.000000	15.290000	40.000000	11393.000000	0.000000
25%	4037.250000	61.110000	440.000000	60589.250000	17.000000
50%	8066.500000	81.200000	599.000000	75184.500000	37.000000
75%	12085.750000	112.757500	870.000000	91516.000000	73.000000
max	16107.000000	2623.280000	12000.000000	159991.000000	1401.000000

house[house['square']<16]

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	attention	layer	year
15260	15260	智德北巷（北河沿大街）+小户型一居+南向	智德北巷	中楼层(共6层)1985年建板楼	1室0厅	15.29	中楼层(共6层)1985年建板楼	距离5号线灯市口站1113米	220	143885	56人关注 / 共2次带看 / 8天以前发布	56.0	中楼层	1985年

Housing layout types

house.housetype.value_counts()

2室1厅     6582

3室1厅     2534

1室1厅     2472

3室2厅     1424

2室2厅     1018

1室0厅      620

4室2厅      496

4室1厅      181

2房间1卫     100

5室2厅       92

1房间1卫      87

1室2厅       64

4室3厅       55

3房间1卫      44

3室0厅       35

2室0厅       34

车位         32

6室2厅       29

5室3厅       22

联排别墅       19

1房间0卫      16

5室1厅       15

6室3厅       13

独栋别墅       12

3室3厅       11

4室0厅       10

叠拼别墅       10

双拼别墅        9

4房间2卫       9

4房间1卫       6

2房间2卫       6

6室1厅        5

5室4厅        4

6室4厅        3

7室3厅        3

5室5厅        3

3房间2卫       3

5房间3卫       2

2室3厅        2

9室4厅        2

6房间4卫       2

2房间0卫       2

6房间2卫       2

3房间3卫       2

4房间3卫       2

7室2厅        2

8室2厅        1

5室0厅        1

6室0厅        1

2房间3卫       1

4室4厅        1

5房间2卫       1

7室0厅        1

8房间5卫       1

3室4厅        1

8室4厅        1

6房间3卫       1

7室1厅        1

Name: housetype, dtype: int64

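Preprocessing 2: drop the parking-space listings

car=house[house.housetype.str.contains('车位')]

# Size of the parking-space subset (rows, columns)

car.shape

(32, 11)

# Drop the parking-space listings

house.drop(car.index,inplace=True)

# Number of records remaining

house.shape[0]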

Analysis 1: the five most expensive villas

villa=house[house.housetype.str.contains('别墅')]

# Number of villa listings

villa.shape[0]

# Sort by total price, highest first

villa.sort_values(by='totalPrice',ascending=False).head(5)

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo
8020	8020	香山清琴二期独栋别墅，毛坯房原始户型，花园1200平米	香山清琴	2层2007年建	独栋别墅	NaN	2层2007年建	房本满五年	12500	124681	45人关注 / 共7次带看 / 2个月以前发布
102	102	千尺独栋北入户红顶商人金融界入住社区	龙湖颐和原著	2层2010年建	独栋别墅	NaN	2层2010年建	距离4号线西苑站839米房本满五年	12000	112012	231人关注 / 共26次带看 / 一年前发布
2729	2729	临湖独栋别墅花园半亩观景湖面和绿化满五年有车库房主自荐	紫玉山庄	3层2000年建	独栋别墅	NaN	3层2000年建	房本满五年	6000	148618	108人关注 / 共16次带看 / 5个月以前发布
3141	3141	银湖别墅独栋望京公园旁五环里封闭式社区	银湖别墅	3层1998年建	独栋别墅	NaN	3层1998年建	房本满五年	5000	130348	9人关注 / 共3次带看 / 5个月以前发布
4112	4112	首排别墅位置好全景小区绿化和人工湖有车库	亚运新新家园朗月园一期	1层2003年建	联排别墅	NaN	1层2003年建	房本满五年	3800	82364	0人关注 / 共4次带看 / 4个月以前发布

Preprocessing 3: drop the villa listings

house.drop(villa.index,inplace=True)

# Number of records remaining

house.shape[0]

Analysis 2: distribution of housing layouts

# Layout distribution

house.housetype.value_counts()

2室1厅     6582

3室1厅     2534

1室1厅     2472

3室2厅     1424

2室2厅     1018

1室0厅      620

4室2厅      496

4室1厅      181

2房间1卫     100

5室2厅       92

1房间1卫      87

1室2厅       64

4室3厅       55

3房间1卫      44

3室0厅       35

2室0厅       34

6室2厅       29

5室3厅       22

1房间0卫      16

5室1厅       15

6室3厅       13

3室3厅       11

4室0厅       10

4房间2卫       9

2房间2卫       6

4房间1卫       6

6室1厅        5

5室4厅        4

6室4厅        3

3房间2卫       3

7室3厅        3

5室5厅        3

5房间3卫       2

2室3厅        2

9室4厅        2

6房间4卫       2

2房间0卫       2

6房间2卫       2

3房间3卫       2

4房间3卫       2

7室2厅        2

8室2厅        1

5室0厅        1

6室0厅        1

2房间3卫       1

4室4厅        1

5房间2卫       1

7室0厅        1

8房间5卫       1

3室4厅        1

8室4厅        1

6房间3卫       1

7室1厅        1

Name: housetype, dtype: int64

# Bar chart of the ten most common layouts

house_type=house.housetype.value_counts()

house_type.head(10).plot(kind='bar',title='户型数量分布',rot=30)

# plt.show()

<matplotlib.axes._subplots.AxesSubplot at 0x12dae357a90>

Analysis 3: the five listings with the most followers

house['attention']=house['followInfo'].apply(data_ad,str='人关注')

house.head(5)

house.sort_values(by='attention',ascending=False).head()

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	attention
47	47	弘善家园南向开间，满两年，免增值税	弘善家园	中楼层(共28层)2009年建塔楼	1室0厅	42.64	中楼层(共28层)2009年建塔楼	距离10号线十里河站698米房本满两年随时看房	265	62149	1401人关注 / 共305次带看 / 一年前发布	1401.0
2313	2313	四惠东康家园南向一居室地铁1号线出行房主自荐	康家园	顶层(共6层)1995年建板楼	1室1厅	41.97	顶层(共6层)1995年建板楼	距离1号线四惠东站974米房本满五年随时看房	262	62426	1005人关注 / 共86次带看 / 6个月以前发布	1005.0
990	990	远见名苑东南两居满五年家庭唯一住房诚心出售房主自荐	远见名苑	中楼层(共24层)2004年建塔楼	2室1厅	90.14	中楼层(共24层)2004年建塔楼	距离7号线达官营站516米房本满五年	811	89972	979人关注 / 共50次带看 / 8个月以前发布	979.0
2331	2331	荣丰二期朝南复式无遮挡全天采光房主自荐	荣丰2008	中楼层(共10层)2005年建塔楼	1室1厅	32.54	中楼层(共10层)2005年建塔楼	距离7号线达官营站1028米房本满五年随时看房	400	122926	972人关注 / 共369次带看 / 6个月以前发布	972.0
915	915	通州万达北苑地铁站天时名苑大两居可改3居	天时名苑	顶层(共9层)2009年建板塔结合	2室2厅	121.30	顶层(共9层)2009年建板塔结合	距离八通线通州北苑站602米房本满五年	645	53174	894人关注 / 共228次带看 / 8个月以前发布	894.0

Analysis 4: layouts versus follower counts

# Keep only layouts with more than 50 listings for the chart

type_interest_group=house.groupby(house['housetype']).agg({'housetype':'count','attention':'sum'})

interest_sort=type_interest_group[type_interest_group['housetype']>50]

interest_sort.plot(kind='barh',title='二手房户型和关注人数分布', y='attention')

interest_sort

	housetype	attention
housetype
1室0厅	620	32920.0
1室1厅	2472	141893.0
1室2厅	64	2614.0
1房间1卫	87	2267.0
2室1厅	6582	394987.0
2室2厅	1018	49526.0
2房间1卫	100	3006.0
3室1厅	2534	162205.0
3室2厅	1424	81140.0
4室1厅	181	10667.0
4室2厅	496	30661.0
4室3厅	55	2846.0
5室2厅	92	4703.0

Analysis 5: floor-area distribution

# Floor-area distribution

area_level=[0,50,100,150,200,250,300,350,400,450,500]

label_level=['小于50','50-100','100-150','150-200','200-250','250-300','300-350','350-400','400-450','450-500']

area_cut=pd.cut(house['square'],bins=area_level,labels=label_level)

area_cut.value_counts()[::-1].plot(kind='barh', title='二手房面积分布',fontsize='small')

<matplotlib.axes._subplots.AxesSubplot at 0x12dae064a90>
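One detail of `pd.cut` matters for the bin labels: by default each bin is closed on the right, so a listing of exactly 50 square meters lands in the first bin, whose label reads "less than 50". The sketch below, with made-up areas and shortened hypothetical labels, compares the default with `right=False`.

```python
import pandas as pd

square = pd.Series([49.9, 50.0, 99.0, 100.0])   # made-up floor areas
edges = [0, 50, 100, 150]
labels = ['<50', '50-100', '100-150']            # hypothetical labels

# Default: bins are (0, 50], (50, 100], (100, 150]
default_bins = pd.cut(square, bins=edges, labels=labels)

# right=False: bins are [0, 50), [50, 100), [100, 150)
left_closed = pd.cut(square, bins=edges, labels=labels, right=False)

print(pd.DataFrame({'square': square,
                    'default': default_bins,
                    'right=False': left_closed}))
```

With `right=False` the labels match the usual reading of "50-100" as "at least 50 and below 100".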

Analysis 6: mean unit price of listings in each district

house_unitPrice=house_detail.groupby('district')['unitPrice'].mean()

house_unitPrice.plot(kind='barh', title='各个行政区房源均价')

# agg({'unitPrice':'mean'})

<matplotlib.axes._subplots.AxesSubplot at 0x12dae19c7f0>

Box plot of unit prices in each district

import seaborn as sns

price=house_detail[['district','unitPrice']]

price.boxplot(by='district', grid=0)

<matplotlib.axes._subplots.AxesSubplot at 0x12dae1ede80>

Number of listings for sale in each district

house_onsale=house_detail.groupby('district')['onsale'].count()

house_onsale.plot(kind='bar',rot=30,title='各个行政区房源在售数量')

<matplotlib.axes._subplots.AxesSubplot at 0x12dae3c9fd0>

Analysis 7: total listing prices compared across districts

price=house_detail[['district','totalPrice']]

sns.boxplot(x='district',y='totalPrice',data=price)

plt.ylim((0,6000))

(0, 6000)

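The box plot shows that the median total price in every district is below 10 million yuan (1000 on the axis, since totalPrice is in units of 10,000 yuan), and that total prices contain many high outliers.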

Analysis 8: rank the business circles (bizcircle) by mean price per square meter and draw a bar chart

bizcircle_unitPrice=house_detail.groupby('bizcircle')['unitPrice'].mean().sort_values(ascending=False)

bizcircle_unitPrice.head(15).plot(kind='bar',title='各个区域均价分布',rot=30)

plt.legend(['均价'])

# plt.show()

<matplotlib.legend.Legend at 0x12daceabc50>

Analysis 9: communities ranked by mean unit price

community_unitPrice=house_detail.groupby('community')['unitPrice'].mean().sort_values(ascending=False)

community_unitPrice.head(10).plot(kind='bar',title='各个小区均价分布',rot=30)

plt.legend(['均价'])

<matplotlib.legend.Legend at 0x12dadf03898>

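Analysis 10: distribution of floor levels

# Return the text before the marker `str` (the floor-level label), or a placeholder string if the marker is absent

def data_ads(select_data,str):

    if str in select_data:

        return (select_data[0:select_data.find(str)])

    else:

        return '没有提取到楼层信息'

# Extract the floor level, i.e. the text before the opening parenthesis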

house['layer']=house['years'].apply(data_ads,str='(')

house.head(3)

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	attention	layer
0	0	宝星华庭一层带花园，客厅挑高，通透四居室。房主自荐	宝星国际三期	底层(共22层)2010年建板塔结合	4室1厅	298.79	底层(共22层)2010年建板塔结合	距离15号线望京东站680米房本满五年	2598	86951	53人关注 / 共44次带看 / 一年前发布	53.0	底层
1	1	三面采光全明南北朝向正对小区绿地花园	顶秀青溪	中楼层(共11层)2008年建板塔结合	3室2厅	154.62	中楼层(共11层)2008年建板塔结合	距离5号线立水桥站1170米房本满两年随时看房	1000	64675	323人关注 / 共579次带看 / 一年前发布	323.0	中楼层
2	2	沁园公寓三居室距离苏州街地铁站383米	沁园公寓	低楼层(共24层)1999年建塔楼	3室2厅	177.36	低楼层(共24层)1999年建塔楼	距离10号线苏州街站383米房本满五年	1200	67659	185人关注 / 共108次带看 / 一年前发布	185.0	低楼层

# Floor-level distribution, drawn as a bar chart

house['layer'].value_counts().plot(kind='bar',rot=30)

<matplotlib.axes._subplots.AxesSubplot at 0x12dae010da0>

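Analysis 11: mean total price by construction year, 2000 to 2016

# Extract the construction year: the five characters before the marker `str`, e.g. '2010年'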

def data_adst(select_data,str):

    if str in select_data:

        return (select_data[select_data.find(str)-5:select_data.find(str)])

    else:

        return None

house['year']=house['years'].apply(data_adst,str='建')

house.head(4)

	index	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	attention	layer	year
0	0	宝星华庭一层带花园，客厅挑高，通透四居室。房主自荐	宝星国际三期	底层(共22层)2010年建板塔结合	4室1厅	298.79	底层(共22层)2010年建板塔结合	距离15号线望京东站680米房本满五年	2598	86951	53人关注 / 共44次带看 / 一年前发布	53.0	底层	2010年
1	1	三面采光全明南北朝向正对小区绿地花园	顶秀青溪	中楼层(共11层)2008年建板塔结合	3室2厅	154.62	中楼层(共11层)2008年建板塔结合	距离5号线立水桥站1170米房本满两年随时看房	1000	64675	323人关注 / 共579次带看 / 一年前发布	323.0	中楼层	2008年
2	2	沁园公寓三居室距离苏州街地铁站383米	沁园公寓	低楼层(共24层)1999年建塔楼	3室2厅	177.36	低楼层(共24层)1999年建塔楼	距离10号线苏州街站383米房本满五年	1200	67659	185人关注 / 共108次带看 / 一年前发布	185.0	低楼层	1999年
3	3	金星园东南向户型，四居室设计，中间楼层	金星园	中楼层(共28层)2007年建塔楼	4室2厅	245.52	中楼层(共28层)2007年建塔楼	距离机场线三元桥站1153米房本满五年	1650	67205	157人关注 / 共35次带看 / 一年前发布	157.0	中楼层	2007年

# Mean total price for buildings constructed from 2000 to 2016

data=house.groupby('year').agg({'totalPrice':'mean'})

data

data['2000年':'2016年'].plot(kind='bar',rot=30)

<matplotlib.axes._subplots.AxesSubplot at 0x12dae301f98>
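Slicing on labels such as '2000年' works only because the string index happens to sort correctly. A possibly more robust variant extracts the year as a number first. The sketch below uses a made-up `toy` frame; the pattern assumes the year is always written as four digits followed by `年建`.

```python
import pandas as pd

# Made-up rows in the same format as the `years` column
toy = pd.DataFrame({
    'years': ['底层(共22层)2010年建板塔结合', '中楼层(共6层)1985年建板楼',
              '2层2007年建', None],
    'totalPrice': [2598, 220, 12500, 300],
})

# Four digits followed by '年建'; rows without a match stay missing
year = toy['years'].str.extract(r'(\d{4})年建', expand=False)
toy['year'] = year.astype(float)   # float because of the missing value

# Keep construction years 2000..2016 (inclusive) and average the total price
recent = toy[toy['year'].between(2000, 2016)]
mean_price = recent.groupby('year')['totalPrice'].mean()
print(mean_price)
```

A numeric year also lets the range be changed without worrying about how the labels sort.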

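Putting it together: homes in the Wangjing area with three bedrooms and one living room, priced between 4 and 5 million yuan, and larger than 80 square meters

Step 1: find the listings whose business circle contains 望京 (Wangjing)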

myhouse=house_detail[house_detail.bizcircle.str.contains('望京')]

# myhouse.head(2)

len(myhouse)

Step 2: look at the layout distribution

house_type=myhouse['housetype'].value_counts()

house_type.head(10).plot(kind='bar',title='户型数量分布',rot=30)

house_type.head(10)

2室1厅     230

3室2厅     155

2室2厅     134

1室1厅     117

3室1厅     108

4室2厅      55

1室0厅      25

4室1厅      25

2房间1卫     13

5室2厅       8

Name: housetype, dtype: int64

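Step 3: keep the 3室1厅 listings priced between 4 and 5 million yuan and larger than 80 square meters

# 1. Keep the 3室1厅 (three bedrooms, one living room) listings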

myhouse=myhouse[myhouse.housetype.str.contains('3室1厅')]

len(myhouse)

# 2. Total price between 4 and 5 million yuan (totalPrice is in units of 10,000 yuan)

myhouse=myhouse.loc[(myhouse['totalPrice']>400)&(myhouse['totalPrice']<500)]

myhouse.head()

len(myhouse)

# Convert the floor-area strings to numbers with data_ad (defined in Preprocessing 1);
# house_detail was merged before that conversion, so its square column still holds strings

myhouse['square']=myhouse['square'].apply(data_ad,str='平米')

# 3. Floor area larger than 80 square meters

myhouse=myhouse.loc[myhouse.square>80]

len(myhouse)

myhouse.head()

	index_x	title	community	years	housetype	square	floor	taxtype	totalPrice	unitPrice	followInfo	index_y	id	district	bizcircle	tagList	onsale
7824	2806	花家地西里一区东西向三居室中间楼层带电梯房主自荐	花家地西里一区	中楼层(共12层)1997年建板塔结合	3室1厅	82.10	中楼层(共12层)1997年建板塔结合	房本满五年随时看房	480	58466	245人关注 / 共75次带看 / 5个月以前发布	820	1111027375067	朝阳	望京	近地铁14号线(东段)阜通站	8
14022	8669	经典三居室格局合理社区安静配套成熟房主自荐	中环南路5号院	顶层(共6层)1996年建板楼	3室1厅	88.51	顶层(共6层)1996年建板楼	距离14号线(东段)望京南站701米房本满两年	495	55926	35人关注 / 共0次带看 / 2个月以前发布	4718	1111027382477	朝阳	望京	近地铁14号线(东段)望京南站	1
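The three filtering steps can also be written as a single boolean mask, which avoids reassigning `myhouse` repeatedly. This is a sketch on a made-up `listings` frame whose `square` column is assumed to be numeric already.

```python
import pandas as pd

# Hypothetical listings with the columns used in the steps above
listings = pd.DataFrame({
    'bizcircle':  ['望京', '望京', '望京', '田村'],
    'housetype':  ['3室1厅', '3室1厅', '2室1厅', '3室1厅'],
    'totalPrice': [480, 520, 450, 470],       # units of 10,000 yuan
    'square':     [82.1, 95.0, 88.0, 90.0],   # square meters
})

# Combine all conditions; each comparison needs its own parentheses
mask = (
    listings['bizcircle'].str.contains('望京')
    & (listings['housetype'] == '3室1厅')
    & (listings['totalPrice'] > 400) & (listings['totalPrice'] < 500)
    & (listings['square'] > 80)
)
result = listings[mask]
print(result)
```

Only the first row satisfies every condition in this toy table.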

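Project 5: Telecom customer churn data analysis and visualization

Mobile customer churn prediction

# Import the modules used in the analysis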

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

# %matplotlib inline is an IPython/Jupyter magic command: it renders plots inline in the notebook, so the plt.show() call can be omitted.

plt.style.use('ggplot')

import seaborn as sns

sns.set_style('darkgrid')

sns.set_palette('muted')

# Load the Excel file

df = pd.read_excel('CustomerSurvival.xlsx')

df.head()

	ID	套餐金额	额外通话时长	额外流量	服务合约	集团用户	使用月数	流失用户
0	1	1	792.833333	-10.450067	0	0	25	0
1	2	1	121.666667	-21.141117	0	0	25	0
2	3	1	-30.000000	-25.655273	0	0	2	1
3	4	1	241.500000	-288.341254	1	1	25	0
4	5	1	1629.666667	-23.655505	0	1	25	0

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4975 entries, 0 to 4974

Data columns (total 10 columns):

ID        4975 non-null int64

套餐金额      4975 non-null int64

额外通话时长    4975 non-null float64

额外流量      4975 non-null float64

改变行为      4975 non-null int64

服务合约      4975 non-null int64

关联购买      4975 non-null int64

集团用户      4975 non-null int64

使用月数      4975 non-null int64

流失用户      4975 non-null int64

dtypes: float64(2), int64(8)

memory usage: 388.8 KB

df.columns = ['id','pack_type','extra_time','extra_flow','pack_change',

             'contract','asso_pur','group_user','use_month','loss']

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 4975 entries, 0 to 4974

Data columns (total 10 columns):

id             4975 non-null int64

pack_type      4975 non-null int64

extra_time     4975 non-null float64

extra_flow     4975 non-null float64

pack_change    4975 non-null int64

contract       4975 non-null int64

asso_pur       4975 non-null int64

group_user     4975 non-null int64

use_month      4975 non-null int64

loss           4975 non-null int64

dtypes: float64(2), int64(8)

memory usage: 388.8 KB

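id – unique identifier of the user

pack_type – price tier of the user's monthly plan: 1 = below 96 yuan, 2 = 96 to 225 yuan, 3 = above 225 yuan

extra_time – average extra call time per month during the usage period, in minutes; the user pays extra for it

extra_flow – average extra data usage per month during the usage period, in MB; the user pays extra for it

pack_change – whether the user has ever changed the plan price, 1 = yes, 0 = no

contract – whether the user has signed a service contract with China Unicom, 1 = yes, 0 = no

asso_pur – whether the user also subscribed to other services while using China Unicom mobile service: 1 = one other service, 2 = two other services, 0 = none

group_user – whether the number was opened as a group (corporate) account; compared with individual accounts, group numbers get discounts on calls within the group. 1 = yes, 0 = no

use_month – how long the user had used China Unicom service by the end of the observation period (2012.1-2014.1), in months

loss – whether the user churned within the 25-month observation period, 1 = yes, 0 = no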

Exploratory data analysis

df.describe()

	id	pack_type	extra_time	extra_flow	pack_change	contract	asso_pur	group_user	use_month	loss
count	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000	4975.000000
mean	2488.000000	1.057688	258.520030	-71.580403	0.021307	0.245226	0.047437	0.227337	14.774271	0.782714
std	1436.303125	0.258527	723.057190	275.557448	0.144419	0.430264	0.278143	0.419154	6.534273	0.412441
min	1.000000	1.000000	-2828.333333	-2189.875986	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000
25%	1244.500000	1.000000	-126.666667	-74.289824	0.000000	0.000000	0.000000	0.000000	13.000000	1.000000
50%	2488.000000	1.000000	13.500000	-59.652734	0.000000	0.000000	0.000000	0.000000	13.000000	1.000000
75%	3731.500000	1.000000	338.658333	-25.795045	0.000000	0.000000	0.000000	0.000000	19.000000	1.000000
max	4975.000000	3.000000	4314.000000	2568.704293	1.000000	1.000000	2.000000	1.000000	25.000000	1.000000

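extra_time and extra_flow take both positive and negative values: a positive value means the user went over the plan's call time or data allowance, and a negative value is the unused allowance left at the end of the month. The quartiles show that slightly more than half of the users have extra call time (the median of extra_time is 13.5), whereas only a minority exceed their data allowance (the 75th percentile of extra_flow is still negative). The descriptive statistics of the categorical variables show nothing unusual.

Pay particular attention to use_month. The observation window is 2012.1-2014.1, 25 months in total, and in this case churn is defined as:

A user with no activity (calls or data usage) for more than one month is considered churned.

In this data set almost every user with use_month below 25 has churned, so use_month largely restates the label instead of helping to predict it, and it should be left out when the features are fed to the model.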
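One way to check the claim about use_month is to cross-tabulate "used the service for the full 25 months" against `loss`. The sketch below runs on made-up rows; the `toy` frame and the `full_period` column are hypothetical.

```python
import pandas as pd

# Made-up users: those observed for the full 25 months did not churn
toy = pd.DataFrame({
    'use_month': [25, 25, 2, 13, 13, 25, 19],
    'loss':      [0,  0,  1, 1,  1,  0,  1],
})

# True when the user was active for the whole observation window
toy['full_period'] = toy['use_month'] == 25

# Rows: full_period (False/True); columns: loss (0/1)
table = pd.crosstab(toy['full_period'], toy['loss'])
print(table)
```

If nearly all of the counts fall on the diagonal from (True, 0) to (False, 1) in the real data, use_month is close to a restatement of the target.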

2. Distribution of the variables

# First, the distributions of the two continuous variables, extra_time and extra_flow:

plt.figure(figsize = (10,5))

plt.subplot(121)

df.extra_time.hist(bins = 30)

plt.subplot(122)

df.extra_flow.hist(bins = 30)

<matplotlib.axes._subplots.AxesSubplot at 0x1594afe2cf8>

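extra_time is right-skewed and extra_flow is approximately normally distributed, which is broadly consistent with the descriptive statistics.

# Next, the distributions of the categorical variables,

# shown as bar charts of the count in each category:

# plan tier, whether the plan price was ever changed, whether a service contract was signed,

# whether other services were also subscribed, whether it is a group account, and churn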

fig,axes = plt.subplots(nrows = 2,ncols = 3, figsize = (10,6))

sns.countplot(x = 'pack_type',data = df,ax=axes[0,0])

sns.countplot(x = 'pack_change',data = df,ax=axes[0,1])

sns.countplot(x = 'contract',data = df,ax=axes[0,2])

sns.countplot(x = 'asso_pur',data = df,ax=axes[1,0])

sns.countplot(x = 'group_user',data = df,ax=axes[1,1])

sns.countplot(x = 'loss',data = df,ax=axes[1,2])

<matplotlib.axes._subplots.AxesSubplot at 0x1594b1aa240>

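pack_type, pack_change and asso_pur are very unevenly distributed. For asso_pur, for example, very few users subscribed to services outside their plan, so those categories are thinly represented in the sample, which may affect the final model.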

3. Relationship between the independent variables and the dependent variable:

# Scatter plots of extra_time and extra_flow against loss, drawn on one figure

fig, axes = plt.subplots(nrows=2, figsize=(10, 6))

df.plot.scatter(x='extra_time', y='loss', ax=axes[0])

df.plot.scatter(x='extra_flow', y='loss', ax=axes[1])

<matplotlib.axes._subplots.AxesSubplot at 0x1594b23de80>

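The scatter plots suggest no obvious relationship between either variable and churn. To show the relationship more clearly, we bin extra_time and extra_flow and then draw bar charts:

# Add two binned columns

# (discretize the continuous data)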

bin1 = [-3000,-2000,-500,0,500,2000,3000,5000]

df['time_label'] = pd.cut(df.extra_time,bins = bin1)

# Look at the distribution after binning

time_amount = df.groupby('time_label').id.count().sort_values().reset_index()

time_amount

time_amount['amount_cumsum'] = time_amount.id.cumsum()

time_amount

	time_label	id	amount_cumsum
0	(-3000, -2000]	3	3
1	(-2000, -500]	15	18
2	(3000, 5000]	79	97
3	(2000, 3000]	129	226
4	(500, 2000]	755	981
5	(0, 500]	1634	2615
6	(-500, 0]	2360	4975

sns.countplot(x = 'time_label',hue = 'loss',data =df)

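# Cumulative counts of extra_time show that the range (-500, 500] holds about 80% of users (3994 of 4975), in line with the 80/20 rule

# hue splits the bars in each bin by the value of loss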

<matplotlib.axes._subplots.AxesSubplot at 0x1594b3a1d68>

bin2 = [-3000,-2000,-500,0,500,2000,3000]

df['flow_label'] = pd.cut(df.extra_flow,bins = bin2)

flow_amount = df.groupby('flow_label').id.count().sort_values().reset_index()

flow_amount['amount_cumsum'] = flow_amount.id.cumsum()

flow_amount

	flow_label	id	amount_cumsum
0	(2000, 3000]	1	1
1	(-3000, -2000]	3	4
2	(500, 2000]	79	83
3	(-2000, -500]	157	240
4	(0, 500]	827	1067
5	(-500, 0]	3908	4975

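Cumulative counts of extra_flow show that (-500, 500] holds about 95% of users and (-500, 0] alone about 79% (3908 of 4975), so only a small share of users exceed their data allowance each month.

sns.countplot(x = 'flow_label',hue = 'loss',data =df)

# Bars show the number of records in each category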

<matplotlib.axes._subplots.AxesSubplot at 0x1594b2abd30>

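The charts indicate that the more call time and data a user consumes, the lower the share of churned users. These over-quota users count as 'high-value users' in a customer segmentation and are strongly attached to the service; the operator should focus on them and take effective measures to keep them from churning.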
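Count plots show absolute numbers, so bins of very different sizes are hard to compare. Since `loss` is coded 0/1, its mean within each bin is the churn rate. The sketch below is on made-up data; the `toy` frame and its values are assumptions.

```python
import pandas as pd

# Made-up users with an extra_time value and a 0/1 churn flag
toy = pd.DataFrame({
    'extra_time': [-300, -100, -50, 100, 200, 800, 900, 2500],
    'loss':       [1,    1,    0,   1,   0,   0,   1,   0],
})

bins = [-3000, -500, 0, 500, 2000, 3000]
toy['time_label'] = pd.cut(toy['extra_time'], bins=bins)

# Mean of a 0/1 column is the churn rate; count gives the bin size.
# observed=True drops bins that contain no rows.
rate = toy.groupby('time_label', observed=True)['loss'].agg(['mean', 'count'])
print(rate)
```

On the real data, `rate['mean'].plot(kind='bar')` would show the churn rate per bin directly.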

fig,axes = plt.subplots(nrows = 2,ncols = 3, figsize = (12,8))

sns.countplot(x = 'pack_type',hue = 'loss',data =df,ax = axes[0][0])

sns.countplot(x = 'pack_change',hue = 'loss',data =df,ax = axes[0][1])

sns.countplot(x = 'contract',hue = 'loss',data =df,ax = axes[0][2])

sns.countplot(x = 'asso_pur',hue = 'loss',data =df,ax = axes[1][0])

sns.countplot(x = 'group_user',hue = 'loss',data =df,ax = axes[1][1])

<matplotlib.axes._subplots.AxesSubplot at 0x1594ad745c0>

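Preliminary conclusions:

1) The higher the plan price, the less likely the user is to churn; users on expensive plans are also more loyal.

2) Users who have changed their plan are less likely to churn.

3) Users who signed a contract churn less. A contract usually means the user will not switch operator or number for some time (for example 2 or 3 years), so contract users are relatively stable.

4) Users who subscribed to other services are too few in the sample; this is left for later study.

5) Group users churn far less than individual users.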
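Conclusions of this kind can be backed by row-normalized cross-tabulations, which give the churn share inside each category. This is a sketch with made-up values; the `toy` frame is hypothetical and only the column names follow the data set.

```python
import pandas as pd

# Made-up users: contract flag and churn flag
toy = pd.DataFrame({
    'contract': [0, 0, 0, 0, 1, 1, 1, 1],
    'loss':     [1, 1, 1, 0, 1, 0, 0, 0],
})

# normalize='index' turns each row into shares that sum to 1
share = pd.crosstab(toy['contract'], toy['loss'], normalize='index')
print(share)
```

The column for `loss == 1` is the churn rate of each group, here 0.75 without a contract and 0.25 with one.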

internal_chars = ['extra_time','extra_flow','pack_type',

                 'pack_change','contract','asso_pur','group_user','loss']

corrmat = df[internal_chars].corr()

f, ax = plt.subplots(figsize=(10, 7))

plt.xticks(rotation=0)

sns.heatmap(corrmat, square=False, linewidths=.5, annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1594b332470>

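The pairwise correlations among the independent variables are low, so collinearity does not appear to be a concern. Among the correlations with the dependent variable, contract and group_user have the largest coefficients, though neither is strong.

Modeling

Most of the independent variables are categorical, so a decision tree is a reasonable choice; decision trees are also relatively insensitive to outliers and produce results that are easy to interpret.

The dependent variable is 'loss' (whether the user churned), which is the prediction target.

The independent variables fall into three groups:

# Continuous: extra_time, extra_flow, use_month

# Binary categorical: pack_change, contract, group_user

# Multi-class categorical: pack_type, asso_pur

Based on the exploratory analysis above and on business understanding, we select these features for the model:

extra_time, extra_flow, pack_type, pack_change, asso_pur,

contract and group_user, each of which has some influence on churn.

The two continuous variables extra_time and extra_flow are converted to binary variables so that all features are on the same scale.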

# Record usage within the plan as 0 and usage above the plan as 1

df['time_tranf'] = (df['extra_time'] > 0).astype(int)

df['flow_tranf'] = (df['extra_flow'] > 0).astype(int)

df.head()

	id	pack_type	extra_time	extra_flow	contract	group_user	use_month	loss	time_label	flow_label	time_tranf
0	1	1	792.833333	-10.450067	0	0	25	0	(500, 2000]	(-500, 0]	1
1	2	1	121.666667	-21.141117	0	0	25	0	(0, 500]	(-500, 0]	1
2	3	1	-30.000000	-25.655273	0	0	2	1	(-500, 0]	(-500, 0]	0
3	4	1	241.500000	-288.341254	1	1	25	0	(0, 500]	(-500, 0]	1
4	5	1	1629.666667	-23.655505	0	1	25	0	(500, 2000]	(-500, 0]	1

x = df.loc[:,['pack_type','time_tranf','flow_tranf','pack_change','contract','asso_pur','group_user']]

x = np.array(x)

x

array([[1, 1, 0, ..., 0, 0, 0],

       [1, 1, 0, ..., 0, 0, 0],

       [1, 0, 0, ..., 0, 0, 0],

       ...,

       [1, 1, 1, ..., 1, 0, 0],

       [1, 1, 1, ..., 1, 0, 0],

       [3, 0, 0, ..., 1, 0, 1]], dtype=int64)

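# Target as a column vector; convert to NumPy first, since a Series cannot be indexed with np.newaxis in recent pandas

y = df.loss.to_numpy()

y = y[:, np.newaxis]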

y

array([[0],

       [0],

       [1],

       ...,

       [0],

       [1],

       [0]], dtype=int64)

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3,random_state=123)

from sklearn import tree

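clf = tree.DecisionTreeClassifier(criterion='gini',     # impurity measure used to evaluate splits

                                    splitter='best',      # split strategy at each node

                                    max_depth=4,          # maximum depth of the tree

                                    min_samples_split=10, # minimum samples needed to split an internal node

                                    min_samples_leaf=5    # minimum samples required in a leaf

                                    )

clf = clf.fit(x_train,y_train) # fit on the training set

Here we use scikit-learn's DecisionTreeClassifier, which implements an optimized version of CART, with the Gini impurity as the split criterion. The maximum depth of the tree is 4, an internal node needs at least 10 samples to be split, and a leaf must contain at least 5 samples.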

train_score = clf.score(x_train,y_train) # accuracy on the training set

test_score = clf.score(x_test,y_test)   # accuracy on the test set

'train_score:{0},test_score:{1}'.format(train_score,test_score)

'train_score:0.871338311315336,test_score:0.8640321500334897'

Parameter tuning

# Tune max_depth

# Train a model at a given depth and return its scores on the training set and on the held-out test set

def cv_score(d):

    clf2 = tree.DecisionTreeClassifier(max_depth=d)

    clf2 = clf2.fit(x_train,y_train)

    tr_score = clf2.score(x_train,y_train)

    te_score = clf2.score(x_test,y_test)

    return (tr_score, te_score)

# Evaluate the model over a range of depths

depths = range(2,15)

scores = [cv_score(d) for d in depths]

tr_scores = [s[0] for s in scores]

cv_scores = [s[1] for s in scores]

scores

# Index of the best score on the held-out set (used here in place of a cross-validation set)

best_score_index = np.argmax(cv_scores)

best_score = cv_scores[best_score_index]

best_param = depths[best_score_index]

best_param

# best_score

plt.figure(figsize = (4,2),dpi=150)

plt.grid()

plt.xlabel('max_depth')

plt.ylabel('best_score')

plt.plot(depths, cv_scores,'.g-',label = 'cross_validation scores')

plt.plot(depths,tr_scores,'.r--',label = 'train scores')

plt.legend()

<matplotlib.legend.Legend at 0x1594bb49518>

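The plot shows that at depth 4 the held-out score (labelled 'cross_validation scores' in the plot) is close to the training score and both are relatively high. Beyond depth 5 the gap between the two widens and the held-out score drops, which indicates overfitting.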
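The depth search above scores each model on the test split. A variant that leaves the test set untouched is k-fold cross-validation on the training data, for example with `cross_val_score`. The sketch below runs on synthetic data from `make_classification`, not on the churn data; the sample size, feature counts and `random_state` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the churn data set)
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           random_state=0)

depths = range(2, 15)
mean_scores = []
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    # 5-fold cross-validation: five accuracy values, one per fold
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores.append(scores.mean())

# Depth with the highest mean cross-validated accuracy
best_depth = depths[int(np.argmax(mean_scores))]
print(best_depth, max(mean_scores))
```

With the real data, `X` and `y` would be `x_train` and a one-dimensional version of `y_train`, and the test set would be used once at the end.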

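Model evaluation

from sklearn.metrics import classification_report

y_pre = clf.predict(x_test)

# classification_report expects (y_true, y_pred). The report printed below was produced with

# the two arguments reversed, which exchanges the precision and recall columns and makes

# 'support' count predicted rather than true labels.

print(classification_report(y_test, y_pre))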

              precision    recall  f1-score   support

           0       0.65      0.71      0.68       304

           1       0.92      0.90      0.91      1189

    accuracy                           0.86      1493

   macro avg       0.79      0.81      0.80      1493

weighted avg       0.87      0.86      0.87      1493

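Precision = TP/(TP+FP): among the users predicted to churn, the share that actually churned.

Recall = TP/(TP+FN): among the users who actually churned, the share that was predicted to churn.

The F1 score is the harmonic mean of precision and recall, a combined measure of the two.

Reading the weighted-average row of the report, precision is 0.87, recall is 0.86 and F1 is 0.87. The F1 score is 0.91 for the churn class (label 1) against 0.68 for the non-churn class, so the model is reasonably good overall and noticeably better at recognizing churned users than retained ones.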
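The three definitions can be checked by hand on a few labels. The sketch below uses made-up `y_true` and `y_pred` lists and plain Python, with label 1 standing for "churned".

```python
# Made-up true labels and predictions (1 = churned, 0 = retained)
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churned, predicted churned
fp = sum(t == 0 and p == 1 for t, p in pairs)  # retained, predicted churned
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churned, predicted retained

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fp, fn, precision, recall, f1)
```

Swapping `y_true` and `y_pred` exchanges `fp` and `fn`, and therefore precision and recall, which is why the argument order of a metric function matters.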

# import os

# os.environ["PATH"] += os.pathsep + 'D:\新建文件夹\graphviz-2.38\release\bin'  # adjust this to your own Graphviz path

# import os

#  os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

from IPython.display import Image

import pydotplus

from sklearn.tree import export_graphviz

def TreeShow(dtClass,irisDataSet):

    # irisDataSet is unused; it is kept so that the existing call still works

    dot_data = export_graphviz(dtClass, out_file=None,

                               feature_names=['pack_type','time_tranf','flow_tranf'

                               ,'pack_change','contract','asso_pur','group_user'],   # feature names, in training-column order

                               class_names=['not loss','loss'],    # class names in ascending label order: 0 = not churned, 1 = churned

                               filled=True, rounded=True,

                               special_characters=True)

    graph = pydotplus.graph_from_dot_data(dot_data)

    graph.write_pdf("tree.pdf")

    # Return the image so that the notebook displays it as the cell result

    return Image(graph.create_png())

TreeShow(clf,df)