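Kaggle Competition: Insurance Quote Conversion (Homesite)

Converting the date format
Checking the data types
Viewing detailed DataFrame information
Filling missing values
Converting categorical columns
Setting model parameters
Conclusions

This project builds a predictive model for Kaggle's Homesite competition. It uses the scikit-learn interface of XGBoost to model the data and predict whether a quote converts into a purchase.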

import pandas as pd

import numpy as np

import xgboost as xgb

from sklearn.model_selection import StratifiedKFold

from sklearn.model_selection import GridSearchCV

train = pd.read_csv('train.csv')

test = pd.read_csv('test.csv')

train.head()

	QuoteNumber	Original_Quote_Date	Field6	Field7	Field8	Field9	Field10	Field11	Field12	...	GeographicField59A	GeographicField59B	GeographicField60A	GeographicField60B	GeographicField61A	GeographicField61B	GeographicField62A	GeographicField62B	GeographicField63	GeographicField64
0	1	2013-08-16	B	23	0.9403	0.0006	965	1.0200	N	...	9	9	-1	8	-1	18	-1	10	N	CA
1	2	2014-04-22	F	7	1.0006	0.0040	548	1.2433	N	...	10	10	-1	11	-1	17	-1	20	N	NJ
2	4	2014-08-25	F	7	1.0006	0.0040	548	1.2433	N	...	15	18	-1	21	-1	11	-1	8	N	NJ
3	6	2013-04-15	J	10	0.9769	0.0004	1,165	1.2665	N	...	6	5	-1	10	-1	9	-1	21	N	TX
4	8	2014-01-25	E	23	0.9472	0.0006	1,487	1.3045	N	...	18	22	-1	10	-1	11	-1	12	N	IL

5 rows × 299 columns

train=train.drop('QuoteNumber',axis=1)

test = test.drop('QuoteNumber', axis=1)

Converting the date format

train['Date']=pd.to_datetime(train['Original_Quote_Date'])

train= train.drop('Original_Quote_Date',axis=1)

test['Date']=pd.to_datetime(test['Original_Quote_Date'])

test= test.drop('Original_Quote_Date',axis=1)

train['year']=train['Date'].dt.year

train['month']=train['Date'].dt.month

train['weekday']=train['Date'].dt.weekday

train.head()

	Field6	Field7	Field8	Field9	Field10	Field11	Field12	CoverageField1A	CoverageField1B	...	GeographicField61A	GeographicField61B	GeographicField62A	GeographicField62B	GeographicField63	GeographicField64	Date	year	month	weekday
0	B	23	0.9403	0.0006	965	1.0200	N	17	23	...	-1	18	-1	10	N	CA	2013-08-16	2013	8	4
1	F	7	1.0006	0.0040	548	1.2433	N	6	8	...	-1	17	-1	20	N	NJ	2014-04-22	2014	4	1
2	F	7	1.0006	0.0040	548	1.2433	N	7	12	...	-1	11	-1	8	N	NJ	2014-08-25	2014	8	0
3	J	10	0.9769	0.0004	1,165	1.2665	N	3	2	...	-1	9	-1	21	N	TX	2013-04-15	2013	4	0
4	E	23	0.9472	0.0006	1,487	1.3045	N	8	13	...	-1	11	-1	12	N	IL	2014-01-25	2014	1	5

5 rows × 301 columns

test['year']=test['Date'].dt.year

test['month']=test['Date'].dt.month

test['weekday']=test['Date'].dt.weekday

train = train.drop('Date', axis=1)

test = test.drop('Date', axis=1)
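
The date steps above are repeated once for train and once for test. As a minimal sketch, they could be wrapped in one helper so both frames go through identical code. The function name `add_date_parts` and the tiny `demo` frame are assumptions made for illustration; the two dates are taken from the `train.head()` output above.

```python
import pandas as pd


def add_date_parts(df, source_col="Original_Quote_Date"):
    """Return a copy of df where source_col is replaced by year/month/weekday."""
    out = df.copy()
    dates = pd.to_datetime(out[source_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
    return out.drop(columns=[source_col])


# Tiny stand-in frame with two dates copied from the train.head() output.
demo = pd.DataFrame({"Original_Quote_Date": ["2013-08-16", "2014-04-22"]})
demo_parts = add_date_parts(demo)
print(demo_parts)
```

With such a helper, `train = add_date_parts(train)` and `test = add_date_parts(test)` would replace the separate statements, which makes it harder for the two frames to drift apart.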

Checking the data types

train.dtypes

QuoteConversion_Flag      int64

Field6                   object

Field7                    int64

Field8                  float64

Field9                  float64

Field10                  object

Field11                 float64

Field12                  object

CoverageField1A           int64

CoverageField1B           int64

CoverageField2A           int64

CoverageField2B           int64

CoverageField3A           int64

CoverageField3B           int64

CoverageField4A           int64

CoverageField4B           int64

CoverageField5A           int64

CoverageField5B           int64

CoverageField6A           int64

CoverageField6B           int64

CoverageField8           object

CoverageField9           object

CoverageField11A          int64

CoverageField11B          int64

SalesField1A              int64

SalesField1B              int64

SalesField2A              int64

SalesField2B              int64

SalesField3               int64

SalesField4               int64

                         ...

GeographicField50B        int64

GeographicField51A        int64

GeographicField51B        int64

GeographicField52A        int64

GeographicField52B        int64

GeographicField53A        int64

GeographicField53B        int64

GeographicField54A        int64

GeographicField54B        int64

GeographicField55A        int64

GeographicField55B        int64

GeographicField56A        int64

GeographicField56B        int64

GeographicField57A        int64

GeographicField57B        int64

GeographicField58A        int64

GeographicField58B        int64

GeographicField59A        int64

GeographicField59B        int64

GeographicField60A        int64

GeographicField60B        int64

GeographicField61A        int64

GeographicField61B        int64

GeographicField62A        int64

GeographicField62B        int64

GeographicField63        object

GeographicField64        object

year                      int64

month                     int64

weekday                   int64

Length: 300, dtype: object

Viewing detailed DataFrame information

train.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 260753 entries, 0 to 260752

Columns: 300 entries, QuoteConversion_Flag to weekday

dtypes: float64(6), int64(267), object(27)

memory usage: 596.8+ MB
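
Field10 is reported as object even though it looks numeric. The `train.head()` output shows values such as 1,165, so the thousands separator is the likely reason. The sketch below shows one way to list the non-numeric columns and to parse such strings into integers. The miniature `demo` frame is an assumption built from values in that output, and this conversion is an alternative to the label encoding used later, not a step of the original pipeline.

```python
import pandas as pd

# Hypothetical miniature of the training frame; values copied from train.head().
demo = pd.DataFrame({
    "Field6": ["B", "F", "J"],
    "Field7": [23, 7, 10],
    "Field10": ["965", "548", "1,165"],
})

# Columns that pandas did not parse as numbers.
non_numeric_cols = [c for c in demo.columns
                    if not pd.api.types.is_numeric_dtype(demo[c])]
print(non_numeric_cols)

# Strip the thousands separator, then convert the strings to integers.
field10_numeric = pd.to_numeric(demo["Field10"].str.replace(",", "", regex=False))
print(field10_numeric.tolist())
```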

Filling missing values

train = train.fillna(-999)

test = test.fillna(-999)
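
Before filling, it can help to see how much is actually missing. The following is a small sketch on a made-up frame (the column names `feature_a` and `feature_b` are hypothetical) that counts missing values per column and then applies the same -999 sentinel as above. The parameter grid further down passes `missing` as -999, so XGBoost is told to treat that sentinel as a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with gaps; the names are made up for illustration.
demo = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0],
    "feature_b": [np.nan, np.nan, 7.0],
})

# Number of missing values in each column before any filling.
missing_per_column = demo.isna().sum().to_dict()
print(missing_per_column)

# Same sentinel as in the walkthrough; pandas no longer sees any gaps afterwards.
filled = demo.fillna(-999)
print(int(filled.isna().sum().sum()))
```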

Converting categorical columns

from sklearn import preprocessing

features = list(train.columns[1:])

for i in features:

    if train[i].dtype=='object':

        le=preprocessing.LabelEncoder()

        le.fit(list(train[i].values)+list(test[i].values))

        train[i] = le.transform(list(train[i].values))

        test[i] = le.transform(list(test[i].values))
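
The loop above fits each encoder on the train and test values together. A minimal sketch of why that matters: an encoder fitted on train alone raises a `ValueError` as soon as test contains a label it has never seen. The short state lists are made-up stand-ins for a column such as GeographicField64.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for one categorical column in train and test.
train_states = ["CA", "NJ", "TX"]
test_states = ["NJ", "IL"]  # "IL" never appears in train_states

# Fitting on train only: transforming test fails on the unseen label.
le_train_only = LabelEncoder().fit(train_states)
try:
    le_train_only.transform(test_states)
    unseen_error = False
except ValueError:
    unseen_error = True
print(unseen_error)

# Fitting on the union, as in the loop above, covers every label.
le_union = LabelEncoder().fit(train_states + test_states)
encoded_test = le_union.transform(test_states).tolist()
print(list(le_union.classes_), encoded_test)
```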

Setting model parameters

# Brute-force scan over all parameters; some rules of thumb:

# max_depth is usually 6, 7 or 8

# learning_rate is around 0.05, but small changes may make a big difference

# tuning min_child_weight, subsample and colsample_bytree

# helps fight overfitting

# n_estimators is the number of boosting rounds

# finally, ensembling XGBoost models with multiple seeds may reduce variance

xgb_model = xgb.XGBClassifier()

parameters = {'nthread':[4], # with hyperthreading enabled, XGBoost may become slower

              'objective':['binary:logistic'],

              'learning_rate': [0.05,0.1], #so called `eta` value

              'max_depth': [6],

              'min_child_weight': [11],

              'silent': [1],

              'subsample': [0.8],

              'colsample_bytree': [0.7],

              'n_estimators': [5], #number of trees, change it to 1000 for better results

              'missing':[-999],

              'seed': [1337]}

sfolder = StratifiedKFold(n_splits=5,random_state=42,shuffle=True)

clf= GridSearchCV(xgb_model,parameters,n_jobs=4,cv=sfolder.split(train[features], train["QuoteConversion_Flag"]),scoring='roc_auc',

                   verbose=2, refit=True,return_train_score=True)

clf.fit(train[features], train["QuoteConversion_Flag"])

Fitting 5 folds for each of 2 candidates, totalling 10 fits

[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:  2.4min finished

GridSearchCV(cv=<generator object _BaseKFold.split at 0x0000000018459888>,

       error_score='raise',

       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,

       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,

       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,

       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,

       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,

       silent=True, subsample=1),

       fit_params=None, iid=True, n_jobs=4,

       param_grid={'nthread': [4], 'objective': ['binary:logistic'], 'learning_rate': [0.05, 0.1], 'max_depth': [6], 'min_child_weight': [11], 'silent': [1], 'subsample': [0.8], 'colsample_bytree': [0.7], 'n_estimators': [5], 'missing': [-999], 'seed': [1337]},

       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,

       scoring='roc_auc', verbose=2)
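
The grid search above relies on `StratifiedKFold` so that every fold keeps roughly the same share of converted quotes. The sketch below illustrates that property on synthetic labels; the 20% positive rate is an assumption chosen for the example, not a figure from the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels: 20 positives and 80 negatives (illustrative only).
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # feature values do not influence how folds are drawn

sfolder = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive rate inside each validation fold.
fold_positive_rates = [float(y[val_idx].mean()) for _, val_idx in sfolder.split(X, y)]
print(fold_positive_rates)
```

`GridSearchCV` also accepts the splitter object itself, so `cv=sfolder` could be passed instead of the generator returned by `sfolder.split(...)`; a generator can only be iterated once, while the splitter can be reused.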

clf.grid_scores_

c:\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

  DeprecationWarning)

[mean: 0.94416, std: 0.00118, params: {'colsample_bytree': 0.7, 'learning_rate': 0.05, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8},

 mean: 0.94589, std: 0.00120, params: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_weight': 11, 'missing': -999, 'n_estimators': 5, 'nthread': 4, 'objective': 'binary:logistic', 'seed': 1337, 'silent': 1, 'subsample': 0.8}]

pd.DataFrame(clf.cv_results_['params'])

	colsample_bytree	learning_rate	max_depth	min_child_weight	missing	n_estimators	nthread	objective	seed	silent	subsample
0	0.7	0.05	6	11	-999	5	4	binary:logistic	1337	1	0.8
1	0.7	0.10	6	11	-999	5	4	binary:logistic	1337	1	0.8

best_parameters, score, _ = max(clf.grid_scores_, key=lambda x: x[1])

print('Raw AUC score:', score)

for param_name in sorted(best_parameters.keys()):

    print("%s: %r" % (param_name, best_parameters[param_name]))

Raw AUC score: 0.9458947562485674

colsample_bytree: 0.7

learning_rate: 0.1

max_depth: 6

min_child_weight: 11

missing: -999

n_estimators: 5

nthread: 4

objective: 'binary:logistic'

seed: 1337

silent: 1

subsample: 0.8

c:\anaconda3\envs\nlp\lib\site-packages\sklearn\model_selection\_search.py:761: DeprecationWarning: The grid_scores_ attribute was deprecated in version 0.18 in favor of the more elaborate cv_results_ attribute. The grid_scores_ attribute will not be available from 0.20

  DeprecationWarning)
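
As the warning says, `grid_scores_` is unavailable from scikit-learn 0.20 onward. Below is a hedged sketch of how the same information can be read from `cv_results_`, `best_params_` and `best_score_`. To keep it self-contained, it uses synthetic data and `LogisticRegression` as a stand-in estimator rather than the XGBoost model from this walkthrough.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary data; LogisticRegression stands in for XGBClassifier here.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]},
                      cv=cv, scoring="roc_auc")
search.fit(X, y)

# Counterpart of iterating over grid_scores_: one entry per candidate.
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, params))

# Counterpart of max(grid_scores_, key=...): the best candidate is exposed directly.
best_parameters, score = search.best_params_, search.best_score_
print("Raw AUC score:", score)
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
```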

test_probs = clf.predict_proba(test[features])[:,1]

sample = pd.read_csv('sample_submission.csv')

sample.QuoteConversion_Flag = test_probs

sample.to_csv("xgboost_best_parameter_submission.csv", index=False)

clf.best_estimator_.predict_proba(test[features])

array([[0.6988076 , 0.3011924 ],

       [0.6787684 , 0.3212316 ],

       [0.6797658 , 0.32023418],

       ...,

       [0.5018287 , 0.4981713 ],

       [0.6988076 , 0.3011924 ],

       [0.62464744, 0.37535256]], dtype=float32)

The 0.5 cutoff used below can be adjusted to suit the actual project.

keras_result=pd.read_csv('keras_nn_test.csv')

result1=[1 if i>0.5 else 0 for i in keras_result['QuoteConversion_Flag']]

xgb_result=pd.read_csv('xgboost_best_parameter_submission.csv')

result2=[1 if i>0.5 else 0 for i in xgb_result['QuoteConversion_Flag']]
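
Since the cutoff is a project-specific choice, it may be convenient to make it a parameter. This is a minimal sketch; the helper name `apply_cutoff` is hypothetical, and the probabilities are the positive-class values visible in the `predict_proba` output above.

```python
import numpy as np


def apply_cutoff(probabilities, cutoff=0.5):
    """Label a probability 1 when it is strictly above cutoff, else 0."""
    return (np.asarray(probabilities) > cutoff).astype(int).tolist()


# Positive-class probabilities copied from the predict_proba output above.
probs = [0.3011924, 0.3212316, 0.32023418, 0.4981713, 0.3011924, 0.37535256]

labels_default = apply_cutoff(probs)              # cutoff 0.5
labels_lower = apply_cutoff(probs, cutoff=0.35)   # a lower, purely illustrative cutoff
print(labels_default)
print(labels_lower)
```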

from sklearn import metrics

metrics.accuracy_score(result1,result2)

0.8566004740099864

metrics.confusion_matrix(result1,result2)

array([[148836,  24862],

       [    66,     72]], dtype=int64)
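
The two numbers above are tied together: the score returned by `accuracy_score` is the diagonal of the confusion matrix divided by its total. Because the first argument holds the Keras labels and the second the XGBoost labels, rows correspond to Keras and columns to XGBoost, so the score measures how often the two models agree. A short check of the arithmetic, using the matrix copied from the output:

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: Keras labels (first argument); columns: XGBoost labels (second argument).
cm = np.array([[148836, 24862],
               [66, 72]])

total = int(cm.sum())
agreement = float(np.trace(cm)) / total   # share of quotes both models label identically
xgb_only_positive = int(cm[0, 1])         # Keras says 0, XGBoost says 1
keras_only_positive = int(cm[1, 0])       # Keras says 1, XGBoost says 0
print(total, agreement, xgb_only_positive, keras_only_positive)
```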

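Conclusions

The date column was preprocessed into year, month and weekday features.
The categorical columns were label-encoded. I think this deserves a second look: my feeling is that one-hot encoding, rather than LabelEncoder, should be used for the categorical columns (though I am not sure).
Label encoding is useful in some situations, but it comes with many restrictions. As an example, take [dog, cat, dog, mouse, cat] encoded as [1, 2, 1, 3, 2]. This produces an odd artifact: the average of dog and mouse is cat. For this reason I have not yet seen label encoding used widely.
The resulting model was applied to the test set. The raw cross-validated AUC is about 0.946, yet the accuracy computed above is only about 85%. Note that this accuracy compares the XGBoost labels with the Keras network's labels rather than with true labels, so strictly it is the agreement between the two models. Taking the Keras labels as the reference, far too many 0s are predicted as 1 (24,862 quotes), so the false-positive count is too high at the 0.5 cutoff and the model has little practical classification value as it stands; the real conversion rate would not be that high either.
Many of the model's tunable parameters were in fact left untouched. Readers interested in parameter tuning can look at the example in Meituan's text classification project.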
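
To make the dog/cat/mouse point concrete, here is a small sketch contrasting the two encodings with pandas. The integer mapping follows the example in the conclusions; `pd.get_dummies` is one common way to one-hot encode and is offered as an illustration, not as the method used in this project.

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "dog", "mouse", "cat"], name="animal")

# Label encoding as in the example: dog=1, cat=2, mouse=3 implies an artificial order.
label_map = {"dog": 1, "cat": 2, "mouse": 3}
label_encoded = animals.map(label_map)
print(label_encoded.tolist())

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(animals, prefix="animal", dtype=int)
print(one_hot)
```

The trade-off is width: every distinct category becomes its own column, which can add many columns when a dataset has a lot of categorical fields.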
