1. Solution Overview

In recent years, internet finance has become a major trend in the financial industry. Whether the business is investment or lending, risk control is always its core foundation. Consumer finance in particular serves small loan amounts, a very large customer base and short terms, which is why it is widely regarded as the riskiest segment.

Take lending as an example: whereas traditional finance relies mainly on the rather narrow channel of asset documents supplied by the user, internet finance can combine a user's offline asset profile with their online consumption behaviour and analyse them together, giving the user a better service experience and giving the lender a more complete picture for evaluation.

As artificial intelligence and big data continue to penetrate the industry, using fintech to actively collect, analyse and organise financial data and to provide more precise risk control for specific customer segments has become an effective way to tackle consumer-finance risk. In short, identifying users who are likely to default is the key to more precise risk control.

For this competition task, default-user risk prediction in big-data finance, the solution in this article consists of the following steps:

  • 1. Preprocess the users' historical behaviour data;
  • 2. Split the historical data into a training set and a validation set;
  • 3. Perform feature engineering on the historical data;
  • 4. Apply feature selection to the constructed sample set;
  • 5. Build several machine-learning models and ensemble them;
  • 6. Use the resulting model to predict, from a user's historical behaviour, whether the user will be overdue on repayment within the next month.

Figure 1 shows the flowchart of the default-user risk prediction solution based on big-data finance.

Figure 1: Flowchart of the default-user risk prediction solution

2. Data Exploration

2.1 Data Preprocessing

1. Outlier handling: the data contains unknown abnormal values. Simply filtering them out would reduce the number of training samples, so instead the abnormal values are filled with -1 (or with another value clearly distinguishable from the feature's normal range);

2. Multi-dimensional handling of missing values: in credit scoring, how complete a user's information is can affect the user's credit rating; a user whose profile is 100% complete is more likely to pass review and obtain a loan than one whose profile is 50% complete. Starting from this observation, missing values are analysed and handled along several dimensions. For each column (attribute) the number of missing values is counted and turned into a missing rate, where x_i is the number of missing values in a given attribute column, Count is the total number of samples, and MissRate_i is that column's missing rate: MissRate_i = x_i / Count;

3. Other cleaning: whitespace stripping, since some attribute values contain stray spaces, e.g. "货到付款" and "货到付款 " are obviously the same value, so the spaces are removed; city-name normalisation, since values such as "重庆" and "重庆市" refer to the same city, so the character "市" is stripped from the strings. After removing this kind of redundancy the number of distinct cities drops considerably (a short code sketch of steps 2 and 3 follows this list).
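
A minimal sketch of steps 2 and 3 above using pandas; the frame and column names are illustrative stand-ins for the raw competition tables:

import numpy as np
import pandas as pd

# toy frame standing in for one of the raw tables (values are illustrative)
df = pd.DataFrame({'credit_score': [605, np.nan, 550],
                   'type_pay': ['货到付款 ', '货到付款', np.nan],
                   'city': ['重庆市', '重庆', '北京']})

# per-column missing rate: MissRate_i = x_i / Count
miss_rate = df.isnull().sum() / len(df)

# keep every sample: fill missing / abnormal values with -1 instead of dropping rows
df = df.fillna(-1)

# strip stray whitespace and drop the redundant '市' suffix from city names
df['type_pay'] = df['type_pay'].astype(str).str.strip()
df['city'] = df['city'].str.replace('市', '', regex=False)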

2.2 Temporal Patterns

Based on the users' historical data, the numbers of defaulting and non-defaulting users are counted against the time axis and visualised as shown below:

Figure 2: Default and non-default counts over time

It can be seen that default behaviour follows a clear periodic pattern over time, and that the volume in 2017 is much higher than in 2016; the solution therefore includes many time-related features.

2.3 Splitting the Training and Validation Sets

Default-risk prediction is a long-term, cumulative process, so the conventional sliding-window split, in which the training and test sets cover matching time windows, is not the best choice here. Instead, all historical user data is used for training so that user behaviour patterns are learned as fully as possible, and the validation set is built by cross-validation, as illustrated below:

Figure 3: Cross-validation scheme
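
The write-up does not say how many folds are used; a minimal sketch with scikit-learn and 5 folds (the stacking script later also stacks with k=5) could look like this:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(1000, 20)         # stand-in feature matrix
y = np.random.randint(0, 2, 1000)    # stand-in default labels

# each fold trains on the remaining data and validates on the held-out part
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    print('fold', fold, 'train size', len(train_idx), 'valid size', len(valid_idx))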

3. Feature Engineering

3.1 0-1 Features

These are extracted mainly from the auth, credit and user tables, whose id columns contain no duplicates (a code sketch follows the list below).

  • Mark whether Id_card, auth_time and phone in the auth table are null;
  • Mark whether credit_score, overdraft and quota in the credit table are null;
  • Mark whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound and account_grade in the user table are null;
  • Mark whether Id_card, auth_time and phone in the auth table are normal (not null);
  • Mark whether credit_score, overdraft and quota in the credit table are normal (not null);
  • Mark whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound and account_grade in the user table are normal (not null).
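
A small sketch of these 0-1 markers; the toy auth rows below are illustrative, and notnull() is used here in place of the str(x) == 'nan' check in the scripts:

import numpy as np
import pandas as pd

# illustrative slice of the auth table
train_auth = pd.DataFrame({'id': [1, 2, 3],
                           'phone': ['138xxxx0000', np.nan, '139xxxx1111'],
                           'auth_time': [np.nan, '2016-05-01', '2017-01-01']})

flags = pd.DataFrame({'id': train_auth['id']})
for col in ['phone', 'auth_time']:
    # 1 if the attribute is present ("normal"), 0 if it is null
    flags['auth_' + col + '_not_null'] = train_auth[col].notnull().astype(int)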

3.2 Information-Completeness Features

These are also extracted from the auth, credit and user tables: for each sample in each of the three tables, the information completeness is defined as the number of non-null attributes divided by the total number of attributes.
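
A minimal sketch of the completeness ratio (the toy user rows and values are illustrative):

import numpy as np
import pandas as pd

# illustrative slice of the user table
user = pd.DataFrame({'sex': ['male', np.nan],
                     'income': [np.nan, 'band_3'],
                     'degree': ['bachelor', np.nan]})

# completeness = number of non-null attributes / total number of attributes, per sample
n_attrs = user.shape[1]
user['completeness'] = user.notnull().sum(axis=1) / n_attrs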

3.3 One-Hot Features

Extracted mainly from the user table.

The sex, merriage, income, degree, qq_bound, wechat_bound and account_grade attributes of the user table are one-hot encoded.
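
The encoding can be sketched with pandas get_dummies, which is also what the training scripts use; the cell values below are placeholders:

import pandas as pd

dummy_fea = ['sex', 'merriage', 'income', 'degree', 'qq_bound', 'wechat_bound', 'account_grade']
# placeholder rows; the real values come from train_user_info.csv
user = pd.DataFrame([['a1', 'b1', 'c1', 'd1', 'e1', 'f1', 'g1'],
                     ['a2', 'b2', 'c2', 'd2', 'e2', 'f2', 'g2']], columns=dummy_fea)
# one 0/1 column per (attribute, value) pair
one_hot = pd.get_dummies(user[dummy_fea])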

3.4 Business Features

These features are derived from business logic and turned out to be the most effective ones; they are extracted mainly from the credit, auth, bankcard and order tables (a sketch of several of them follows the list below).

  • (1) Difference between the loan application time (appl_sbm_tm) and the authentication time (auth_time)
  • (2) Difference between the loan application time (appl_sbm_tm) and the birthday (birthday)
  • (3) Reversed credit score (credit_score)
  • (4) Unused credit (quota minus overdraft)
  • (5) Credit usage ratio (overdraft divided by quota)
  • (6) Whether the amount used exceeds the credit limit (overdraft greater than quota)
  • (7) Number of bank cards (bankname)
  • (8) Number of distinct banks (bankname)
  • (9) Number of distinct card types (card_type)
  • (10) Number of distinct reserved phone numbers (phone)
  • (11) From the order table: the count of amt_order records, the counts of type_pay = 在线支付 (online payment) and type_pay = 货到付款 (cash on delivery), and the count of sts_order = 已完成 (completed) orders; the order table is de-duplicated by id, keeping the first record for each duplicated id.
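
A sketch of several of these business features in the spirit of the training script; the single-row frame is illustrative, and over_quota is a hypothetical name for feature (6), which the published script does not compute explicitly:

import pandas as pd

# tiny illustrative frame holding the merged columns the features need
train_data = pd.DataFrame({
    'appl_sbm_tm': pd.to_datetime(['2017-03-01 10:00:00']),
    'auth_time': pd.to_datetime(['2016-12-01']),
    'birthday': pd.to_datetime(['1990-06-15']),
    'quota': [10000.0], 'overdraft': [2500.0], 'credit_score': [580.0]})

# (1)(2) time differences between application time and auth time / birthday
train_data['diff_day'] = (train_data['appl_sbm_tm'] - train_data['auth_time']).dt.days
train_data['how_old'] = (train_data['appl_sbm_tm'] - train_data['birthday']).dt.days / 365
# (3) reversed credit score (605 is the constant used in the scripts)
train_data['credit_score_inverse'] = 605 - train_data['credit_score']
# (4)(5)(6) unused credit, usage ratio (+0.01 avoids division by zero), over-limit flag
train_data['can_use'] = train_data['quota'] - train_data['overdraft']
train_data['quota_use_ratio'] = train_data['overdraft'] / (train_data['quota'] + 0.01)
train_data['over_quota'] = (train_data['overdraft'] > train_data['quota']).astype(int)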

4. Feature Selection

The feature-engineering stage produces a series of basic, time-related, business, combined and discretised features that add up to several hundred dimensions. High-dimensional features can lead to the curse of dimensionality on the one hand and make the model prone to overfitting on the other, so feature selection is used to reduce the dimensionality. Model-based feature ranking is particularly efficient because model training and feature selection happen in the same pass, so this is the approach taken here: xgboost is used for feature selection, and once the model is trained it outputs feature importances (see Figure 4), from which the top n features are kept.
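
A minimal sketch of ranking features by xgboost importance and keeping the top n; the random matrix stands in for the engineered feature table, and n = 20 is an arbitrary illustrative cut-off:

import numpy as np
import pandas as pd
import xgboost as xgb

X = pd.DataFrame(np.random.rand(500, 30), columns=['f%d' % i for i in range(30)])
y = np.random.randint(0, 2, 500)

model = xgb.train({'objective': 'binary:logistic', 'eval_metric': 'auc', 'eta': 0.1},
                  xgb.DMatrix(X, label=y), num_boost_round=100)

# rank features by importance and keep the top n
scores = pd.Series(model.get_score(importance_type='weight')).sort_values(ascending=False)
top_n = scores.head(20).index.tolist()
# an importance chart like Figure 4 can be drawn with xgb.plot_importance(model, max_num_features=20)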

5. Model Training

Four xgb models are trained in total, perturbed in their parameters and in their features; each single model is tuned through parameter search and feature selection so that it is as strong as possible on its own, and the four models are then blended with different weights to produce the final result.
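
A sketch of the blending step, assuming four prediction vectors from the individual xgb models; each vector is min-max normalised first (as described in section 7.2), and the weights are illustrative, not the ones used in the competition:

import numpy as np

pred_list = [np.random.rand(1000) for _ in range(4)]   # stand-ins for the four models' scores

def min_max(p):
    # AUC depends only on ranking, so normalising makes the scores comparable before blending
    return (p - p.min()) / (p.max() - p.min())

weights = [0.4, 0.3, 0.2, 0.1]                         # illustrative weights
final_score = sum(w * min_max(p) for w, p in zip(weights, pred_list))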

6. Important Features

The feature importances output by the XGBOOST model are sorted in descending order and the top 20 are visualised below:

Figure 4: Feature importance ranking

The top 20 important features selected by the model are listed in the table below:

7. Innovations

7.1 Features

Many attributes in the raw data are quite messy; attributes such as dates were cleaned to make feature extraction easier. Information-completeness features were added, which makes good use of samples that contain null values. For the order table, whose ids contain duplicates, de-duplication by time and de-duplication keeping the first or the last record were all tried after feature extraction; keeping the first record worked best and makes good use of the order information. Features were then filtered by importance ranking, which also showed that the business-related features are the most important ones.
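
The de-duplication mentioned above can be sketched in one line; the toy rows are illustrative:

import pandas as pd

train_order = pd.DataFrame({'id': [1, 1, 2], 'amt_order': [10.0, 25.0, 8.0]})
# keep the first record for each duplicated id
order_dedup = train_order.drop_duplicates(subset=['id'], keep='first')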

7.2 Model

The innovation on the model side lies mainly in model fusion. The evaluation metric is AUC, which depends only on the ranking of the predictions, so each model's scores are normalised before the weighted fusion; this works very well.

8. Reflections on the Task

Data cleaning matters a great deal: attributes such as time are very messy and laborious to handle, and only a simple treatment was applied here; more careful processing should give better results. Some attributes, such as hobby, were too complex to use, yet they surely contain a lot of valuable information; unfortunately no good way of handling them was found.

04094.py

# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import sys
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import xgboost as xgb
import re
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc


def xgb_feature(train_set_x, train_set_y, test_set_x, test_set_y):
    # model parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.02,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'silent': 1
              }
    dtrain = xgb.DMatrix(train_set_x, label=train_set_y)
    dvali = xgb.DMatrix(test_set_x)
    model = xgb.train(params, dtrain, num_boost_round=800)
    predict = model.predict(dvali)
    return predict, model

if __name__ == '__main__':
    IS_OFFLine = False
    """read auth data"""
    train_auth = pd.read_csv('../AI_risk_train_V3.0/train_auth_info.csv', parse_dates=['auth_time'])
    # flag id_card: 0 if missing, 1 otherwise
    auth_idcard = train_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    auth_idcard_df = pd.DataFrame()
    auth_idcard_df['id'] = train_auth['id']
    auth_idcard_df['auth_idcard_df'] = auth_idcard
    # flag phone: 0 if missing, 1 otherwise
    auth_phone = train_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    auth_phone_df = pd.DataFrame()
    auth_phone_df['id'] = train_auth['id']
    auth_phone_df['auth_phone_df'] = auth_phone
    """read bankcard data"""
    train_bankcard = pd.read_csv('../AI_risk_train_V3.0/train_bankcard_info.csv')
    # number of bank-card records per user (bank_name)
    train_bankcard_bank_count = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bankcard_count': lambda x: len(x)})
    # number of distinct card types (card_type)
    train_bankcard_card_count = train_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
        {'card_type_count': lambda x: len(set(x))})
    # number of distinct reserved phone numbers (phone)
    train_bankcard_phone_count = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'phone_count': lambda x: len(set(x))})
    """read credit data"""
    train_credit = pd.read_csv('../AI_risk_train_V3.0/train_credit_info.csv')
    # reversed credit score
    train_credit['credit_score_inverse'] = train_credit['credit_score'].map(lambda x: 605 - x)
    # unused credit = quota - overdraft
    train_credit['can_use'] = train_credit['quota'] - train_credit['overdraft']
    """read order data"""
    train_order = pd.read_csv('../AI_risk_train_V3.0/train_order_info.csv', parse_dates=['time_order'])
    # amt_order: 'NA' or 'null' becomes NaN, everything else is cast to float
    train_order['amt_order'] = train_order['amt_order'].map(
        lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
    # time_order: '0', 'NA' or 'nan' becomes NaT, otherwise parse a datetime string or a unix timestamp
    train_order['time_order'] = train_order['time_order'].map(
        lambda x: pd.NaT if (str(x) == '0' or x == 'NA' or x == 'nan') else (
            datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x) else (
                datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
    train_order_time_max = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'train_order_time_max': lambda x: max(x)})
    train_order_time_min = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'train_order_time_min': lambda x: min(x)})
    train_order_type_zaixian = train_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == ' 在线支付').values].count()).reset_index(name='type_pay_zaixian')
    train_order_type_huodao = train_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao')
    """read receive-address data"""
    train_recieve = pd.read_csv('../AI_risk_train_V3.0/train_recieve_addr_info.csv')
    # keep only the first two characters of region
    train_recieve['region'] = train_recieve['region'].map(lambda x: str(x)[:2])
    tmp_tmp_recieve = pd.crosstab(train_recieve.id, train_recieve.region)
    tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
    # number of fix_phone records per id
    tmp_tmp_recieve_phone_count = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
    tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
    # number of distinct fix_phone values per id
    tmp_tmp_recieve_phone_count_unique = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
    tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index()
    """read target data"""
    train_target = pd.read_csv('../AI_risk_train_V3.0/train_target.csv', parse_dates=['appl_sbm_tm'])
    """read user data"""
    train_user = pd.read_csv('../AI_risk_train_V3.0/train_user_info.csv')
    # flag hobby: 0 if missing, 1 otherwise
    is_hobby = train_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    is_hobby_df = pd.DataFrame()
    is_hobby_df['id'] = train_user['id']
    is_hobby_df['is_hobby'] = is_hobby
    # flag id_card: 0 if missing, 1 otherwise
    is_idcard = train_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    is_idcard_df = pd.DataFrame()
    is_idcard_df['id'] = train_user['id']
    is_idcard_df['is_idcard'] = is_idcard
    # 0/1 flags for the malformed birthday patterns in the user data
    tmp_tmp = train_user[['id', 'birthday']]
    tmp_tmp = tmp_tmp.set_index(['id'])
    is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
    is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
    is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
    is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
    is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
    is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
    # is_nan = tmp_tmp['birthday'].map(lambda x: (str(x) == 'nan') * 1).reset_index(name='is_nan')
    # normalise birthday: keep well-formed 19xx-mm-dd values as datetimes, everything else becomes NaT
    train_user['birthday'] = train_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
        re.match(r'19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.NaT)
    # merge the base tables
    train_data = pd.merge(train_target, train_auth, on=['id'], how='left')
    train_data = pd.merge(train_data, train_user, on=['id'], how='left')
    train_data = pd.merge(train_data, train_credit, on=['id'], how='left')
    train_data['hour'] = train_data['appl_sbm_tm'].map(lambda x: x.hour)
    train_data['month'] = train_data['appl_sbm_tm'].map(lambda x: x.month)
    train_data['year'] = train_data['appl_sbm_tm'].map(lambda x: x.year)
    train_data['quota_use_ratio'] = train_data['overdraft'] / (train_data['quota'] + 0.01)
    train_data['nan_num'] = train_data.isnull().sum(axis=1)
    train_data['diff_day'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
    train_data['how_old'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)
    # whether the id_card from the auth table and the id_card from the user table agree
    auth_idcard = list(train_data['id_card_x'])
    user_idcard = list(train_data['id_card_y'])
    idcard_result = []
    for indexx, uu in enumerate(auth_idcard):
        if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(0)
        elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(1)
        elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
            idcard_result.append(2)
        else:
            ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
            ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
            if ttt1 == ttt2:
                idcard_result.append(3)
            if ttt1 != ttt2:
                idcard_result.append(4)
    train_data['the_same_id'] = idcard_result
    train_bankcard_phone_list = train_bankcard.groupby(by=['id'])['phone'].apply(
        lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
    # whether the auth phone appears among the bank-card reserved phones
    train_data = pd.merge(train_data, train_bankcard_phone_list, on=['id'], how='left')
    train_data['exist_phone'] = train_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
    train_data['exist_phone'] = train_data['exist_phone'] * 1
    train_data = train_data.drop(['bank_phone_list'], axis=1)
    """bank_info"""
    bank_name = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bank_name_len': lambda x: len(set(x))})
    bank_num = train_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg(
        {'tail_num_len': lambda x: len(set(x))})
    bank_phone_num = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'bank_phone_num': lambda x: x.nunique()})
    train_data = pd.merge(train_data, bank_name, on=['id'], how='left')
    train_data = pd.merge(train_data, bank_num, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_time_max, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_time_min, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_type_zaixian, on=['id'], how='left')
    train_data = pd.merge(train_data, train_order_type_huodao, on=['id'], how='left')
    train_data = pd.merge(train_data, is_double_, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0_0_0, on=['id'], how='left')
    train_data = pd.merge(train_data, is_1_1_1, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0000_00_00, on=['id'], how='left')
    train_data = pd.merge(train_data, is_0001_1_1, on=['id'], how='left')
    train_data = pd.merge(train_data, is_hou_in, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
    train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
    train_data = pd.merge(train_data, bank_phone_num, on=['id'], how='left')
    train_data = pd.merge(train_data, is_hobby_df, on=['id'], how='left')
    train_data = pd.merge(train_data, is_idcard_df, on=['id'], how='left')
    train_data = pd.merge(train_data, auth_idcard_df, on=['id'], how='left')
    train_data = pd.merge(train_data, auth_phone_df, on=['id'], how='left')
    # days between application time and the latest / earliest order
    train_data['day_order_max'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_max']).days,
                                                   axis=1)
    train_data = train_data.drop(['train_order_time_max'], axis=1)
    train_data['day_order_min'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_min']).days,
                                                   axis=1)
    train_data = train_data.drop(['train_order_time_min'], axis=1)
    """order_info"""
    order_time = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
    order_mean = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
    unit_price_mean = train_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
    order_time_set = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'order_time_set': lambda x: len(set(x))})
    train_data = pd.merge(train_data, order_time, on=['id'], how='left')
    train_data = pd.merge(train_data, order_mean, on=['id'], how='left')
    train_data = pd.merge(train_data, order_time_set, on=['id'], how='left')
    train_data = pd.merge(train_data, unit_price_mean, on=['id'], how='left')
    if IS_OFFLine == False:
        # online mode: drop identifier / raw-time columns and train on all history
        train_data = train_data.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
    if IS_OFFLine == True:
        # offline mode: validate on applications from 2017-04-01 onwards
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
        # append the one-hot columns
        train_data_copy = pd.concat([train_data, dummy_df], axis=1)
        train_data_copy = train_data_copy.fillna(0)
        # drop the original categorical columns
        vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
        # split into training and validation parts by application time
        valid_train_train = vaild_train_data[vaild_train_data.appl_sbm_tm < datetime.datetime(2017, 4, 1)]
        valid_train_test = vaild_train_data[vaild_train_data.appl_sbm_tm >= datetime.datetime(2017, 4, 1)]
        valid_train_train = valid_train_train.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        valid_train_test = valid_train_test.drop(
            ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        # training features
        vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # validation features
        vaild_test_x = valid_train_test.drop(['target'], axis=1)
        redict_result, modelee = xgb_feature(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))
        sys.exit(23)
    """************************ test-set processing ***********************************"""
    """auth_info"""
    test_auth = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_auth_info.csv', parse_dates=['auth_time'])
    # flag id_card: 0 if missing, 1 otherwise
    auth_idcard = test_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    auth_idcard_df = pd.DataFrame()
    auth_idcard_df['id'] = test_auth['id']
    auth_idcard_df['auth_idcard_df'] = auth_idcard
    # flag phone: 0 if missing, 1 otherwise
    auth_phone = test_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    auth_phone_df = pd.DataFrame()
    auth_phone_df['id'] = test_auth['id']
    auth_phone_df['auth_phone_df'] = auth_phone
    test_auth['auth_time'].replace('0000-00-00', 'nan', inplace=True)
    test_auth['auth_time'] = pd.to_datetime(test_auth['auth_time'])
    """bankcard_info"""
    test_bankcard = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_bankcard_info.csv')
    # number of bank-card records per user (bank_name)
    test_bankcard_bank_count = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bankcard_count': lambda x: len(x)})
    # number of distinct card types (card_type)
    test_bankcard_card_count = test_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
        {'card_type_count': lambda x: len(set(x))})
    # number of distinct reserved phone numbers (phone)
    test_bankcard_phone_count = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'phone_count': lambda x: len(set(x))})
    """credit_info"""
    test_credit = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_credit_info.csv')
    # reversed credit score
    test_credit['credit_score_inverse'] = test_credit['credit_score'].map(lambda x: 605 - x)
    # unused credit = quota - overdraft
    test_credit['can_use'] = test_credit['quota'] - test_credit['overdraft']
    """order_info"""
    test_order = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_order_info.csv', parse_dates=['time_order'])
    # amt_order: 'NA' or 'null' becomes NaN, everything else is cast to float
    test_order['amt_order'] = test_order['amt_order'].map(
        lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
    # time_order: '0', 'NA' or 'nan' becomes NaT, otherwise parse a datetime string or a unix timestamp
    test_order['time_order'] = test_order['time_order'].map(
        lambda x: pd.NaT if (str(x) == '0' or x == 'NA' or x == 'nan')
        else (datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x)
              else (datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
    # latest order time per id
    test_order_time_max = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'test_order_time_max': lambda x: max(x)})
    # earliest order time per id
    test_order_time_min = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'test_order_time_min': lambda x: min(x)})
    # number of online payments (在线支付) per id
    test_order_type_zaixian = test_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == '在线支付').values].count()).reset_index(name='type_pay_zaixian')
    # number of cash-on-delivery payments (货到付款) per id
    test_order_type_huodao = test_order.groupby(by=['id']).apply(
        lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao')
    """recieve_addr_info"""
    test_recieve = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_recieve_addr_info.csv')
    # keep only the first two characters of region
    test_recieve['region'] = test_recieve['region'].map(lambda x: str(x)[:2])
    tmp_tmp_recieve = pd.crosstab(test_recieve.id, test_recieve.region)
    tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
    # number of fix_phone records per id
    tmp_tmp_recieve_phone_count = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
    tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
    # number of distinct fix_phone values per id
    tmp_tmp_recieve_phone_count_unique = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
    tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index()
    """test_list"""
    test_target = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
    """test_user_info"""
    test_user = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_user_info.csv', parse_dates=['birthday'])
    # flag hobby: 0 if missing, 1 otherwise
    is_hobby = test_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    is_hobby_df = pd.DataFrame()
    is_hobby_df['id'] = test_user['id']
    is_hobby_df['is_hobby'] = is_hobby
    # flag id_card: 0 if missing, 1 otherwise
    is_idcard = test_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
    # collect the flag in its own DataFrame
    is_idcard_df = pd.DataFrame()
    is_idcard_df['id'] = test_user['id']
    is_idcard_df['is_idcard'] = is_idcard
    # 0/1 flags for the malformed birthday patterns
    # convert id to numeric to avoid a merge error
    train_user['id'] = pd.to_numeric(train_user['id'], errors='coerce')
    tmp_tmp = train_user[['id', 'birthday']]
    tmp_tmp = tmp_tmp.set_index(['id'])
    is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
    is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
    is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
    is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
    is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
    is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
    # is_nan = tmp_tmp['birthday'].map(lambda x: (str(x) == 'nan') * 1).reset_index(name='is_nan')
    # normalise birthday: keep well-formed 19xx-mm-dd values as datetimes, everything else becomes NaT
    test_user['birthday'] = test_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
        re.match(r'19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.NaT)
    # merge the base tables
    test_data = pd.merge(test_target, test_auth, on=['id'], how='left')
    test_data = pd.merge(test_data, test_user, on=['id'], how='left')
    test_data = pd.merge(test_data, test_credit, on=['id'], how='left')
    test_data['hour'] = test_data['appl_sbm_tm'].map(lambda x: x.hour)
    test_data['month'] = test_data['appl_sbm_tm'].map(lambda x: x.month)
    test_data['year'] = test_data['appl_sbm_tm'].map(lambda x: x.year)
    test_data['quota_use_ratio'] = test_data['overdraft'] / (test_data['quota'] + 0.01)
    test_data['nan_num'] = test_data.isnull().sum(axis=1)
    test_data['diff_day'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
    test_data['how_old'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)
    # whether the id_card from the auth table and the id_card from the user table agree
    auth_idcard = list(test_data['id_card_x'])
    user_idcard = list(test_data['id_card_y'])
    idcard_result = []
    for indexx, uu in enumerate(auth_idcard):
        if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(0)
        elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
            idcard_result.append(1)
        elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
            idcard_result.append(2)
        else:
            ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
            ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
            if ttt1 == ttt2:
                idcard_result.append(3)
            if ttt1 != ttt2:
                idcard_result.append(4)
    test_data['the_same_id'] = idcard_result
    test_bankcard_phone_list = test_bankcard.groupby(by=['id'])['phone'].apply(
        lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
    # whether the auth phone appears among the bank-card reserved phones
    test_data = pd.merge(test_data, test_bankcard_phone_list, on=['id'], how='left')
    test_data['exist_phone'] = test_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
    test_data['exist_phone'] = test_data['exist_phone'] * 1
    test_data = test_data.drop(['bank_phone_list'], axis=1)
    """bankcard_info"""
    bank_name = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
        {'bank_name_len': lambda x: len(set(x))})
    bank_num = test_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg({'tail_num_len': lambda x: len(set(x))})
    bank_phone_num = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
        {'bank_phone_num': lambda x: x.nunique()})
    test_data = pd.merge(test_data, bank_name, on=['id'], how='left')
    test_data = pd.merge(test_data, bank_num, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_time_max, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_time_min, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_type_zaixian, on=['id'], how='left')
    test_data = pd.merge(test_data, test_order_type_huodao, on=['id'], how='left')
    test_data = pd.merge(test_data, is_double_, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0_0_0, on=['id'], how='left')
    test_data = pd.merge(test_data, is_1_1_1, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0000_00_00, on=['id'], how='left')
    test_data = pd.merge(test_data, is_0001_1_1, on=['id'], how='left')
    test_data = pd.merge(test_data, is_hou_in, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
    test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
    test_data = pd.merge(test_data, bank_phone_num, on=['id'], how='left')
    test_data = pd.merge(test_data, is_hobby_df, on=['id'], how='left')
    test_data = pd.merge(test_data, is_idcard_df, on=['id'], how='left')
    test_data = pd.merge(test_data, auth_idcard_df, on=['id'], how='left')
    test_data = pd.merge(test_data, auth_phone_df, on=['id'], how='left')
    # days between application time and the latest / earliest order
    test_data['day_order_max'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_max']).days,
                                                 axis=1)
    test_data = test_data.drop(['test_order_time_max'], axis=1)
    test_data['day_order_min'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_min']).days,
                                                 axis=1)
    test_data = test_data.drop(['test_order_time_min'], axis=1)
    """order_info"""
    order_time = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
    order_mean = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
    unit_price_mean = test_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
    order_time_set = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
        {'order_time_set': lambda x: len(set(x))})
    test_data = pd.merge(test_data, order_time, on=['id'], how='left')
    test_data = pd.merge(test_data, order_mean, on=['id'], how='left')
    test_data = pd.merge(test_data, order_time_set, on=['id'], how='left')
    test_data = pd.merge(test_data, unit_price_mean, on=['id'], how='left')
    test_data = test_data.drop(
        ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
    test_data['target'] = -1
    test_data.to_csv('8288test.csv', index=None)
    train_data.to_csv('8288train.csv', index=None)
    # one-hot encode the categorical columns jointly over train and test so the columns line up
    dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
    train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
    train_test_data = train_test_data.fillna(0)
    dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
    train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
    train_test_data = train_test_data.drop(dummy_fea, axis=1)
    train_train = train_test_data.iloc[:train_data.shape[0], :]
    test_test = train_test_data.iloc[train_data.shape[0]:, :]
    train_train_x = train_train.drop(['target'], axis=1)
    test_test_x = test_test.drop(['target'], axis=1)
    predict_result, modelee = xgb_feature(train_train_x, train_train['target'].values, test_test_x, None)
    # write the submission: min-max normalise the scores and keep four decimals
    ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
    ans['PROB'] = predict_result
    ans = ans.drop(['appl_sbm_tm'], axis=1)
    minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
    ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
    ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
    ans.to_csv('04094test.csv', index=None)

stacking.py

# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np


def xgb_feature(X_train, y_train, X_test, y_test=None):
    # model parameters
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.02,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 1111,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=800)
    predict = model.predict(dvali)
    # min-max normalise the scores before they are blended
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def xgb_feature2(X_train, y_train, X_test, y_test=None):
    # model parameters (smaller learning rate, more boosting rounds)
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.015,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 11,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=1200)
    predict = model.predict(dvali)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def xgb_feature3(X_train, y_train, X_test, y_test=None):
    # model parameters (smallest learning rate, most boosting rounds)
    params = {'booster': 'gbtree',
              'objective': 'rank:pairwise',
              'eval_metric': 'auc',
              'eta': 0.01,
              'max_depth': 5,  # 4 3
              'colsample_bytree': 0.7,  # 0.8
              'subsample': 0.7,
              'min_child_weight': 1,  # 2 3
              'seed': 1,
              'silent': 1
              }
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvali = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain, num_boost_round=2000)
    predict = model.predict(dvali)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def et_model(X_train, y_train, X_test, y_test=None):
    model = ExtraTreesClassifier(max_features='log2', n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def gbdt_model(X_train, y_train, X_test, y_test=None):
    model = GradientBoostingClassifier(learning_rate=0.02, max_features=0.7, n_estimators=700,
                                       max_depth=5).fit(X_train, y_train)
    predict = model.predict_proba(X_test)[:, 1]
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


def logistic_model(X_train, y_train, X_test, y_test=None):
    model = LogisticRegression(penalty='l2').fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


def lgb_feature(X_train, y_train, X_test, y_test=None):
    lgb_train = lgb.Dataset(X_train, y_train,
                            categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
                                                 'account_grade', 'industry'})
    lgb_test = lgb.Dataset(X_test,
                           categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
                                                'account_grade', 'industry'})
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': 25,
        'learning_rate': 0.01,
        'feature_fraction': 0.7,
        'bagging_fraction': 0.7,
        'bagging_freq': 5,
        'min_data_in_leaf': 5,
        'max_bin': 200,
        'verbose': 0,
    }
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=2000)
    predict = gbm.predict(X_test)
    minmin = min(predict)
    maxmax = max(predict)
    vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
    return vfunc(predict)


if __name__ == '__main__':
    VAILD = False
    if VAILD == True:
        # offline validation on the training file
        train_data = pd.read_csv('8288train.csv', engine='python')
        # train_data = train_data.drop(
        #     ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
        # fill missing values
        train_data = train_data.fillna(0)
        # # test data
        # test_data = pd.read_csv('8288test.csv', engine='python')
        # # fill missing values with 0
        # test_data = test_data.fillna(0)
        # one-hot encode the categorical columns
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
        # append the one-hot columns
        train_data_copy = pd.concat([train_data, dummy_df], axis=1)
        # fill missing values with 0
        train_data_copy = train_data_copy.fillna(0)
        # drop the original categorical columns
        vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
        # training part
        valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
        # validation part
        valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
        # training features
        vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # validation features
        vaild_test_x = valid_train_test.drop(['target'], axis=1)
        # logistic-regression baseline
        redict_result = logistic_model(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))
        # dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        # for _fea in dummy_fea:
        #     print(_fea)
        #     le = LabelEncoder()
        #     le.fit(train_data[_fea].tolist())
        #     train_data[_fea] = le.transform(train_data[_fea].tolist())
        # train_data_copy = train_data.copy()
        # vaild_train_data = train_data_copy
        # valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
        # valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
        # vaild_train_x = valid_train_train.drop(['target'], axis=1)
        # vaild_test_x = valid_train_test.drop(['target'], axis=1)
        # redict_result = lgb_feature(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
        # print('valid auc', roc_auc_score(valid_train_test['target'].values, redict_result))
    if VAILD == False:
        # training data
        train_data = pd.read_csv('8288train.csv', engine='python')
        train_data = train_data.fillna(0)
        # test data
        test_data = pd.read_csv('8288test.csv', engine='python')
        test_data = test_data.fillna(0)
        # concatenate training and test data
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data = train_test_data.fillna(0)
        # split back into training and test parts
        train_data = train_test_data.iloc[:train_data.shape[0], :]
        test_data = train_test_data.iloc[train_data.shape[0]:, :]
        # label-encode the categorical columns for the lightgbm dataset
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        for _fea in dummy_fea:
            print(_fea)
            le = LabelEncoder()
            le.fit(train_data[_fea].tolist() + test_data[_fea].tolist())
            tmp = le.transform(train_data[_fea].tolist() + test_data[_fea].tolist())
            train_data[_fea] = tmp[:train_data.shape[0]]
            test_data[_fea] = tmp[train_data.shape[0]:]
        train_x = train_data.drop(['target'], axis=1)
        test_x = test_data.drop(['target'], axis=1)
        lgb_dataset = Dataset(train_x, train_data['target'], test_x, use_cache=False)
        """**********************************************************"""
        # reload the data and one-hot encode the categorical columns for the xgboost dataset
        train_data = pd.read_csv('8288train.csv', engine='python')
        train_data = train_data.fillna(0)
        test_data = pd.read_csv('8288test.csv', engine='python')
        test_data = test_data.fillna(0)
        train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
        train_test_data = train_test_data.fillna(0)
        dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
        dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
        train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
        train_test_data = train_test_data.drop(dummy_fea, axis=1)
        train_train = train_test_data.iloc[:train_data.shape[0], :]
        test_test = train_test_data.iloc[train_data.shape[0]:, :]
        train_train_x = train_train.drop(['target'], axis=1)
        test_test_x = test_test.drop(['target'], axis=1)
        xgb_dataset = Dataset(X_train=train_train_x, y_train=train_train['target'], X_test=test_test_x, y_test=None,
                              use_cache=False)
        # heamy: stack the base models and blend them with a linear regression
        model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature, name='xgb', use_cache=False)
        model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2, name='xgb2', use_cache=False)
        model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3, name='xgb3', use_cache=False)
        model_lgb = Regressor(dataset=lgb_dataset, estimator=lgb_feature, name='lgb', use_cache=False)
        model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model, name='gbdt', use_cache=False)
        pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt)
        stack_ds = pipeline.stack(k=5, seed=111, add_diff=False, full_test=True)
        stacker = Regressor(dataset=stack_ds, estimator=LinearRegression, parameters={'fit_intercept': False})
        predict_result = stacker.predict()
        # write the submission: min-max normalise the scores and keep four decimals
        ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
        ans['PROB'] = predict_result
        ans = ans.drop(['appl_sbm_tm'], axis=1)
        minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
        ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
        ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
        ans.to_csv('ans_stacking.csv', index=None)

Reference:https://github.com/chenkkkk/User-loan-risk-prediction
