马上AI全球挑战者大赛 - Default Risk Prediction
Solution Overview
In recent years, internet finance has become a clear trend in the financial industry. Whether the business is investment or lending, risk control is always its foundation. Consumer finance in particular serves customers characterized by small loan amounts, a very large customer base, and short loan cycles, which is why it is widely regarded as the riskiest segment.
Take lending as an example. Traditional finance relies on the relatively narrow channel of asset documents supplied by the applicant, while internet finance can combine a user's offline asset profile with their online consumption behavior into one comprehensive analysis, giving users a better service experience and giving lenders a fuller basis for understanding and evaluating them.
As artificial intelligence and big-data techniques keep spreading through the industry, using fintech to actively collect, analyze, and organize financial data and deliver more precise risk control for specific customer segments has become an effective way to solve the risk-control problem of consumer finance. In short, telling apart the users who are likely to default is the key to more precise risk control.
For this competition task, default risk prediction with big-data finance, the solution described here consists of the following steps:
- 1. Preprocess the users' historical behavior data;
- 2. Split the historical data into a training set and a validation set;
- 3. Apply feature engineering to the users' historical data;
- 4. Perform feature selection on the constructed feature set;
- 5. Build several machine-learning models and ensemble them;
- 6. Use the resulting model to predict, from the historical behavior data, whether a user will be overdue on repayment within the next month.
Figure 1 shows the workflow of the solution.
Figure 1: Workflow of the default risk prediction solution
2. Data Exploration
2.1 Data Preprocessing
1. Outliers: the data contains unexplained abnormal values. Simply filtering them out would shrink the training set, so instead they are filled with -1 (or another value clearly distinguishable from the feature's normal range);
2. Multi-faceted handling of missing values: in credit scoring, how complete a user's profile is can affect their credit rating. A user whose profile is 100% complete passes review and obtains a loan more easily than one whose profile is only 50% complete. With this in mind, missing values are analyzed and handled from several angles. Missing values are first counted per column (attribute) to obtain each column's missing rate, MissRate_i = x_i / Count, where x_i is the number of missing values in attribute column i, Count is the total number of samples, and MissRate_i is that column's missing rate;
3. Other cleaning: whitespace, since some attribute values contain stray spaces, e.g. "货到付款" versus "货到付款 ", which are obviously the same value, so the spaces are stripped; city names, where values such as "重庆" and "重庆市" refer to the same city, so the redundant "市" suffix is removed, which greatly reduces the number of distinct city values. A small sketch of these cleaning steps follows this list.
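As a rough illustration of these cleaning steps, the sketch below assumes a generic pandas DataFrame df with a payment-type column type_pay and a city column city; these names are placeholders rather than the competition's exact schema.
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    # per-column missing rate: MissRate_i = x_i / Count
    miss_rate = df.isnull().sum() / len(df)
    print(miss_rate.sort_values(ascending=False).head(10))
    # fill missing / abnormal values with -1 instead of dropping rows
    df = df.fillna(-1)
    # strip stray spaces, e.g. '货到付款 ' -> '货到付款'
    df['type_pay'] = df['type_pay'].astype(str).str.strip()
    # drop the redundant '市' suffix, e.g. '重庆市' -> '重庆'
    df['city'] = df['city'].astype(str).str.replace('市', '')
    return df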
2.2 Temporal Patterns
From the users' historical data, the number of defaulting and non-defaulting samples was counted per time period and visualized as shown below:
Figure 2: Defaulting vs. non-defaulting counts over time
Default behaviour clearly shows some periodicity over time, and 2017 has far more samples than 2016, so the solution relies heavily on time-related features. The sketch below shows one way to produce such counts.
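A minimal sketch of how such monthly counts can be produced from the target table; the file path and column names follow the training script later in this post, while the plotting details are illustrative.
import pandas as pd
import matplotlib.pyplot as plt

train_target = pd.read_csv('../AI_risk_train_V3.0/train_target.csv', parse_dates=['appl_sbm_tm'])
# count applications per month, split by target (0 = repaid, 1 = default)
monthly = (train_target
           .groupby([train_target['appl_sbm_tm'].dt.to_period('M'), 'target'])
           .size()
           .unstack(fill_value=0))
monthly.plot(kind='bar')
plt.show()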
2.3 Training / Validation Split
Default risk builds up over a long period, so the traditional sliding-window split that pairs training and test sets by time window is not the best choice here. Instead, all historical user data is kept for training so that user behavior patterns are learned as fully as possible, and the validation set is built by cross-validation, as illustrated below:
Figure 3: Cross-validation
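A minimal cross-validation sketch; the fold count, shuffling, and stratification here are illustrative choices, not necessarily the exact competition setup. It assumes X is a feature DataFrame and y a numpy array of 0/1 targets.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, params, n_splits=5, num_boost_round=800):
    # average validation AUC over stratified folds
    aucs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr_idx, va_idx in skf.split(X, y):
        dtrain = xgb.DMatrix(X.iloc[tr_idx], label=y[tr_idx])
        dvalid = xgb.DMatrix(X.iloc[va_idx])
        model = xgb.train(params, dtrain, num_boost_round=num_boost_round)
        aucs.append(roc_auc_score(y[va_idx], model.predict(dvalid)))
    return float(np.mean(aucs))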
3. Feature Engineering
3.1 0-1 Features
Extracted mainly from the auth, credit, and user tables, whose ids contain no duplicates.
- Flag whether id_card, auth_time, and phone in the auth table are missing;
- Flag whether credit_score, overdraft, and quota in the credit table are missing;
- Flag whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound, and account_grade in the user table are missing.
- Flag whether id_card, auth_time, and phone in the auth table are present (not missing);
- Flag whether credit_score, overdraft, and quota in the credit table are present (not missing);
- Flag whether sex, birthday, hobby, merriage, income, id_card, degree, industry, qq_bound, wechat_bound, and account_grade in the user table are present (not missing).
3.2 Information-Completeness Feature
Extracted mainly from the auth, credit, and user tables: for each sample in these tables, completeness is defined as the number of non-null attributes divided by the total number of attributes.
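For example, the completeness of a table can be computed roughly as follows (a sketch; the completeness column name is our own):
def add_completeness(df, id_col='id'):
    # fraction of non-null attributes per row, excluding the id column
    cols = [c for c in df.columns if c != id_col]
    df['completeness'] = df[cols].notnull().sum(axis=1) / float(len(cols))
    return df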
3.3 One-Hot Features
Extracted mainly from the user table.
One-hot encode the discrete attributes sex, merriage, income, degree, qq_bound, wechat_bound, and account_grade of the user table.
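This amounts to a pd.get_dummies call over those columns, as the full script below also does; train_user here stands for the user table after loading.
import pandas as pd

dummy_fea = ['sex', 'merriage', 'income', 'degree', 'qq_bound', 'wechat_bound', 'account_grade']
user_onehot = pd.get_dummies(train_user.loc[:, dummy_fea])
train_user = pd.concat([train_user, user_onehot], axis=1)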
3.4 Business Features
Features derived from business logic (the most effective group of features), extracted mainly from the credit, auth, bankcard, and order tables; a short sketch of a few of them follows this list:
- (1) Difference between the loan application time (appl_sbm_tm) and the authentication time (auth_time)
- (2) Difference between the loan application time (appl_sbm_tm) and the birthday (birthday)
- (3) Reversed credit score (credit_score)
- (4) Unused credit (quota minus overdraft)
- (5) Credit utilization ratio (overdraft divided by quota)
- (6) Whether the amount used exceeds the credit limit (overdraft greater than quota)
- (7) Number of bank cards (bank_name)
- (8) Number of distinct banks among the cards (bank_name)
- (9) Number of distinct card types (card_type)
- (10) Number of distinct phone numbers reserved with the cards (phone)
- (11) From the order table: the number of amt_order records and the counts of type_pay = 在线支付 (online payment), type_pay = 货到付款 (cash on delivery), and sts_order = 已完成 (completed); the order table is then de-duplicated by id, keeping the first record of each duplicated id.
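A sketch of a few of the features above, assuming the credit table (train_credit) and the merged application frame (train_data) built in the script below; the over_quota name is our own:
train_credit['can_use'] = train_credit['quota'] - train_credit['overdraft']                   # (4) unused credit
train_credit['quota_use_ratio'] = train_credit['overdraft'] / (train_credit['quota'] + 0.01)  # (5) utilization ratio
train_credit['over_quota'] = (train_credit['overdraft'] > train_credit['quota']).astype(int)  # (6) over the limit
train_data['diff_day'] = (train_data['appl_sbm_tm'] - train_data['auth_time']).dt.days        # (1) application time minus auth time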
4. Feature Selection
The feature-engineering stage produces basic, time, business, combined, and discrete (one-hot) features, several hundred dimensions in total. Such a high-dimensional feature space risks the curse of dimensionality on one hand and makes the model prone to overfitting on the other, so feature selection is used to reduce the dimensionality. Model-based feature ranking is efficient here because model training and feature selection happen in the same pass, so that is the approach we take: an xgboost model reports feature importances after training (see Figure 4), and the top n features are kept.
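A sketch of this selection step: model is assumed to be a trained xgboost Booster, train_x the feature frame, and n = 50 is just an illustrative cut-off.
def select_top_n(model, train_x, n=50):
    # rank features by how often they are used in splits and keep the top n
    importance = model.get_score(importance_type='weight')
    top_cols = sorted(importance, key=importance.get, reverse=True)[:n]
    return train_x[top_cols]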
5. Model Training
Four xgboost models are trained in total, perturbed in their parameters and in their features. Each single model is tuned and feature-selected so that it is as strong as possible on its own, and the four models are then blended with different weights to produce the final result.
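Because the metric is AUC, which depends only on the ranking of the predictions, each model's output is min-max normalized before being combined. The sketch below shows the idea; the weights are illustrative, not the values used in the competition.
import numpy as np

def blend(predictions, weights):
    # min-max normalize each prediction vector, then take the weighted sum
    def minmax(p):
        p = np.asarray(p, dtype=float)
        return (p - p.min()) / (p.max() - p.min())
    return sum(w * minmax(p) for p, w in zip(predictions, weights))

# e.g. four xgboost outputs with hand-chosen weights:
# final_prob = blend([pred1, pred2, pred3, pred4], [0.4, 0.3, 0.2, 0.1])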
6. Important Features
The XGBOOST model's feature importances are sorted in descending order and the top 20 are visualized below:
Figure 4: Feature importance ranking
The 20 most important features selected by the model are listed in the following table:
7. Innovations
7.1 Features
Many attributes in the raw data are messy; attributes such as dates were cleaned to make feature extraction easier. An information-completeness feature was added, which makes good use of the samples that contain nulls. For the order table, where an id can appear in several rows, we tried de-duplicating after feature extraction by time, by first record, and by last record; keeping the first record worked best and makes good use of the order information. Features were filtered by importance ranking, which also showed that the business-related features are the most important ones.
7.2 Model
The main innovation on the modeling side is the ensembling. The evaluation metric is AUC, which depends only on the ordering of the predictions, so each model's output is normalized before the weighted blend; the blended result is very strong.
8. Reflections
Cleaning the data matters a great deal. Attributes such as timestamps are very messy and tedious to handle; we only processed them in a simple way, and more careful handling should give better results. Some attributes, such as hobby, were too complex to use at all even though they surely contain a lot of valuable information; unfortunately we did not find a good way to exploit them.
04094.py
# -*- coding: utf-8 -*-
import pandas as pd
import datetime
import sys
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import xgboost as xgb
import re
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc
def xgb_feature(train_set_x, train_set_y, test_set_x, test_set_y):
# model parameters
params = {'booster': 'gbtree',
'objective': 'rank:pairwise',
'eval_metric': 'auc',
'eta': 0.02,
'max_depth': 5, # 4 3
'colsample_bytree': 0.7, # 0.8
'subsample': 0.7,
'min_child_weight': 1, # 2 3
'silent': 1
}
dtrain = xgb.DMatrix(train_set_x, label=train_set_y)
dvali = xgb.DMatrix(test_set_x)
model = xgb.train(params, dtrain, num_boost_round=800)
predict = model.predict(dvali)
return predict, model
if __name__ == '__main__':
IS_OFFLine = False
"""读取auth数据"""
train_auth = pd.read_csv('../AI_risk_train_V3.0/train_auth_info.csv', parse_dates=['auth_time'])
# flag id_card: 0 if missing, 1 otherwise
auth_idcard = train_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
auth_idcard_df = pd.DataFrame()
auth_idcard_df['id'] = train_auth['id']
auth_idcard_df['auth_idcard_df'] = auth_idcard
# flag phone: 0 if missing, 1 otherwise
auth_phone = train_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
auth_phone_df = pd.DataFrame()
auth_phone_df['id'] = train_auth['id']
auth_phone_df['auth_phone_df'] = auth_phone
"""读取bankcard数据"""
train_bankcard = pd.read_csv('../AI_risk_train_V3.0/train_bankcard_info.csv')
# number of bank cards per user (bank_name count)
train_bankcard_bank_count = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
{'bankcard_count': lambda x: len(x)})
# number of distinct card types (card_type) per user
train_bankcard_card_count = train_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
{'card_type_count': lambda x: len(set(x))})
# number of distinct reserved phone numbers (phone) per user
train_bankcard_phone_count = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
{'phone_count': lambda x: len(set(x))})
"""读取credit数据"""
train_credit = pd.read_csv('../AI_risk_train_V3.0/train_credit_info.csv')
# reversed credit score
train_credit['credit_score_inverse'] = train_credit['credit_score'].map(lambda x: 605 - x)
# unused credit: quota minus overdraft
train_credit['can_use'] = train_credit['quota'] - train_credit['overdraft']
"""读取order数据"""
train_order = pd.read_csv('../AI_risk_train_V3.0/train_order_info.csv', parse_dates=['time_order'])
# clean amt_order: 'NA'/'null' become NaN, everything else is cast to float
train_order['amt_order'] = train_order['amt_order'].map(
lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
# clean time_order: '0'/'NA'/'nan' become NaT, otherwise parse the datetime string or unix timestamp
train_order['time_order'] = train_order['time_order'].map(
lambda x: pd.NaT if (str(x) == '0' or x == 'NA' or x == 'nan') else (
datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x) else (
datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
train_order_time_max = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'train_order_time_max': lambda x: max(x)})
train_order_time_min = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'train_order_time_min': lambda x: min(x)})
train_order_type_zaixian = train_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == ' 在线支付').values].count()).reset_index(name='type_pay_zaixian')
train_order_type_huodao = train_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao')
"""读取地址信息数据"""
train_recieve = pd.read_csv('../AI_risk_train_V3.0/train_recieve_addr_info.csv')
# keep only the first two characters of region
train_recieve['region'] = train_recieve['region'].map(lambda x: str(x)[:2])
tmp_tmp_recieve = pd.crosstab(train_recieve.id, train_recieve.region)
tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
# number of fix_phone records per id
tmp_tmp_recieve_phone_count = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
# number of distinct fix_phone values per id
tmp_tmp_recieve_phone_count_unique = train_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index()
"""读取target数据"""
train_target = pd.read_csv('../AI_risk_train_V3.0/train_target.csv', parse_dates=['appl_sbm_tm'])
"""读取user数据"""
train_user = pd.read_csv('../AI_risk_train_V3.0/train_user_info.csv')
# flag hobby: 0 if missing, 1 otherwise
is_hobby = train_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
is_hobby_df = pd.DataFrame()
is_hobby_df['id'] = train_user['id']
is_hobby_df['is_hobby'] = is_hobby
# flag id_card: 0 if missing, 1 otherwise
is_idcard = train_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
is_idcard_df = pd.DataFrame()
is_idcard_df['id'] = train_user['id']
is_idcard_df['is_idcard'] = is_idcard
# user_birthday
tmp_tmp = train_user[['id', 'birthday']]
tmp_tmp = tmp_tmp.set_index(['id'])
is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
# is_nan = tmp_tmp['birthday'].map(lambda x:(str(x) == 'nan')*1).reset_index(name='is_nan')
# parse birthday into a datetime (only well-formed 19xx-xx-xx dates)
train_user['birthday'] = train_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
re.match('19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.NaT)
# merge the base tables into the training frame
train_data = pd.merge(train_target, train_auth, on=['id'], how='left')
train_data = pd.merge(train_data, train_user, on=['id'], how='left')
train_data = pd.merge(train_data, train_credit, on=['id'], how='left')
train_data['hour'] = train_data['appl_sbm_tm'].map(lambda x: x.hour)
train_data['month'] = train_data['appl_sbm_tm'].map(lambda x: x.month)
train_data['year'] = train_data['appl_sbm_tm'].map(lambda x: x.year)
train_data['quota_use_ratio'] = train_data['overdraft'] / (train_data['quota'] + 0.01)
train_data['nan_num'] = train_data.isnull().sum(axis=1)
train_data['diff_day'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
train_data['how_old'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)
# whether the id_card from the auth table matches the one from the user table
auth_idcard = list(train_data['id_card_x'])
user_idcard = list(train_data['id_card_y'])
idcard_result = []
for indexx, uu in enumerate(auth_idcard):
if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
idcard_result.append(0)
elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
idcard_result.append(1)
elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
idcard_result.append(2)
else:
ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
if ttt1 == ttt2:
idcard_result.append(3)
if ttt1 != ttt2:
idcard_result.append(4)
train_data['the_same_id'] = idcard_result
train_bankcard_phone_list = train_bankcard.groupby(by=['id'])['phone'].apply(
lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
# merge the per-user list of bank-card phones
train_data = pd.merge(train_data, train_bankcard_phone_list, on=['id'], how='left')
train_data['exist_phone'] = train_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
train_data['exist_phone'] = train_data['exist_phone'] * 1
train_data = train_data.drop(['bank_phone_list'], axis=1)
"""bank_info"""
bank_name = train_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
{'bank_name_len': lambda x: len(set(x))})
bank_num = train_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg(
{'tail_num_len': lambda x: len(set(x))})
bank_phone_num = train_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
{'bank_phone_num': lambda x: x.nunique()})
train_data = pd.merge(train_data, bank_name, on=['id'], how='left')
train_data = pd.merge(train_data, bank_num, on=['id'], how='left')
train_data = pd.merge(train_data, train_order_time_max, on=['id'], how='left')
train_data = pd.merge(train_data, train_order_time_min, on=['id'], how='left')
train_data = pd.merge(train_data, train_order_type_zaixian, on=['id'], how='left')
train_data = pd.merge(train_data, train_order_type_huodao, on=['id'], how='left')
train_data = pd.merge(train_data, is_double_, on=['id'], how='left')
train_data = pd.merge(train_data, is_0_0_0, on=['id'], how='left')
train_data = pd.merge(train_data, is_1_1_1, on=['id'], how='left')
train_data = pd.merge(train_data, is_0000_00_00, on=['id'], how='left')
train_data = pd.merge(train_data, is_0001_1_1, on=['id'], how='left')
train_data = pd.merge(train_data, is_hou_in, on=['id'], how='left')
train_data = pd.merge(train_data, tmp_tmp_recieve, on=['id'], how='left')
train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
train_data = pd.merge(train_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
train_data = pd.merge(train_data, bank_phone_num, on=['id'], how='left')
train_data = pd.merge(train_data, is_hobby_df, on=['id'], how='left')
train_data = pd.merge(train_data, is_idcard_df, on=['id'], how='left')
train_data = pd.merge(train_data, auth_idcard_df, on=['id'], how='left')
train_data = pd.merge(train_data, auth_phone_df, on=['id'], how='left')
train_data['day_order_max'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_max']).days,
axis=1)
train_data = train_data.drop(['train_order_time_max'], axis=1)
train_data['day_order_min'] = train_data.apply(lambda row: (row['appl_sbm_tm'] - row['train_order_time_min']).days,
axis=1)
train_data = train_data.drop(['train_order_time_min'], axis=1)
"""order_info"""
order_time = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
order_mean = train_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
unit_price_mean = train_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
order_time_set = train_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'order_time_set': lambda x: len(set(x))})
train_data = pd.merge(train_data, order_time, on=['id'], how='left')
train_data = pd.merge(train_data, order_mean, on=['id'], how='left')
train_data = pd.merge(train_data, order_time_set, on=['id'], how='left')
train_data = pd.merge(train_data, unit_price_mean, on=['id'], how='left')
if IS_OFFLine == False:
# online mode: train on all data and predict on the test set later
train_data = train_data.drop(
['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
if IS_OFFLine == True:
# offline mode: evaluate on a held-out validation split
dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
# append the one-hot encoded features
train_data_copy = pd.concat([train_data, dummy_df], axis=1)
train_data_copy = train_data_copy.fillna(0)
# drop the original categorical columns
vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
# split into training and validation sets by application time
valid_train_train = vaild_train_data[vaild_train_data.appl_sbm_tm < datetime.datetime(2017, 4, 1)]
valid_train_test = vaild_train_data[vaild_train_data.appl_sbm_tm >= datetime.datetime(2017, 4, 1)]
valid_train_train = valid_train_train.drop(
['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
valid_train_test = valid_train_test.drop(
['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
# training-set features
vaild_train_x = valid_train_train.drop(['target'], axis=1)
# validation-set features
vaild_test_x = valid_train_test.drop(['target'], axis=1)
predict_result, modelee = xgb_feature(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
print('valid auc', roc_auc_score(valid_train_test['target'].values, predict_result))
sys.exit(23)
"""************************测试集数据处理***********************************"""
"""auth_info"""
test_auth = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_auth_info.csv', parse_dates=['auth_time'])
# flag id_card: 0 if missing, 1 otherwise
auth_idcard = test_auth['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
auth_idcard_df = pd.DataFrame()
auth_idcard_df['id'] = test_auth['id']
auth_idcard_df['auth_idcard_df'] = auth_idcard
# flag phone: 0 if missing, 1 otherwise
auth_phone = test_auth['phone'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
auth_phone_df = pd.DataFrame()
auth_phone_df['id'] = test_auth['id']
auth_phone_df['auth_phone_df'] = auth_phone
test_auth['auth_time'].replace('0000-00-00', 'nan', inplace=True)
test_auth['auth_time'] = pd.to_datetime(test_auth['auth_time'])
"""bankcard_info"""
test_bankcard = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_bankcard_info.csv')
# number of bank cards per user (bank_name count)
test_bankcard_bank_count = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
{'bankcard_count': lambda x: len(x)})
# number of distinct card types (card_type) per user
test_bankcard_card_count = test_bankcard.groupby(by=['id'], as_index=False)['card_type'].agg(
{'card_type_count': lambda x: len(set(x))})
# number of distinct reserved phone numbers (phone) per user
test_bankcard_phone_count = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
{'phone_count': lambda x: len(set(x))})
"""credit_info"""
test_credit = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_credit_info.csv')
# reversed credit score
test_credit['credit_score_inverse'] = test_credit['credit_score'].map(lambda x: 605 - x)
# unused credit: quota minus overdraft
test_credit['can_use'] = test_credit['quota'] - test_credit['overdraft']
"""order_info"""
test_order = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_order_info.csv', parse_dates=['time_order'])
# clean amt_order: 'NA'/'null' become NaN, everything else is cast to float
test_order['amt_order'] = test_order['amt_order'].map(
lambda x: np.nan if ((x == 'NA') | (x == 'null')) else float(x))
# clean time_order: '0'/'NA'/'nan' become NaT, otherwise parse the datetime string or unix timestamp
test_order['time_order'] = test_order['time_order'].map(
lambda x: pd.NaT if (str(x) == '0' or x == 'NA' or x == 'nan')
else (datetime.datetime.strptime(str(x), '%Y-%m-%d %H:%M:%S') if ':' in str(x)
else (datetime.datetime.utcfromtimestamp(int(x[0:10])) + datetime.timedelta(hours=8))))
# latest order time per id
test_order_time_max = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'test_order_time_max': lambda x: max(x)})
# earliest order time per id
test_order_time_min = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'test_order_time_min': lambda x: min(x)})
# number of online-payment (在线支付) orders per id
test_order_type_zaixian = test_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == '在线支付').values].count()).reset_index(name='type_pay_zaixian')
# number of cash-on-delivery (货到付款) orders per id
test_order_type_huodao = test_order.groupby(by=['id']).apply(
lambda x: x['type_pay'][(x['type_pay'] == '货到付款').values].count()).reset_index(name='type_pay_huodao')
"""recieve_addr_info"""
test_recieve = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_recieve_addr_info.csv')
# keep only the first two characters of region
test_recieve['region'] = test_recieve['region'].map(lambda x: str(x)[:2])
tmp_tmp_recieve = pd.crosstab(test_recieve.id, test_recieve.region)
tmp_tmp_recieve = tmp_tmp_recieve.reset_index()
# number of fix_phone records per id
tmp_tmp_recieve_phone_count = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].count())
tmp_tmp_recieve_phone_count = tmp_tmp_recieve_phone_count.reset_index()
# number of distinct fix_phone values per id
tmp_tmp_recieve_phone_count_unique = test_recieve.groupby(by=['id']).apply(lambda x: x['fix_phone'].nunique())
tmp_tmp_recieve_phone_count_unique = tmp_tmp_recieve_phone_count_unique.reset_index()
"""test_list"""
test_target = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
"""test_user_info"""
test_user = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_user_info.csv', parse_dates=['birthday'])
# flag hobby: 0 if missing, 1 otherwise
is_hobby = test_user['hobby'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
is_hobby_df = pd.DataFrame()
is_hobby_df['id'] = test_user['id']
is_hobby_df['is_hobby'] = is_hobby
# flag id_card: 0 if missing, 1 otherwise
is_idcard = test_user['id_card'].map(lambda x: 0 if str(x) == 'nan' else 1)
# collect the flag into a DataFrame
is_idcard_df = pd.DataFrame()
is_idcard_df['id'] = test_user['id']
is_idcard_df['is_idcard'] = is_idcard
# user_birthday
# cast id to numeric to avoid a merge error; note these birthday flags are built from train_user
train_user['id'] = pd.to_numeric(train_user['id'], errors='coerce')
tmp_tmp = train_user[['id', 'birthday']]
tmp_tmp = tmp_tmp.set_index(['id'])
is_double_ = tmp_tmp['birthday'].map(lambda x: (str(x) == '--') * 1).reset_index(name='is_double_')
is_0_0_0 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0-0-0') * 1).reset_index(name='is_0_0_0')
is_1_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '1-1-1') * 1).reset_index(name='is_1_1_1')
is_0000_00_00 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0000-00-00') * 1).reset_index(name='is_0000_00_00')
is_0001_1_1 = tmp_tmp['birthday'].map(lambda x: (str(x) == '0001-1-1') * 1).reset_index(name='is_0001_1_1')
is_hou_in = tmp_tmp['birthday'].map(lambda x: ('后' in str(x)) * 1).reset_index(name='is_hou_in')
# is_nan = tmp_tmp['birthday'].map(lambda x:(str(x) == 'nan')*1).reset_index(name='is_nan')
# parse birthday into a datetime (only well-formed 19xx-xx-xx dates)
test_user['birthday'] = test_user['birthday'].map(lambda x: datetime.datetime.strptime(str(x), '%Y-%m-%d') if (
re.match('19\d{2}-\d{1,2}-\d{1,2}', str(x)) and '-0' not in str(x)) else pd.NaT)
test_data = pd.merge(test_target, test_auth, on=['id'], how='left')
test_data = pd.merge(test_data, test_user, on=['id'], how='left')
test_data = pd.merge(test_data, test_credit, on=['id'], how='left')
test_data['hour'] = test_data['appl_sbm_tm'].map(lambda x: x.hour)
test_data['month'] = test_data['appl_sbm_tm'].map(lambda x: x.month)
test_data['year'] = test_data['appl_sbm_tm'].map(lambda x: x.year)
test_data['quota_use_ratio'] = test_data['overdraft'] / (test_data['quota'] + 0.01)
test_data['nan_num'] = test_data.isnull().sum(axis=1)
test_data['diff_day'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['auth_time']).days, axis=1)
test_data['how_old'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['birthday']).days / 365, axis=1)
# whether the id_card from the auth table matches the one from the user table
auth_idcard = list(test_data['id_card_x'])
user_idcard = list(test_data['id_card_y'])
idcard_result = []
for indexx, uu in enumerate(auth_idcard):
if (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) == 'nan'):
idcard_result.append(0)
elif (str(auth_idcard[indexx]) != 'nan') and (str(user_idcard[indexx]) == 'nan'):
idcard_result.append(1)
elif (str(auth_idcard[indexx]) == 'nan') and (str(user_idcard[indexx]) != 'nan'):
idcard_result.append(2)
else:
ttt1 = str(auth_idcard[indexx])[0] + str(auth_idcard[indexx])[-1]
ttt2 = str(user_idcard[indexx])[0] + str(user_idcard[indexx])[-1]
if ttt1 == ttt2:
idcard_result.append(3)
if ttt1 != ttt2:
idcard_result.append(4)
test_data['the_same_id'] = idcard_result
test_bankcard_phone_list = test_bankcard.groupby(by=['id'])['phone'].apply(
lambda x: list(set(x.tolist()))).reset_index(name='bank_phone_list')
test_data = pd.merge(test_data, test_bankcard_phone_list, on=['id'], how='left')
test_data['exist_phone'] = test_data.apply(lambda x: x['phone'] in x['bank_phone_list'], axis=1)
test_data['exist_phone'] = test_data['exist_phone'] * 1
test_data = test_data.drop(['bank_phone_list'], axis=1)
"""bankcard_info"""
bank_name = test_bankcard.groupby(by=['id'], as_index=False)['bank_name'].agg(
{'bank_name_len': lambda x: len(set(x))})
bank_num = test_bankcard.groupby(by=['id'], as_index=False)['tail_num'].agg({'tail_num_len': lambda x: len(set(x))})
bank_phone_num = test_bankcard.groupby(by=['id'], as_index=False)['phone'].agg(
{'bank_phone_num': lambda x: x.nunique()})
test_data = pd.merge(test_data, bank_name, on=['id'], how='left')
test_data = pd.merge(test_data, bank_num, on=['id'], how='left')
test_data = pd.merge(test_data, test_order_time_max, on=['id'], how='left')
test_data = pd.merge(test_data, test_order_time_min, on=['id'], how='left')
test_data = pd.merge(test_data, test_order_type_zaixian, on=['id'], how='left')
test_data = pd.merge(test_data, test_order_type_huodao, on=['id'], how='left')
test_data = pd.merge(test_data, is_double_, on=['id'], how='left')
test_data = pd.merge(test_data, is_0_0_0, on=['id'], how='left')
test_data = pd.merge(test_data, is_1_1_1, on=['id'], how='left')
test_data = pd.merge(test_data, is_0000_00_00, on=['id'], how='left')
test_data = pd.merge(test_data, is_0001_1_1, on=['id'], how='left')
test_data = pd.merge(test_data, is_hou_in, on=['id'], how='left')
test_data = pd.merge(test_data, tmp_tmp_recieve, on=['id'], how='left')
test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count, on=['id'], how='left')
test_data = pd.merge(test_data, tmp_tmp_recieve_phone_count_unique, on=['id'], how='left')
test_data = pd.merge(test_data, bank_phone_num, on=['id'], how='left')
test_data = pd.merge(test_data, is_hobby_df, on=['id'], how='left')
test_data = pd.merge(test_data, is_idcard_df, on=['id'], how='left')
test_data = pd.merge(test_data, auth_idcard_df, on=['id'], how='left')
test_data = pd.merge(test_data, auth_phone_df, on=['id'], how='left')
test_data['day_order_max'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_max']).days,
axis=1)
test_data = test_data.drop(['test_order_time_max'], axis=1)
test_data['day_order_min'] = test_data.apply(lambda row: (row['appl_sbm_tm'] - row['test_order_time_min']).days,
axis=1)
test_data = test_data.drop(['test_order_time_min'], axis=1)
"""order_info"""
order_time = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_time': len})
order_mean = test_order.groupby(by=['id'], as_index=False)['amt_order'].agg({'order_mean': np.mean})
unit_price_mean = test_order.groupby(by=['id'], as_index=False)['unit_price'].agg({'unit_price_mean': np.mean})
order_time_set = test_order.groupby(by=['id'], as_index=False)['time_order'].agg(
{'order_time_set': lambda x: len(set(x))})
test_data = pd.merge(test_data, order_time, on=['id'], how='left')
test_data = pd.merge(test_data, order_mean, on=['id'], how='left')
test_data = pd.merge(test_data, order_time_set, on=['id'], how='left')
test_data = pd.merge(test_data, unit_price_mean, on=['id'], how='left')
test_data = test_data.drop(
['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
test_data['target'] = -1
test_data.to_csv('8288test.csv', index=None)
train_data.to_csv('8288train.csv', index=None)
dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
train_test_data = train_test_data.fillna(0)
dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
train_test_data = train_test_data.drop(dummy_fea, axis=1)
train_train = train_test_data.iloc[:train_data.shape[0], :]
test_test = train_test_data.iloc[train_data.shape[0]:, :]
train_train_x = train_train.drop(['target'], axis=1)
test_test_x = test_test.drop(['target'], axis=1)
predict_result, modelee = xgb_feature(train_train_x, train_train['target'].values, test_test_x, None)
ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
ans['PROB'] = predict_result
ans = ans.drop(['appl_sbm_tm'], axis=1)
minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
ans.to_csv('04094test.csv', index=None)
stacking.py
# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import numpy as np
def xgb_feature(X_train, y_train, X_test, y_test=None):
# model parameters
params = {'booster': 'gbtree',
'objective': 'rank:pairwise',
'eval_metric': 'auc',
'eta': 0.02,
'max_depth': 5, # 4 3
'colsample_bytree': 0.7, # 0.8
'subsample': 0.7,
'min_child_weight': 1, # 2 3
'seed': 1111,
'silent': 1
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dvali = xgb.DMatrix(X_test)
model = xgb.train(params, dtrain, num_boost_round=800)
predict = model.predict(dvali)
minmin = min(predict)
maxmax = max(predict)
vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
return vfunc(predict)
def xgb_feature2(X_train, y_train, X_test, y_test=None):
# model parameters
params = {'booster': 'gbtree',
'objective': 'rank:pairwise',
'eval_metric': 'auc',
'eta': 0.015,
'max_depth': 5, # 4 3
'colsample_bytree': 0.7, # 0.8
'subsample': 0.7,
'min_child_weight': 1, # 2 3
'seed': 11,
'silent': 1
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dvali = xgb.DMatrix(X_test)
model = xgb.train(params, dtrain, num_boost_round=1200)
predict = model.predict(dvali)
minmin = min(predict)
maxmax = max(predict)
vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
return vfunc(predict)
def xgb_feature3(X_train, y_train, X_test, y_test=None):
# model parameters
params = {'booster': 'gbtree',
'objective': 'rank:pairwise',
'eval_metric': 'auc',
'eta': 0.01,
'max_depth': 5, # 4 3
'colsample_bytree': 0.7, # 0.8
'subsample': 0.7,
'min_child_weight': 1, # 2 3
'seed': 1,
'silent': 1
}
dtrain = xgb.DMatrix(X_train, label=y_train)
dvali = xgb.DMatrix(X_test)
model = xgb.train(params, dtrain, num_boost_round=2000)
predict = model.predict(dvali)
minmin = min(predict)
maxmax = max(predict)
vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
return vfunc(predict)
def et_model(X_train, y_train, X_test, y_test=None):
model = ExtraTreesClassifier(max_features='log2', n_estimators=1000, n_jobs=-1).fit(X_train, y_train)
return model.predict_proba(X_test)[:, 1]
def gbdt_model(X_train, y_train, X_test, y_test=None):
model = GradientBoostingClassifier(learning_rate=0.02, max_features=0.7, n_estimators=700, max_depth=5).fit(X_train,
y_train)
predict = model.predict_proba(X_test)[:, 1]
minmin = min(predict)
maxmax = max(predict)
vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
return vfunc(predict)
def logistic_model(X_train, y_train, X_test, y_test=None):
model = LogisticRegression(penalty='l2').fit(X_train, y_train)
return model.predict_proba(X_test)[:, 1]
def lgb_feature(X_train, y_train, X_test, y_test=None):
lgb_train = lgb.Dataset(X_train, y_train,
categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
'account_grade', 'industry'})
lgb_test = lgb.Dataset(X_test,
categorical_feature={'sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound',
'account_grade', 'industry'})
params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'binary',
'metric': 'auc',
'num_leaves': 25,
'learning_rate': 0.01,
'feature_fraction': 0.7,
'bagging_fraction': 0.7,
'bagging_freq': 5,
'min_data_in_leaf': 5,
'max_bin': 200,
'verbose': 0,
}
gbm = lgb.train(params,
lgb_train,
num_boost_round=2000)
predict = gbm.predict(X_test)
minmin = min(predict)
maxmax = max(predict)
vfunc = np.vectorize(lambda x: (x - minmin) / (maxmax - minmin))
return vfunc(predict)
if __name__ == '__main__':
VAILD = False
if VAILD == True:
# training data
train_data = pd.read_csv('8288train.csv', engine='python')
# train_data = train_data.drop(
# ['appl_sbm_tm', 'id', 'id_card_x', 'auth_time', 'phone', 'birthday', 'hobby', 'id_card_y'], axis=1)
# fill missing values
train_data = train_data.fillna(0)
# # test data
# test_data = pd.read_csv('8288test.csv', engine='python')
# # fill NaN with 0
# test_data = test_data.fillna(0)
# One-Hot
dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
dummy_df = pd.get_dummies(train_data.loc[:, dummy_fea])
# append the one-hot features
train_data_copy = pd.concat([train_data, dummy_df], axis=1)
# fill NaN with 0
train_data_copy = train_data_copy.fillna(0)
# drop the original categorical columns
vaild_train_data = train_data_copy.drop(dummy_fea, axis=1)
# training split
valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
# validation split
valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
# training features
vaild_train_x = valid_train_train.drop(['target'], axis=1)
# validation features
vaild_test_x = valid_train_test.drop(['target'], axis=1)
# logistic regression baseline
predict_result = logistic_model(vaild_train_x, valid_train_train['target'].values, vaild_test_x, None)
print('valid auc', roc_auc_score(valid_train_test['target'].values, predict_result))
# dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound','account_grade','industry']
# for _fea in dummy_fea:
# print(_fea)
# le = LabelEncoder()
# le.fit(train_data[_fea].tolist())
# train_data[_fea] = le.transform(train_data[_fea].tolist())
# train_data_copy = train_data.copy()
# vaild_train_data = train_data_copy
# valid_train_train = vaild_train_data[(vaild_train_data.year <= 2017) & (vaild_train_data.month < 4)]
# valid_train_test = vaild_train_data[(vaild_train_data.year >= 2017) & (vaild_train_data.month >= 4)]
# vaild_train_x = valid_train_train.drop(['target'],axis=1)
# vaild_test_x = valid_train_test.drop(['target'],axis=1)
# redict_result = lgb_feature(vaild_train_x,valid_train_train['target'].values,vaild_test_x,None)
# print('valid auc',roc_auc_score(valid_train_test['target'].values,redict_result))
if VAILD == False:
# training data
train_data = pd.read_csv('8288train.csv', engine='python')
train_data = train_data.fillna(0)
# test data
test_data = pd.read_csv('8288test.csv', engine='python')
test_data = test_data.fillna(0)
# concatenate training and test data
train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
train_test_data = train_test_data.fillna(0)
# split back into train and test parts
train_data = train_test_data.iloc[:train_data.shape[0], :]
test_data = train_test_data.iloc[train_data.shape[0]:, :]
dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
for _fea in dummy_fea:
print(_fea)
le = LabelEncoder()
le.fit(train_data[_fea].tolist() + test_data[_fea].tolist())
tmp = le.transform(train_data[_fea].tolist() + test_data[_fea].tolist())
train_data[_fea] = tmp[:train_data.shape[0]]
test_data[_fea] = tmp[train_data.shape[0]:]
train_x = train_data.drop(['target'], axis=1)
test_x = test_data.drop(['target'], axis=1)
lgb_dataset = Dataset(train_x, train_data['target'], test_x, use_cache=False)
"""**********************************************************"""
# training data
train_data = pd.read_csv('8288train.csv', engine='python')
train_data = train_data.fillna(0)
# test data
test_data = pd.read_csv('8288test.csv', engine='python')
test_data = test_data.fillna(0)
# concatenate training and test data
train_test_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
train_test_data = train_test_data.fillna(0)
# One-Hot
dummy_fea = ['sex', 'merriage', 'income', 'qq_bound', 'degree', 'wechat_bound', 'account_grade', 'industry']
dummy_df = pd.get_dummies(train_test_data.loc[:, dummy_fea])
train_test_data = pd.concat([train_test_data, dummy_df], axis=1)
train_test_data = train_test_data.drop(dummy_fea, axis=1)
train_train = train_test_data.iloc[:train_data.shape[0], :]
test_test = train_test_data.iloc[train_data.shape[0]:, :]
train_train_x = train_train.drop(['target'], axis=1)
test_test_x = test_test.drop(['target'], axis=1)
xgb_dataset = Dataset(X_train=train_train_x, y_train=train_train['target'], X_test=test_test_x, y_test=None,
use_cache=False)
# heamy
model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature, name='xgb', use_cache=False)
model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2, name='xgb2', use_cache=False)
model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3, name='xgb3', use_cache=False)
model_lgb = Regressor(dataset=lgb_dataset, estimator=lgb_feature, name='lgb', use_cache=False)
model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model, name='gbdt', use_cache=False)
pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_gbdt)
stack_ds = pipeline.stack(k=5, seed=111, add_diff=False, full_test=True)
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression, parameters={'fit_intercept': False})
predict_result = stacker.predict()
ans = pd.read_csv('../AI_Risk_data_Btest_V2.0/test_list.csv', parse_dates=['appl_sbm_tm'])
ans['PROB'] = predict_result
ans = ans.drop(['appl_sbm_tm'], axis=1)
minmin, maxmax = min(ans['PROB']), max(ans['PROB'])
ans['PROB'] = ans['PROB'].map(lambda x: (x - minmin) / (maxmax - minmin))
ans['PROB'] = ans['PROB'].map(lambda x: '%.4f' % x)
ans.to_csv('ans_stacking.csv', index=None)
Reference:https://github.com/chenkkkk/User-loan-risk-prediction
基本上每个企业应用系统都涉及到时间处理.我们知道,以前用java原生的Date+Calendar非常的不方便.后来Joda-Time诞生,这个专门处理日期/时间的库提供了DateTime类型,用它可以 ...