Python risk-control modeling in practice with Lending Club (recorded by the author; CatBoost and LightGBM modeling, 2K HD video)

https://study.163.com/course/courseMain.htm?courseId=1005988013&share=2&shareId=400000000398149

## 1. Data: Lending Club 2016 Q3 data (https://www.lendingclub.com/info/download-data.action)

Reference: http://kldavenport.com/lending-club-data-analysis-revisted-with-python/

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

```python
# skiprows=1 skips the disclaimer line at the top of the Lending Club CSV
df = pd.read_csv("./LoanStats_2016Q3.csv", skiprows=1, low_memory=False)
```

```python
df.info()
df.head(3)
```
```
    id  member_id  loan_amnt  funded_amnt  funded_amnt_inv       term int_rate  installment grade sub_grade  ...  sec_app_mths_since_last_major_derog
0  NaN        NaN    15000.0      15000.0          15000.0  36 months   13.99%       512.60     C        C3  ...                                  NaN
1  NaN        NaN     2600.0       2600.0           2600.0  36 months    8.99%        82.67     B        B1  ...                                  NaN
2  NaN        NaN    32200.0      32200.0          32200.0  60 months   21.49%       880.02     D        D5  ...                                  NaN

3 rows × 122 columns
```

## 2. Keep what we need

```python
# .ix[row slice, column slice]  (.ix is deprecated in newer pandas; .iloc is the equivalent here)
df.ix[:4, :7]
```

```
    id  member_id  loan_amnt  funded_amnt  funded_amnt_inv       term int_rate
0  NaN        NaN    15000.0      15000.0          15000.0  36 months   13.99%
1  NaN        NaN     2600.0       2600.0           2600.0  36 months    8.99%
2  NaN        NaN    32200.0      32200.0          32200.0  60 months   21.49%
3  NaN        NaN    10000.0      10000.0          10000.0  36 months   11.49%
4  NaN        NaN     6000.0       6000.0           6000.0  36 months   13.49%
```
```python
df.drop('id', 1, inplace=True)
df.drop('member_id', 1, inplace=True)
```

```python
# strip the '%' sign and convert the interest rate to a float
df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)
```
```python
df.ix[:4, :7]
```

```
   loan_amnt  funded_amnt  funded_amnt_inv       term  int_rate  installment grade
0    15000.0      15000.0          15000.0  36 months     13.99       512.60     C
1     2600.0       2600.0           2600.0  36 months      8.99        82.67     B
2    32200.0      32200.0          32200.0  60 months     21.49       880.02     D
3    10000.0      10000.0          10000.0  36 months     11.49       329.72     B
4     6000.0       6000.0           6000.0  36 months     13.49       203.59     C
```

### Loan Amount Requested Versus the Funded Amount

```python
print((df.loan_amnt != df.funded_amnt).value_counts())
```

```
False    99120
True         4
dtype: int64
```

```python
df.query('loan_amnt != funded_amnt').head(5)
```

```
       loan_amnt  funded_amnt  funded_amnt_inv term int_rate  installment grade sub_grade emp_title emp_length  ...  sec_app_mths_since_last_major_derog
99120        NaN          NaN              NaN  NaN      NaN          NaN   NaN       NaN       NaN        NaN  ...                                  NaN
99121        NaN          NaN              NaN  NaN      NaN          NaN   NaN       NaN       NaN        NaN  ...                                  NaN
99122        NaN          NaN              NaN  NaN      NaN          NaN   NaN       NaN       NaN        NaN  ...                                  NaN
99123        NaN          NaN              NaN  NaN      NaN          NaN   NaN       NaN       NaN        NaN  ...                                  NaN

4 rows × 120 columns
```

The four mismatched rows are entirely NaN (trailing summary lines in the raw CSV), so rows and columns that are all NaN can simply be dropped:

```python
df.dropna(axis=0, how='all', inplace=True)
df.info()
```

```python
df.dropna(axis=1, how='all', inplace=True)
df.info()
```

```python
df.ix[:5, 8:15]
```
```
              emp_title emp_length home_ownership  annual_inc verification_status issue_d loan_status
0       Fiscal Director    2 years           RENT     55000.0        Not Verified  Sep-16     Current
1    Loaner Coordinator    3 years           RENT     35000.0     Source Verified  Sep-16  Fully Paid
2  warehouse/supervisor  10+ years       MORTGAGE     65000.0        Not Verified  Sep-16  Fully Paid
3               Teacher  10+ years            OWN     55900.0        Not Verified  Sep-16     Current
4           SERVICE MGR    5 years           RENT     33000.0        Not Verified  Sep-16     Current
5       General Manager  10+ years       MORTGAGE    109000.0     Source Verified  Sep-16     Current
```

### emp_title: employment title

```python
print(df.emp_title.value_counts().head())
print(df.emp_title.value_counts().tail())
df.emp_title.unique().shape
```

```
Teacher       1931
Manager       1701
Owner          990
Supervisor     785
Driver         756
Name: emp_title, dtype: int64
Agent Services Representative           1
Operator Bridge Tunnel                  1
Reg Medical Assistant/Referral Spec.    1
Home Health Care                        1
rounds cook                             1
Name: emp_title, dtype: int64
(37421,)
```

```python
# 37,421 distinct free-text titles: far too high-cardinality to encode, so drop it
df.drop(['emp_title'], 1, inplace=True)
```

```python
df.ix[:5, 8:15]
```

```
  emp_length home_ownership  annual_inc verification_status issue_d loan_status pymnt_plan
0    2 years           RENT     55000.0        Not Verified  Sep-16     Current          n
1    3 years           RENT     35000.0     Source Verified  Sep-16  Fully Paid          n
2  10+ years       MORTGAGE     65000.0        Not Verified  Sep-16  Fully Paid          n
3  10+ years            OWN     55900.0        Not Verified  Sep-16     Current          n
4    5 years           RENT     33000.0        Not Verified  Sep-16     Current          n
5  10+ years       MORTGAGE    109000.0     Source Verified  Sep-16     Current          n
```

### emp_length: employment length

```python
df.emp_length.value_counts()
```

```
10+ years    34219
2 years       9066
3 years       7925
...
```

```python
# normalize emp_length: 'n/a' -> NaN -> 0, then keep only the digits ('10+ years' -> 10)
df.replace('n/a', np.nan, inplace=True)
df.emp_length.fillna(value=0, inplace=True)
df['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['emp_length'] = df['emp_length'].astype(int)
```

```python
df.emp_length.value_counts()
```

```
10    34219
1     14095
2      9066
3      7925
5      6170
4      6022
0      5922
6      4406
8      4168
9      3922
7      3205
Name: emp_length, dtype: int64
```

### verification_status: "Indicates if income was verified by LC, not verified, or if the income source was verified"

```python
df.verification_status.value_counts()
```

```
Source Verified    40781
Verified           31356
Not Verified       26983
Name: verification_status, dtype: int64
```

### Target: Loan Statuses

```python
df.info()
df.columns
```

```
Index([u'loan_amnt', u'funded_amnt', u'funded_amnt_inv', u'term', u'int_rate',
       u'installment', u'grade', u'sub_grade', u'emp_length', u'home_ownership',
       ...
       u'num_tl_90g_dpd_24m', u'num_tl_op_past_12m', u'pct_tl_nvr_dlq',
       u'percent_bc_gt_75', u'pub_rec_bankruptcies', u'tax_liens',
       u'tot_hi_cred_lim', u'total_bal_ex_mort', u'total_bc_limit',
       u'total_il_high_credit_limit'],
      dtype='object', length=107)
```

```python
pd.unique(df['loan_status'].values.ravel())
```

```
array(['Current', 'Fully Paid', 'Late (31-120 days)', 'Charged Off',
       'Late (16-30 days)', 'In Grace Period', 'Default'], dtype=object)
```

```python
for col in df.select_dtypes(include=['object']).columns:
    print("Column {} has {} unique instances".format(col, len(df[col].unique())))
```

```
Column term has 2 unique instances
Column grade has 7 unique instances
Column sub_grade has 35 unique instances
Column home_ownership has 4 unique instances
Column verification_status has 3 unique instances
Column issue_d has 3 unique instances
Column loan_status has 7 unique instances
Column pymnt_plan has 2 unique instances
Column desc has 6 unique instances
Column purpose has 13 unique instances
Column title has 13 unique instances
Column zip_code has 873 unique instances
Column addr_state has 50 unique instances
Column earliest_cr_line has 614 unique instances
Column revol_util has 1087 unique instances
Column initial_list_status has 2 unique instances
Column last_pymnt_d has 13 unique instances
Column next_pymnt_d has 4 unique instances
Column last_credit_pull_d has 14 unique instances
Column application_type has 3 unique instances
Column verification_status_joint has 2 unique instances
```

```python
# summarize object-typed columns: count, unique values, top value, and missing percentage
df.select_dtypes(include=['O']).describe().T.\
    assign(missing_pct=df.apply(lambda x: (len(x) - x.count()) / float(len(x))))
```

```
                           count unique                 top   freq  missing_pct
term                       99120      2           36 months  73898     0.000000
grade                      99120      7                   C  32846     0.000000
sub_grade                  99120     35                  B5   8322     0.000000
home_ownership             99120      4            MORTGAGE  46761     0.000000
verification_status        99120      3     Source Verified  40781     0.000000
issue_d                    99120      3              Aug-16  36280     0.000000
loan_status                99120      7             Current  79445     0.000000
pymnt_plan                 99120      2                   n  99074     0.000000
desc                           6      5                          2     0.999939
purpose                    99120     13  debt_consolidation  57682     0.000000
title                      93693     12  Debt consolidation  53999     0.054752
zip_code                   99120    873               112xx   1125     0.000000
addr_state                 99120     50                  CA  13352     0.000000
earliest_cr_line           99120    614              Aug-03    796     0.000000
revol_util                 99060   1086                  0%    440     0.000605
initial_list_status        99120      2                   w  71869     0.000000
last_pymnt_d               98991     12              Jun-17  81082     0.001301
next_pymnt_d               83552      3              Jul-17  83527     0.157062
last_credit_pull_d         99115     13              Jun-17  89280     0.000050
application_type           99120      3          INDIVIDUAL  98565     0.000000
verification_status_joint    517      1        Not Verified    517     0.994784
```
```python
df.revol_util = pd.Series(df.revol_util).str.replace('%', '').astype(float)
```

```python
# drop columns with a very high missing_pct
df.drop('desc', 1, inplace=True)
df.drop('verification_status_joint', 1, inplace=True)
```

```python
df.drop('zip_code', 1, inplace=True)
df.drop('addr_state', 1, inplace=True)
df.drop('earliest_cr_line', 1, inplace=True)
df.drop('revol_util', 1, inplace=True)
df.drop('purpose', 1, inplace=True)
df.drop('title', 1, inplace=True)
df.drop('term', 1, inplace=True)
df.drop('issue_d', 1, inplace=True)
# post-loan (after-origination) fields would leak the outcome into the features
df.drop(['out_prncp', 'out_prncp_inv', 'total_pymnt',
         'total_pymnt_inv', 'total_rec_prncp', 'grade', 'sub_grade'], 1, inplace=True)
df.drop(['total_rec_int', 'total_rec_late_fee',
         'recoveries', 'collection_recovery_fee'], 1, inplace=True)
df.drop(['last_pymnt_d', 'last_pymnt_amnt',
         'next_pymnt_d', 'last_credit_pull_d'], 1, inplace=True)
df.drop(['policy_code'], 1, inplace=True)
```
```python
df.info()
df.ix[:5, :10]
```

```
   loan_amnt  funded_amnt  funded_amnt_inv  int_rate  installment  emp_length home_ownership  annual_inc verification_status loan_status
0    15000.0      15000.0          15000.0     13.99       512.60           2           RENT     55000.0        Not Verified     Current
1     2600.0       2600.0           2600.0      8.99        82.67           3           RENT     35000.0     Source Verified  Fully Paid
2    32200.0      32200.0          32200.0     21.49       880.02          10       MORTGAGE     65000.0        Not Verified  Fully Paid
3    10000.0      10000.0          10000.0     11.49       329.72          10            OWN     55900.0        Not Verified     Current
4     6000.0       6000.0           6000.0     13.49       203.59           5           RENT     33000.0        Not Verified     Current
5    30000.0      30000.0          30000.0     13.99       697.90          10       MORTGAGE    109000.0     Source Verified     Current
```
```python
df.ix[:5, 10:21]
```

```
  pymnt_plan    dti  delinq_2yrs  inq_last_6mths  mths_since_last_delinq  mths_since_last_record  open_acc  pub_rec  revol_bal  total_acc initial_list_status
0          n  23.78          1.0             0.0                     7.0                     NaN      22.0      0.0    21345.0       43.0                   f
1          n   6.73          0.0             0.0                     NaN                     NaN      14.0      0.0      720.0       24.0                   w
2          n  11.71          0.0             1.0                     NaN                    87.0      17.0      1.0    11987.0       34.0                   w
3          n  26.21          0.0             2.0                     NaN                     NaN      15.0      0.0    17209.0       62.0                   w
4          n  19.05          0.0             0.0                     NaN                     NaN       3.0      0.0     4576.0       11.0                   f
5          n  16.24          0.0             0.0                     NaN                     NaN      17.0      0.0    11337.0       39.0                   w
```
```python
print(df.columns)
print(df.head(1).values)
df.info()
```

```
Index([u'loan_amnt', u'funded_amnt', u'funded_amnt_inv', u'int_rate',
       u'installment', u'emp_length', u'home_ownership', u'annual_inc',
       u'verification_status', u'loan_status', u'pymnt_plan', u'dti',
       u'delinq_2yrs', u'inq_last_6mths', u'mths_since_last_delinq',
       u'mths_since_last_record', u'open_acc', u'pub_rec', u'revol_bal',
       u'total_acc', u'initial_list_status', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'application_type',
       u'annual_inc_joint', u'dti_joint', u'acc_now_delinq', u'tot_coll_amt',
       u'tot_cur_bal', u'open_acc_6m', u'open_il_6m', u'open_il_12m',
       u'open_il_24m', u'mths_since_rcnt_il', u'total_bal_il', u'il_util',
       u'open_rv_12m', u'open_rv_24m', u'max_bal_bc', u'all_util',
       u'total_rev_hi_lim', u'inq_fi', u'total_cu_tl', u'inq_last_12m',
       u'acc_open_past_24mths', u'avg_cur_bal', u'bc_open_to_buy', u'bc_util',
       u'chargeoff_within_12_mths', u'delinq_amnt', u'mo_sin_old_il_acct',
       u'mo_sin_old_rev_tl_op', u'mo_sin_rcnt_rev_tl_op', u'mo_sin_rcnt_tl',
       u'mort_acc', u'mths_since_recent_bc', u'mths_since_recent_bc_dlq',
       u'mths_since_recent_inq', u'mths_since_recent_revol_delinq',
       u'num_accts_ever_120_pd', u'num_actv_bc_tl', u'num_actv_rev_tl',
       u'num_bc_sats', u'num_bc_tl', u'num_il_tl', u'num_op_rev_tl',
       u'num_rev_accts', u'num_rev_tl_bal_gt_0', u'num_sats',
       u'num_tl_120dpd_2m', u'num_tl_30dpd', u'num_tl_90g_dpd_24m',
       u'num_tl_op_past_12m', u'pct_tl_nvr_dlq', u'percent_bc_gt_75',
       u'pub_rec_bankruptcies', u'tax_liens', u'tot_hi_cred_lim',
       u'total_bal_ex_mort', u'total_bc_limit', u'total_il_high_credit_limit'],
      dtype='object')
[[15000.0 15000.0 15000.0 13.99 512.6 2 'RENT' 55000.0 'Not Verified'
  'Current' 'n' 23.78 1.0 0.0 7.0 nan 22.0 0.0 21345.0 43.0 'f' 0.0 nan
  'INDIVIDUAL' nan nan 0.0 0.0 140492.0 3.0 10.0 2.0 3.0 11.0 119147.0
  101.0 3.0 4.0 14612.0 83.0 39000.0 1.0 6.0 0.0 7.0 6386.0 9645.0 73.1
  0.0 0.0 157.0 248.0 4.0 4.0 0.0 4.0 7.0 22.0 7.0 0.0 5.0 9.0 6.0 7.0
  25.0 11.0 18.0 9.0 22.0 0.0 0.0 0.0 5.0 100.0 33.3 0.0 0.0 147587.0
  140492.0 30200.0 108587.0]]
```

```python
df.select_dtypes(include=['float']).describe().T.\
    assign(missing_pct=df.apply(lambda x: (len(x) - x.count()) / float(len(x))))
```

```
/Users/ting/anaconda/lib/python2.7/site-packages/numpy/lib/function_base.py:3834:
RuntimeWarning: Invalid value encountered in percentile
```

```
                                 count           mean            std       min       25%       50%        75%        max  missing_pct
loan_amnt                      99120.0   14170.570521    8886.138758   1000.00   7200.00  12000.00   20000.00   40000.00     0.000000
funded_amnt                    99120.0   14170.570521    8886.138758   1000.00   7200.00  12000.00   20000.00   40000.00     0.000000
funded_amnt_inv                99120.0   14166.087823    8883.301328   1000.00   7200.00  12000.00   20000.00   40000.00     0.000000
int_rate                       99120.0      13.723641       4.873910      5.32     10.49     12.79      15.59      30.99     0.000000
installment                    99120.0     432.718654     272.678596     30.12    235.24    361.38     569.83    1535.71     0.000000
annual_inc                     99120.0   78488.850081   72694.186060      0.00  48000.00  65448.00   94000.00 8400000.00     0.000000
dti                            99120.0      18.348651      64.057603      0.00     11.91     17.60      23.90    9999.00     0.000000
delinq_2yrs                    99120.0       0.381901       0.988996      0.00      0.00      0.00       0.00      21.00     0.000000
inq_last_6mths                 99120.0       0.570521       0.863796      0.00      0.00      0.00       1.00       5.00     0.000000
mths_since_last_delinq         53366.0      33.229172      21.820407      0.00       NaN       NaN        NaN     142.00     0.461602
mths_since_last_record         19792.0      67.267886      24.379343      0.00       NaN       NaN        NaN     119.00     0.800323
open_acc                       99120.0      11.718251       5.730585      1.00      8.00     11.00      15.00      86.00     0.000000
pub_rec                        99120.0       0.266596       0.719193      0.00      0.00      0.00       0.00      61.00     0.000000
revol_bal                      99120.0   15536.628047   21537.790599      0.00   5657.00  10494.00   18501.50  876178.00     0.000000
total_acc                      99120.0      24.033545      11.929761      2.00     15.00     22.00      31.00     119.00     0.000000
collections_12_mths_ex_med     99120.0       0.021640       0.168331      0.00      0.00      0.00       0.00      10.00     0.000000
mths_since_last_major_derog    29372.0      44.449612      22.254529      0.00       NaN       NaN        NaN     165.00     0.703672
annual_inc_joint                 517.0  118120.418472   51131.323819  26943.12       NaN       NaN        NaN  400000.00     0.994784
dti_joint                        517.0      18.637621       6.602016      2.56       NaN       NaN        NaN      48.58     0.994784
acc_now_delinq                 99120.0       0.006709       0.086902      0.00      0.00      0.00       0.00       4.00     0.000000
tot_coll_amt                   99120.0     281.797639    1840.699443      0.00      0.00      0.00       0.00  172575.00     0.000000
tot_cur_bal                    99120.0  138845.606144  156736.843591      0.00  28689.00  76447.50  207194.75 3764968.00     0.000000
open_acc_6m                    99120.0       0.978743       1.176973      0.00      0.00      1.00       2.00      13.00     0.000000
open_il_6m                     99120.0       2.825888       3.109225      0.00      1.00      2.00       3.00      43.00     0.000000
open_il_12m                    99120.0       0.723467       0.973888      0.00      0.00      0.00       1.00      13.00     0.000000
open_il_24m                    99120.0       1.624818       1.656628      0.00      0.00      1.00       2.00      26.00     0.000000
mths_since_rcnt_il             96469.0      21.362531      26.563455      0.00       NaN       NaN        NaN     503.00     0.026745
total_bal_il                   99120.0   35045.324193   41981.617996      0.00   9179.00  23199.00   45672.00 1547285.00     0.000000
il_util                        85480.0      71.599158      23.306731      0.00       NaN       NaN        NaN    1000.00     0.137611
open_rv_12m                    99120.0       1.408142       1.570068      0.00      0.00      1.00       2.00      24.00     0.000000
...
mo_sin_old_rev_tl_op           99120.0     177.634322      95.327498      3.00    115.00    160.00     227.00     901.00     0.000000
mo_sin_rcnt_rev_tl_op          99120.0      13.145369      16.695022      0.00      3.00      8.00      16.00     274.00     0.000000
mo_sin_rcnt_tl                 99120.0       7.833232       8.649843      0.00      3.00      5.00      10.00     268.00     0.000000
mort_acc                       99120.0       1.467585       1.799513      0.00      0.00      1.00       2.00      45.00     0.000000
mths_since_recent_bc           98067.0      23.623512      31.750632      0.00       NaN       NaN        NaN     546.00     0.010623
mths_since_recent_bc_dlq       26018.0      38.095280      22.798229      0.00       NaN       NaN        NaN     162.00     0.737510
mths_since_recent_inq          89254.0       6.626504       5.967648      0.00       NaN       NaN        NaN      25.00     0.099536
mths_since_recent_revol_delinq 36606.0      34.393132      22.371813      0.00       NaN       NaN        NaN     165.00     0.630690
num_accts_ever_120_pd          99120.0       0.594703       1.508027      0.00      0.00      0.00       1.00      36.00     0.000000
num_actv_bc_tl                 99120.0       3.628218       2.302668      0.00      2.00      3.00       5.00      47.00     0.000000
num_actv_rev_tl                99120.0       5.625272       3.400185      0.00      3.00      5.00       7.00      59.00     0.000000
num_bc_sats                    99120.0       4.645581       3.013399      0.00      3.00      4.00       6.00      61.00     0.000000
num_bc_tl                      99120.0       7.416041       4.546112      0.00      4.00      7.00      10.00      67.00     0.000000
num_il_tl                      99120.0       8.597437       7.528533      0.00      4.00      7.00      11.00     107.00     0.000000
num_op_rev_tl                  99120.0       8.198820       4.710348      0.00      5.00      7.00      10.00      79.00     0.000000
num_rev_accts                  99120.0      13.726312       7.963791      2.00      8.00     12.00      18.00     104.00     0.000000
num_rev_tl_bal_gt_0            99120.0       5.566293       3.286135      0.00      3.00      5.00       7.00      59.00     0.000000
num_sats                       99120.0      11.673497       5.709513      1.00      8.00     11.00      14.00      85.00     0.000000
num_tl_120dpd_2m               95661.0       0.001108       0.035695      0.00       NaN       NaN        NaN       4.00     0.034897
num_tl_30dpd                   99120.0       0.004348       0.068650      0.00      0.00      0.00       0.00       3.00     0.000000
num_tl_90g_dpd_24m             99120.0       0.101332       0.567112      0.00      0.00      0.00       0.00      20.00     0.000000
num_tl_op_past_12m             99120.0       2.254752       1.960084      0.00      1.00      2.00       3.00      24.00     0.000000
pct_tl_nvr_dlq                 99120.0      93.262828       9.696646      0.00     90.00     96.90     100.00     100.00     0.000000
percent_bc_gt_75               98006.0      42.681332      36.296425      0.00       NaN       NaN        NaN     100.00     0.011239
pub_rec_bankruptcies           99120.0       0.150262       0.407706      0.00      0.00      0.00       0.00       8.00     0.000000
tax_liens                      99120.0       0.075393       0.517275      0.00      0.00      0.00       0.00      61.00     0.000000
tot_hi_cred_lim                99120.0  172185.283394  175273.669652   2500.00  49130.75 108020.50  248473.25 3953111.00     0.000000
total_bal_ex_mort              99120.0   50818.694078   48976.640478      0.00  20913.00  37747.50   64216.25 1548128.00     0.000000
total_bc_limit                 99120.0   20862.228420   20721.900664      0.00   7700.00  14700.00   27000.00  520500.00     0.000000
total_il_high_credit_limit     99120.0   44066.340375   44473.458730      0.00  15750.00  33183.00   58963.25 2000000.00     0.000000

74 rows × 9 columns
```

```python
# joint-application fields are almost entirely missing
df.drop('annual_inc_joint', 1, inplace=True)
df.drop('dti_joint', 1, inplace=True)
```

```python
df.select_dtypes(include=['int']).describe().T.\
    assign(missing_pct=df.apply(lambda x: (len(x) - x.count()) / float(len(x))))
```

```
              count      mean       std  min  25%  50%   75%   max  missing_pct
emp_length  99120.0  5.757092  3.770359  0.0  2.0  6.0  10.0  10.0          0.0
```

### Target: Loan Statuses

```python
df['loan_status'].value_counts()
# .plot(kind='bar')
```

```
Current               79445
Fully Paid            13066
Charged Off            2502
Late (31-120 days)     2245
In Grace Period        1407
Late (16-30 days)       454
Default                   1
Name: loan_status, dtype: int64
```
```python
# binary target: 1 = good (Fully Paid, Current), 0 = bad (Late);
# the remaining statuses are set to NaN and dropped below
df.loan_status.replace('Fully Paid', int(1), inplace=True)
df.loan_status.replace('Current', int(1), inplace=True)
df.loan_status.replace('Late (16-30 days)', int(0), inplace=True)
df.loan_status.replace('Late (31-120 days)', int(0), inplace=True)
df.loan_status.replace('Charged Off', np.nan, inplace=True)
df.loan_status.replace('In Grace Period', np.nan, inplace=True)
df.loan_status.replace('Default', np.nan, inplace=True)
# df.loan_status.astype('int')
df.loan_status.value_counts()
```

```
1.0    92511
0.0     2699
Name: loan_status, dtype: int64
```

```python
# df.loan_status
df.dropna(subset=['loan_status'], inplace=True)
```

### Highly Correlated Data

```python
cor = df.corr()
cor.loc[:, :] = np.tril(cor, k=-1)  # keep only the lower triangle, below the main diagonal
cor = cor.stack()
cor[(cor > 0.55) | (cor < -0.55)]
```

```
funded_amnt                     loan_amnt                      1.000000
funded_amnt_inv                 loan_amnt                      0.999994
                                funded_amnt                    0.999994
installment                     loan_amnt                      0.953380
                                funded_amnt                    0.953380
                                funded_amnt_inv                0.953293
mths_since_last_delinq          delinq_2yrs                   -0.551275
total_acc                       open_acc                       0.722950
mths_since_last_major_derog     mths_since_last_delinq         0.685642
open_il_24m                     open_il_12m                    0.760219
total_bal_il                    open_il_6m                     0.566551
open_rv_12m                     open_acc_6m                    0.623975
open_rv_24m                     open_rv_12m                    0.774954
max_bal_bc                      revol_bal                      0.551409
all_util                        il_util                        0.594925
total_rev_hi_lim                revol_bal                      0.815351
inq_last_12m                    inq_fi                         0.563011
acc_open_past_24mths            open_acc_6m                    0.553181
                                open_il_24m                    0.570853
                                open_rv_12m                    0.657606
                                open_rv_24m                    0.848964
avg_cur_bal                     tot_cur_bal                    0.828457
bc_open_to_buy                  total_rev_hi_lim               0.626380
bc_util                         all_util                       0.569469
mo_sin_rcnt_tl                  mo_sin_rcnt_rev_tl_op          0.606065
mort_acc                        tot_cur_bal                    0.551198
mths_since_recent_bc            mo_sin_rcnt_rev_tl_op          0.614262
mths_since_recent_bc_dlq        mths_since_last_delinq         0.751613
                                mths_since_last_major_derog    0.553022
mths_since_recent_revol_delinq  mths_since_last_delinq         0.853573
...
num_sats                        total_acc                      0.720022
                                num_actv_bc_tl                 0.552957
                                num_actv_rev_tl                0.665429
                                num_bc_sats                    0.630778
                                num_op_rev_tl                  0.826946
                                num_rev_accts                  0.663595
                                num_rev_tl_bal_gt_0            0.668573
num_tl_30dpd                    acc_now_delinq                 0.801444
num_tl_90g_dpd_24m              delinq_2yrs                    0.669267
num_tl_op_past_12m              open_acc_6m                    0.722131
                                open_il_12m                    0.557902
                                open_rv_12m                    0.844841
                                open_rv_24m                    0.660265
                                acc_open_past_24mths           0.774867
pct_tl_nvr_dlq                  num_accts_ever_120_pd         -0.592502
percent_bc_gt_75                bc_util                        0.844108
pub_rec_bankruptcies            pub_rec                        0.580798
tax_liens                       pub_rec                        0.752084
tot_hi_cred_lim                 tot_cur_bal                    0.982693
                                avg_cur_bal                    0.795652
                                mort_acc                       0.560840
total_bal_ex_mort               total_bal_il                   0.902486
total_bc_limit                  max_bal_bc                     0.581536
                                total_rev_hi_lim               0.775151
                                bc_open_to_buy                 0.834159
                                num_bc_sats                    0.633461
total_il_high_credit_limit      open_il_6m                     0.552023
                                total_bal_il                   0.960349
                                num_il_tl                      0.583329
                                total_bal_ex_mort              0.889238
dtype: float64
```
```python
# loan_amnt, funded_amnt, funded_amnt_inv and installment are nearly identical; keep loan_amnt
df.drop(['funded_amnt', 'funded_amnt_inv', 'installment'], axis=1, inplace=True)
```
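Here the redundant columns are dropped by hand after reading the listing above. A small helper can generalize the step (a sketch, not from the original post; the `drop_correlated` name and the 0.95 threshold are illustrative choices):

```python
# Sketch: drop one column from every pair whose absolute correlation exceeds a threshold.
def drop_correlated(frame, threshold=0.95):
    cor = frame.corr().abs()
    # keep the strict upper triangle so each pair is inspected only once
    upper = cor.where(np.triu(np.ones(cor.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(to_drop, axis=1)
```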

## 3. Our Model

```python
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import ensemble
from sklearn.preprocessing import OneHotEncoder  # https://ljalphabeta.gitbooks.io/python-/content/categorical_data.html
```

```python
Y = df.loan_status
X = df.drop('loan_status', 1, inplace=False)
```

```python
print(Y.shape)
print(sum(Y))
```

```
(95210,)
92511.0
```
```python
# one-hot encode the remaining categorical columns
X = pd.get_dummies(X)
```
```python
print(X.columns)
print(X.head(1).values)
X.info()
```

```
Index([u'loan_amnt', u'int_rate', u'emp_length', u'annual_inc', u'dti',
       u'delinq_2yrs', u'inq_last_6mths', u'mths_since_last_delinq',
       u'mths_since_last_record', u'open_acc', u'pub_rec', u'revol_bal',
       u'total_acc', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'acc_now_delinq', u'tot_coll_amt',
       u'tot_cur_bal', u'open_acc_6m', u'open_il_6m', u'open_il_12m',
       u'open_il_24m', u'mths_since_rcnt_il', u'total_bal_il', u'il_util',
       u'open_rv_12m', u'open_rv_24m', u'max_bal_bc', u'all_util',
       u'total_rev_hi_lim', u'inq_fi', u'total_cu_tl', u'inq_last_12m',
       u'acc_open_past_24mths', u'avg_cur_bal', u'bc_open_to_buy', u'bc_util',
       u'chargeoff_within_12_mths', u'delinq_amnt', u'mo_sin_old_il_acct',
       u'mo_sin_old_rev_tl_op', u'mo_sin_rcnt_rev_tl_op', u'mo_sin_rcnt_tl',
       u'mort_acc', u'mths_since_recent_bc', u'mths_since_recent_bc_dlq',
       u'mths_since_recent_inq', u'mths_since_recent_revol_delinq',
       u'num_accts_ever_120_pd', u'num_actv_bc_tl', u'num_actv_rev_tl',
       u'num_bc_sats', u'num_bc_tl', u'num_il_tl', u'num_op_rev_tl',
       u'num_rev_accts', u'num_rev_tl_bal_gt_0', u'num_sats',
       u'num_tl_120dpd_2m', u'num_tl_30dpd', u'num_tl_90g_dpd_24m',
       u'num_tl_op_past_12m', u'pct_tl_nvr_dlq', u'percent_bc_gt_75',
       u'pub_rec_bankruptcies', u'tax_liens', u'tot_hi_cred_lim',
       u'total_bal_ex_mort', u'total_bc_limit', u'total_il_high_credit_limit',
       u'home_ownership_ANY', u'home_ownership_MORTGAGE',
       u'home_ownership_OWN', u'home_ownership_RENT',
       u'verification_status_Not Verified',
       u'verification_status_Source Verified', u'verification_status_Verified',
       u'pymnt_plan_n', u'pymnt_plan_y', u'initial_list_status_f',
       u'initial_list_status_w', u'application_type_DIRECT_PAY',
       u'application_type_INDIVIDUAL', u'application_type_JOINT'],
      dtype='object')
[[  1.50000000e+04   1.39900000e+01   2.00000000e+00   5.50000000e+04
    2.37800000e+01   1.00000000e+00   0.00000000e+00   7.00000000e+00
    nan              2.20000000e+01   0.00000000e+00   2.13450000e+04
    4.30000000e+01   0.00000000e+00   nan              0.00000000e+00
    0.00000000e+00   1.40492000e+05   3.00000000e+00   1.00000000e+01
    2.00000000e+00   3.00000000e+00   1.10000000e+01   1.19147000e+05
    1.01000000e+02   3.00000000e+00   4.00000000e+00   1.46120000e+04
    8.30000000e+01   3.90000000e+04   1.00000000e+00   6.00000000e+00
    0.00000000e+00   7.00000000e+00   6.38600000e+03   9.64500000e+03
    7.31000000e+01   0.00000000e+00   0.00000000e+00   1.57000000e+02
    2.48000000e+02   4.00000000e+00   4.00000000e+00   0.00000000e+00
    4.00000000e+00   7.00000000e+00   2.20000000e+01   7.00000000e+00
    0.00000000e+00   5.00000000e+00   9.00000000e+00   6.00000000e+00
    7.00000000e+00   2.50000000e+01   1.10000000e+01   1.80000000e+01
    9.00000000e+00   2.20000000e+01   0.00000000e+00   0.00000000e+00
    0.00000000e+00   5.00000000e+00   1.00000000e+02   3.33000000e+01
    0.00000000e+00   0.00000000e+00   1.47587000e+05   1.40492000e+05
    3.02000000e+04   1.08587000e+05   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   1.00000000e+00   0.00000000e+00
    0.00000000e+00   1.00000000e+00   0.00000000e+00   1.00000000e+00
    0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00]]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 95210 entries, 0 to 99119
Data columns (total 84 columns):
loan_amnt                              95210 non-null float64
int_rate                               95210 non-null float64
emp_length                             95210 non-null int64
annual_inc                             95210 non-null float64
dti                                    95210 non-null float64
delinq_2yrs                            95210 non-null float64
inq_last_6mths                         95210 non-null float64
mths_since_last_delinq                 51229 non-null float64
mths_since_last_record                 18903 non-null float64
open_acc                               95210 non-null float64
pub_rec                                95210 non-null float64
revol_bal                              95210 non-null float64
total_acc                              95210 non-null float64
collections_12_mths_ex_med             95210 non-null float64
mths_since_last_major_derog            28125 non-null float64
acc_now_delinq                         95210 non-null float64
tot_coll_amt                           95210 non-null float64
tot_cur_bal                            95210 non-null float64
open_acc_6m                            95210 non-null float64
open_il_6m                             95210 non-null float64
open_il_12m                            95210 non-null float64
open_il_24m                            95210 non-null float64
mths_since_rcnt_il                     92660 non-null float64
total_bal_il                           95210 non-null float64
il_util                                82017 non-null float64
open_rv_12m                            95210 non-null float64
open_rv_24m                            95210 non-null float64
max_bal_bc                             95210 non-null float64
all_util                               95204 non-null float64
total_rev_hi_lim                       95210 non-null float64
inq_fi                                 95210 non-null float64
total_cu_tl                            95210 non-null float64
inq_last_12m                           95210 non-null float64
acc_open_past_24mths                   95210 non-null float64
avg_cur_bal                            95210 non-null float64
bc_open_to_buy                         94160 non-null float64
bc_util                                94126 non-null float64
chargeoff_within_12_mths               95210 non-null float64
delinq_amnt                            95210 non-null float64
mo_sin_old_il_acct                     92660 non-null float64
mo_sin_old_rev_tl_op                   95210 non-null float64
mo_sin_rcnt_rev_tl_op                  95210 non-null float64
mo_sin_rcnt_tl                         95210 non-null float64
mort_acc                               95210 non-null float64
mths_since_recent_bc                   94212 non-null float64
mths_since_recent_bc_dlq               24968 non-null float64
mths_since_recent_inq                  85581 non-null float64
mths_since_recent_revol_delinq         35158 non-null float64
num_accts_ever_120_pd                  95210 non-null float64
num_actv_bc_tl                         95210 non-null float64
num_actv_rev_tl                        95210 non-null float64
num_bc_sats                            95210 non-null float64
num_bc_tl                              95210 non-null float64
num_il_tl                              95210 non-null float64
num_op_rev_tl                          95210 non-null float64
num_rev_accts                          95210 non-null float64
num_rev_tl_bal_gt_0                    95210 non-null float64
num_sats                               95210 non-null float64
num_tl_120dpd_2m                       91951 non-null float64
num_tl_30dpd                           95210 non-null float64
num_tl_90g_dpd_24m                     95210 non-null float64
num_tl_op_past_12m                     95210 non-null float64
pct_tl_nvr_dlq                         95210 non-null float64
percent_bc_gt_75                       94156 non-null float64
pub_rec_bankruptcies                   95210 non-null float64
tax_liens                              95210 non-null float64
tot_hi_cred_lim                        95210 non-null float64
total_bal_ex_mort                      95210 non-null float64
total_bc_limit                         95210 non-null float64
total_il_high_credit_limit             95210 non-null float64
home_ownership_ANY                     95210 non-null float64
home_ownership_MORTGAGE                95210 non-null float64
home_ownership_OWN                     95210 non-null float64
home_ownership_RENT                    95210 non-null float64
verification_status_Not Verified       95210 non-null float64
verification_status_Source Verified    95210 non-null float64
verification_status_Verified           95210 non-null float64
pymnt_plan_n                           95210 non-null float64
pymnt_plan_y                           95210 non-null float64
initial_list_status_f                  95210 non-null float64
initial_list_status_w                  95210 non-null float64
application_type_DIRECT_PAY            95210 non-null float64
application_type_INDIVIDUAL            95210 non-null float64
application_type_JOINT                 95210 non-null float64
dtypes: float64(83), int64(1)
memory usage: 61.7 MB
```
```python
# impute the remaining missing values with zero
X.fillna(0.0, inplace=True)
X.fillna(0, inplace=True)
```
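Zero-filling treats "no record" and "zero months since the last record" as the same thing for the `mths_since_*` fields. One common alternative, sketched below under that assumption (the indicator columns are not part of the original notebook), keeps an explicit missing-value flag before filling:

```python
# Sketch: add missing-value indicator columns before zero-filling (hypothetical variant).
for col in ['mths_since_last_delinq', 'mths_since_last_record',
            'mths_since_last_major_derog']:
    X[col + '_missing'] = X[col].isnull().astype(int)
X.fillna(0, inplace=True)
```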

### Train Data & Test Data

```python
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=123)
```

```python
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
```

```
(66647, 84)
(66647,)
(28563, 84)
(28563,)
```

```python
print(y_train.value_counts())
print(y_test.value_counts())
```

```
1.0    64712
0.0     1935
Name: loan_status, dtype: int64
1.0    27799
0.0      764
Name: loan_status, dtype: int64
```
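Only about 2.9% of the labels are bad, so a purely random split can shift the class ratio between train and test. If that matters, the same call can stratify on the target (a variation on the split above, not used in the original):

```python
# Sketch: stratified split that preserves the good/bad ratio in both partitions.
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=.3, random_state=123, stratify=Y)
```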

### Gradient Boosting Regression Tree

```python
# full grid (slow):
# param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
#               'max_depth': [1, 2, 3, 4],
#               'min_samples_split': [50, 100, 200, 400],
#               'n_estimators': [100, 200, 400, 800]}
param_grid = {'learning_rate': [0.1],
              'max_depth': [2],
              'min_samples_split': [50, 100],
              'n_estimators': [100, 200]}
# param_grid = {'learning_rate': [0.1],
#               'max_depth': [4],
#               'min_samples_leaf': [3],
#               'max_features': [1.0]}
est = GridSearchCV(ensemble.GradientBoostingRegressor(),
                   param_grid, n_jobs=4, refit=True)
est.fit(x_train, y_train)
best_params = est.best_params_
print(best_params)
```

```
{'min_samples_split': 100, 'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3}
```
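Because `refit=True`, the grid-search object is refit on the whole training set with the best parameters and can be used directly (a usage note, not in the original post):

```python
# est.best_estimator_ is the refit model; the search object can score held-out data itself.
print(est.best_estimator_)
print(est.score(x_test, y_test))
```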
```python
%%time
est = ensemble.GradientBoostingRegressor(min_samples_split=50, n_estimators=300,
                                         learning_rate=0.1, max_depth=1,
                                         random_state=0, loss='ls').fit(x_train, y_train)
```

```
CPU times: user 24.2 s, sys: 251 ms, total: 24.4 s
Wall time: 25.6 s
```

```python
est.score(x_test, y_test)
```

```
0.028311715416075908
```
```python
%%time
est = ensemble.GradientBoostingRegressor(min_samples_split=50, n_estimators=100,
                                         learning_rate=0.1, max_depth=2,
                                         random_state=0, loss='ls').fit(x_train, y_train)
```

```
CPU times: user 20 s, sys: 272 ms, total: 20.3 s
Wall time: 21.6 s
```

```python
est.score(x_test, y_test)
```

```
0.029210266192750467
```
```python
def compute_ks(data):
    # KS statistic: the maximum gap between the cumulative distributions of bads
    # and goods when sorted by score; label 1 = good, label 0 = bad
    sorted_list = data.sort_values(['predict'], ascending=[True])
    total_bad = (sorted_list['label'] == 0).sum()
    total_good = sorted_list.shape[0] - total_bad
    # print("total_bad =", total_bad)
    # print("total_good =", total_good)
    max_ks = 0.0
    good_count = 0.0
    bad_count = 0.0
    for index, row in sorted_list.iterrows():
        if row['label'] == 0:
            bad_count += 1.0
        else:
            good_count += 1.0
        val = bad_count / total_bad - good_count / total_good
        max_ks = max(max_ks, val)
    return max_ks
```
```python
test_pd = pd.DataFrame()
test_pd['predict'] = est.predict(x_test)
test_pd['label'] = y_test.values  # .values avoids index misalignment with the fresh DataFrame
# df['predict'] = est.predict(x_test)
print(compute_ks(test_pd[['label', 'predict']]))
```

```
0.0
```
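The loop above iterates row by row; an equivalent and much faster KS can be computed with scipy's two-sample Kolmogorov-Smirnov statistic on the score distributions of goods and bads (a sketch, assuming labels are 0/1 as constructed earlier):

```python
# Sketch: vectorized KS via scipy, comparing score distributions of goods vs. bads.
from scipy.stats import ks_2samp

def compute_ks_fast(data):
    good_scores = data.loc[data['label'] == 1, 'predict']
    bad_scores = data.loc[data['label'] == 0, 'predict']
    return ks_2samp(good_scores, bad_scores)[0]  # [0] = the KS statistic
```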
```python
# Top ten most important features
feature_importance = est.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices], color='dodgerblue', alpha=.4)
plt.yticks(np.arange(10) + 0.25, np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')
```

### Other Models

```python
import xgboost as xgb
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
```
```python
# XGBoost
clf2 = xgb.XGBClassifier(n_estimators=50, max_depth=1,
                         learning_rate=0.01, subsample=0.8, colsample_bytree=0.3,
                         scale_pos_weight=3.0, silent=True, nthread=-1, seed=0,
                         missing=None, objective='binary:logistic',
                         reg_alpha=1, reg_lambda=1, gamma=0, min_child_weight=1,
                         max_delta_step=0, base_score=0.5)
clf2.fit(x_train, y_train)
print(clf2.score(x_test, y_test))
test_pd2 = pd.DataFrame()
test_pd2['predict'] = clf2.predict(x_test)
test_pd2['label'] = y_test.values
print(compute_ks(test_pd2[['label', 'predict']]))
print(clf2.feature_importances_)
# Top ten most important features
feature_importance = clf2.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices], color='dodgerblue', alpha=.4)
plt.yticks(np.arange(10) + 0.25, np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')
```

```
0.973252109372
0.0
[ 0.          0.30769232  0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.05128205
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.05128205  0.30769232  0.2820513   0.          0.          0.          0.
  0.        ]
```

```python
# Random forest regressor
clf3 = RandomForestRegressor(n_jobs=-1, max_depth=10, random_state=0)
clf3.fit(x_train, y_train)
print(clf3.score(x_test, y_test))
test_pd3 = pd.DataFrame()
test_pd3['predict'] = clf3.predict(x_test)
test_pd3['label'] = y_test.values
print(compute_ks(test_pd3[['label', 'predict']]))
print(clf3.feature_importances_)
# Top ten most important features
feature_importance = clf3.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices], color='dodgerblue', alpha=.4)
plt.yticks(np.arange(10) + 0.25, np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')
```

```
0.0148713087517
0.0
[ 0.02588781  0.10778862  0.00734994  0.02090219  0.02231172  0.00778016
  0.00556834  0.01097013  0.00734689  0.0017027   0.00622544  0.01140843
  0.00530896  0.00031185  0.01135318  0.          0.01488991  0.01840559
  0.00585621  0.00652523  0.0066759   0.00727607  0.00955013  0.01004672
  0.01785864  0.00855197  0.00985739  0.01477432  0.02184904  0.01816184
  0.00878854  0.02078236  0.01310288  0.00844302  0.01596395  0.01825196
  0.01817367  0.00297759  0.00084823  0.02808718  0.02917066  0.00897034
  0.01139324  0.01532409  0.01467681  0.0032855   0.01066291  0.00581661
  0.00955357  0.00417743  0.01333577  0.00489264  0.0128039   0.01340195
  0.01286394  0.01619219  0.00395603  0.00508973  0.          0.00234757
  0.00378329  0.00502684  0.01732834  0.01178674  0.00030035  0.01189509
  0.00942532  0.00841645  0.01571355  0.00288054  0.          0.0011667
  0.00106548  0.00488734  0.          0.00200132  0.00062765  0.04130873
  0.10076558  0.00022293  0.00165858  0.00308408  0.0008255   0.        ]
```

```python
# Extra-trees regressor
clf4 = ExtraTreesRegressor(n_jobs=-1, max_depth=10, random_state=0)
clf4.fit(x_train, y_train)
print(clf4.score(x_test, y_test))
test_pd4 = pd.DataFrame()
test_pd4['predict'] = clf4.predict(x_test)
test_pd4['label'] = y_test.values
print(compute_ks(test_pd4[['label', 'predict']]))
print(clf4.feature_importances_)
# Top ten most important features
feature_importance = clf4.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices], color='dodgerblue', alpha=.4)
plt.yticks(np.arange(10) + 0.25, np.array(X.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')
```

```
0.020808034579
0.0
[ 0.00950112  0.17496689  0.00476969  0.00538677  0.00898343  0.01604885
  0.0139889   0.00605683  0.0042762   0.00358536  0.0144985   0.00915189
  0.00643305  0.00637134  0.0050764   0.00218012  0.00925068  0.00363339
  0.00988441  0.00645297  0.00662444  0.00934969  0.00739012  0.00635592
  0.00633908  0.00923972  0.01263829  0.01190224  0.00914159  0.00402144
  0.00917841  0.01456563  0.01161155  0.01097394  0.00506868  0.00772159
  0.00560163  0.01132941  0.00172528  0.0085601   0.01282485  0.00970629
  0.00956066  0.00731205  0.02087289  0.00430205  0.0062769   0.00765693
  0.00922104  0.00296456  0.00563208  0.00459181  0.0133819   0.00548208
  0.00450864  0.0132415   0.00677772  0.00509891  0.00108962  0.00578448
  0.00934323  0.00715127  0.01078137  0.00855071  0.00695096  0.01488993
  0.00317962  0.00485367  0.00476553  0.00509674  0.          0.00733654
  0.00097223  0.00380448  0.00534715  0.00356893  0.0128526   0.11944538
  0.11758343  0.00195945  0.00225379  0.00243429  0.0007562   0.        ]
```
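The course title also mentions LightGBM and CatBoost. A minimal LightGBM run can be slotted in exactly like the models above (a sketch with assumed, untuned hyperparameters; not part of the original notebook):

```python
# Sketch: LightGBM classifier with illustrative hyperparameters.
import lightgbm as lgb

clf5 = lgb.LGBMClassifier(n_estimators=200, max_depth=4, learning_rate=0.05,
                          subsample=0.8, colsample_bytree=0.5, random_state=0)
clf5.fit(x_train, y_train)
test_pd5 = pd.DataFrame()
test_pd5['predict'] = clf5.predict_proba(x_test)[:, 1]  # probability of the good class
test_pd5['label'] = y_test.values
print(compute_ks(test_pd5[['label', 'predict']]))
```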

Homework:

1. Feature engineering (starter snippets below)

2. Stacking

3. Plot ROC and KS curves for comparison (a sketch follows the feature-engineering snippets)

```python
# Feature-engineering idea 1: histogram of each row's values
def get_histogram_features(full_dataset):
    def extract_histogram(x):
        # 8 bin edges produce 7 counts
        count, _ = np.histogram(x, bins=[0, 10, 100, 1000, 10000, 100000, 1000000, 9000000])
        return count
    column_names = ["hist_{}".format(i) for i in range(7)]
    hist = full_dataset.apply(lambda row: pd.Series(extract_histogram(row)), axis=1)
    hist.columns = column_names
    return hist

# Feature-engineering idea 2: row quantiles
q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
column_names = ["quantile_{}".format(i) for i in q]
# print(pd.DataFrame(x_train))
quantile = pd.DataFrame(x_train).quantile(q=q, axis=1).T
quantile.columns = column_names

# Feature-engineering idea 3: cumulative sums across a list of columns
def get_cumsum_features(all_features, full_dataset):
    column_names = ["cumsum_{}".format(i) for i in range(len(all_features))]
    cumsum = full_dataset[all_features].cumsum(axis=1)
    cumsum.columns = column_names
    return cumsum

# Feature-engineering idea 4: min-max feature scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_train_normal = scaler.fit_transform(x_train)
```
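For homework item 3, here is a sketch of ROC and KS curves built from the GBR test predictions (not from the original post; any fitted model's scores can be substituted):

```python
# Sketch: ROC curve via sklearn, and a KS curve from |TPR - FPR| across thresholds.
from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(test_pd['label'], test_pd['predict'])
plt.figure()
plt.plot(fpr, tpr, label='ROC (AUC = {:.3f})'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()

# the maximum of |TPR - FPR| over all thresholds is exactly the KS statistic
ks_values = np.abs(tpr - fpr)
plt.figure()
plt.plot(ks_values, label='KS curve (max = {:.3f})'.format(ks_values.max()))
plt.xlabel('Threshold index')
plt.legend()
```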

Python credit scorecard modeling (with code, recorded by the author)

