python信用评分卡(附代码,博主录制)

https://etav.github.io/python/vif_factor_python.html

Colinearity is the state where two variables are highly correlated and contain similiar information about the variance within a given dataset. To detect colinearity among variables, simply create a correlation matrix and find variables with large absolute values. In R use the corr function and in python this can by accomplished by using numpy's corrcoeffunction.

Multicolinearity on the other hand is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worst multicolinearity can emerge even when isolated pairs of variables are not colinear.

A common R function used for testing regression assumptions and specifically multicolinearity is "VIF()" and unlike many statistical concepts, its formula is straightforward:

$$ V.I.F. = 1 / (1 - R^2). $$

The Variance Inflation Factor (VIF) is a measure of colinearity among predictor variables within a multiple regression. It is calculated by taking the the ratio of the variance of all a given model's betas divide by the variane of a single beta if it were fit alone.

Steps for Implementing VIF

  1. Run a multiple regression.
  2. Calculate the VIF factors.
  3. Inspect the factors for each predictor variable, if the VIF is between 5-10, multicolinearity is likely present and you should consider dropping the variable.
#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor df = pd.read_csv('loan.csv')
df.dropna()
df = df._get_numeric_data() #drop non-numeric cols df.head()
  id member_id loan_amnt funded_amnt funded_amnt_inv int_rate installment annual_inc dti delinq_2yrs ... total_bal_il il_util open_rv_12m open_rv_24m max_bal_bc all_util total_rev_hi_lim inq_fi total_cu_tl inq_last_12m
0 1077501 1296599 5000.0 5000.0 4975.0 10.65 162.87 24000.0 27.65 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1077430 1314167 2500.0 2500.0 2500.0 15.27 59.83 30000.0 1.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1077175 1313524 2400.0 2400.0 2400.0 15.96 84.33 12252.0 8.72 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1076863 1277178 10000.0 10000.0 10000.0 13.49 339.31 49200.0 20.00 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1075358 1311748 3000.0 3000.0 3000.0 12.69 67.79 80000.0 17.94 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 51 columns

df = df[['annual_inc','loan_amnt', 'funded_amnt','annual_inc','dti']].dropna() #subset the dataframe

Step 1: Run a multiple regression

%%capture
#gather features
features = "+".join(df.columns - ["annual_inc"]) # get y and X dataframes based on this regression:
y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')

Step 2: Calculate VIF Factors

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

Step 3: Inspect VIF Factors

vif.round(1)
  VIF Factor features
0 5.1 Intercept
1 1.0 dti
2 678.4 funded_amnt
3 678.4 loan_amnt

As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building or risk building a model with high multicolinearity.

https://study.163.com/course/courseMain.htm?courseId=1005988013&share=2&shareId=400000000398149

Variance Inflation Factor (VIF) 方差膨胀因子解释_附python脚本的更多相关文章

  1. 可决系数R^2和方差膨胀因子VIF

    然而很多时候,被筛选的特征在模型上线的预测效果并不理想,究其原因可能是由于特征筛选的偏差. 但还有一个显著的因素,就是选取特征之间之间可能存在高度的多重共线性,导致模型对测试集预测能力不佳. 为了在筛 ...

  2. GWAS: 曼哈顿图,QQ plot 图,膨胀系数( manhattan、Genomic Inflation Factor)

    画曼哈顿图和QQ plot 首推R包“qqman”,简约方便.下面具体介绍以下. 一.画曼哈顿图 install.packages("qqman") library(qqman) ...

  3. Java 序列化Serializable具体解释(附具体样例)

    Java 序列化Serializable具体解释(附具体样例) 1.什么是序列化和反序列化 Serialization(序列化)是一种将对象以一连串的字节描写叙述的过程:反序列化deserializa ...

  4. Cocos2d-x手机游戏开发与项目实践具体解释_随书代码

    Cocos2d-x手机游戏开发与项目实战具体解释_随书代码 作者:沈大海  因为原作者共享的资源为UTF-8字符编码.下载后解压在win下显示乱码或还出现文件不全问题,现完整整理,解决全部乱码问题,供 ...

  5. 杭电 2136 Largest prime factor(最大素数因子的位置)

    Largest prime factor Time Limit: 5000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Oth ...

  6. 斯坦福大学公开课机器学习: advice for applying machine learning | regularization and bais/variance(机器学习中方差和偏差如何相互影响、以及和算法的正则化之间的相互关系)

    算法正则化可以有效地防止过拟合, 但正则化跟算法的偏差和方差又有什么关系呢?下面主要讨论一下方差和偏差两者之间是如何相互影响的.以及和算法的正则化之间的相互关系 假如我们要对高阶的多项式进行拟合,为了 ...

  7. glibc中malloc的详细解释_转

    glibc中的malloc实现: The main properties of the algorithms are:* For large (>= 512 bytes) requests, i ...

  8. cmd /c和cmd /k 解释,附★CMD命令★ 大全

    cmd /c和cmd /k http://leaning.javaeye.com/blog/380810 java的Runtime.getRuntime().exec(commandStr)可以调用执 ...

  9. c语言_文件操作_FILE结构体解释_涉及对操作系统文件FCB操作的解释_

    1. 文件和流的关系 C将每个文件简单地作为顺序字节流(如下图).每个文件用文件结束符结束,或者在特定字节数的地方结束,这个特定的字节数可以存储在系统维护的管理数据结构中.当打开文件时,就建立了和文件 ...

随机推荐

  1. [dev][nginx] 在阅读nginx代码之前都需要准备什么

    前言 以前,我读过nginx的源码,甚至还改过.但是,现在回想起来几乎回想不起任何东西, 只记得到处都是回调和异步,我的vim+ctags索引起来十分吃力. 几乎没有任何收获,都是因为当时打开代码就看 ...

  2. SpringCloud_Eureka与Zookeeper对比

    关系型数据库与非关系型数据库及其特性: RDBMS(Relational Database Management System 关系型数据库) :mysql/oracle/sqlServer等   = ...

  3. SOA=SOME/IP?你低估了这件事 | 第二弹

    ​        哈喽,大家好,第二弹的时间到~上文书说到v-SOA可以通过SOC.SORS和SOS来分解落地,第一弹中已经聊了SOC的实现,这部分也是国内各大OEM正在经历的阶段,第二弹,我们继续聊 ...

  4. Kotlin使用处协变的意义与用法

    在上一次https://www.cnblogs.com/webor2006/p/11294849.html中对于Java的协变和Kotlin的协变提到了它们的区别,回忆一下: 其实在Kotlin中也有 ...

  5. 采集新浪新闻php插件

    今天没事,就分享一个采集新浪新闻PHP插件接口,可用于火车头采集,比较简单,大家可以研究! 新浪新闻实时动态列表为:https://news.sina.com.cn/roll/?qq-pf-to=pc ...

  6. FRCN文本检测(转)

    [源码分析]Text-Detection-with-FRCN 原创 2017年11月21日 17:58:39 标签: 659 编辑 删除 Text-Detection-with-FRCN项目是基于py ...

  7. asp.net+ tinymce粘贴word

    公司做的项目需要用到粘贴Word功能.就是将word内容一键粘贴到网页编辑器(在线富文本编辑器)中.Chrome+IE默认支持粘贴剪切板中的图片,但是我要粘贴的文章存在word里面,图片多达数十张,我 ...

  8. CF446C DZY Loves Fibonacci Numbers 线段树 + 数学

    有两个性质需要知道: $1.$ 对于任意的 $f[i]=f[i-1]+f[i-2]$ 的数列,都有 $f[i]=fib[i-2]\times f[1]+fib[i-1]\times f[2]$ 其中 ...

  9. 2017.10.4 国庆清北 D4T1 财富

    (其实这题是luogu P1901 发射站 原题,而且数据范围还比luogu小) 题目描述 LYK有n个小伙伴.每个小伙伴有一个身高hi. 这个游戏是这样的,LYK生活的环境是以身高为美的环境,因此在 ...

  10. Processing 字体变形

    在Processing中做字体变形通常需要有以下基础知识: 1.PGraphics对象 2.图片像素化 制作过程也不复杂,代码如下: color ELLIPSE_COLOR = color(0); c ...