Predict Referendum by sklearn package
Background
In the last post we covered Python basics in Chinese. Today we will do some data analysis with Python and explain it in English (no zuo, no die). In this section, we discuss prominent hypotheses that have been proposed to explain the EU referendum result and how we try to capture them in our empirical analysis.
In this post, we use the referendum data to illustrate regularized linear regression and overfitting.
Overfitting: the model is already so complex that it fits the idiosyncrasies of our training data, idiosyncrasies which limit the model's ability to generalize (as measured by the testing error).
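To make the definition concrete, here is a minimal sketch in plain Python. It uses made-up numbers, not the referendum data, and the helper names (`fit_line`, `fit_interpolant`, `mse`) are hypothetical. A straight line and a polynomial that passes through every training point are fitted to the same noisy sample; the polynomial reaches zero training error but does worse on the unseen points.

```python
# Toy illustration of overfitting (made-up data, not the referendum dataset).
# True relationship: y = 2x; the training targets carry a little noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6]
train_y = [2 * x + e for x, e in zip(train_x, noise)]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2 * x for x in test_x]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a simple model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolant(xs, ys):
    """Degree n-1 polynomial through every training point (a very complex model)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            term = yj
            for k, xk in enumerate(xs):
                if k != j:
                    term *= (x - xk) / (xj - xk)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(train_x, train_y)
poly = fit_interpolant(train_x, train_y)
for name, model in [("line", line), ("polynomial", poly)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 4),
          "test MSE:", round(mse(model, test_x, test_y), 4))
```

The gap between training and testing error of the polynomial is the symptom we look for later when we compare OLS and Lasso.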
We will look at four broad groups of variables:
- EU exposure through immigration, trade and structural funds;
- local public service provision and fiscal consolidation;
- demography and education;
- economic structure, wages and unemployment.
Let's take a brief look at the data
We can use the pandas package to load many kinds of data files, including Stata files (ending with ".dta").
from pandas.io.stata import StataReader
import pandas as pd
stata_data = StataReader("referendum data.dta", convert_categoricals=False)
data = stata_data.read()  # basic data
varlist = stata_data.varlist  # variable list
value_labels = stata_data.value_labels()  # labels/descriptions of the data values
fmtlist = stata_data.fmtlist
variable_labels = stata_data.variable_labels()  # labels/descriptions of the variables
var = list(variable_labels)
var_label = list(variable_labels.values())
df_labels = pd.DataFrame({"variable": var, "variable_label": var_label})  # a DataFrame gives a formatted table of the variable labels
The variable labels tell us what each variable means. They include: Total votes of remain/leave, Region of the votes, Population 60 older growth (2001-2011), Population 60 older (2001), Median hourly pay (2005), Median hourly pay change (2005-2015), Non-EU migrant resident share (2001), Non-EU migrant resident growth (2001-2011), Change in low skilled labour force share (2001-2011), Unemployment rate (2015), Total economy EU dependence (2010), Total fiscal cuts (2010-2015), EU Structural Funds per capita (2013) and so on.
Variable selection analysis
Dependent variable (DV): we choose 'Pct_Remain' as our DV, since it determines whether an area voted to remain or to leave.
Independent variables (IVs): we choose 'Region', 'pensionergrowth20012011', 'ResidentAge60plusshare',
'median_hourly_pay2005', 'median_hourly_pay_growth', 'NONEU_2001Migrantshare', 'NONEU_Migrantgrowth',
'unqualifiedsharechange', 'umemployment_rate_aps', 'Total_EconomyEU_dependence', 'TotalImpactFLWAAYR', 'eufundspercapitapayment13'.
Four types of IV:
- EU exposure through immigration ('NONEU_2001Migrantshare', 'NONEU_Migrantgrowth'), trade and structural funds ('eufundspercapitapayment13');
- fiscal consolidation ('TotalImpactFLWAAYR');
- demography ('Region', 'pensionergrowth20012011', 'ResidentAge60plusshare', 'unqualifiedsharechange');
- economic structure ('Total_EconomyEU_dependence'), wages ('median_hourly_pay2005', 'median_hourly_pay_growth') and unemployment ('umemployment_rate_aps').
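One way to keep this grouping next to the code is a small dictionary. The sketch below is only an illustration: the group keys (`eu_exposure`, `economy`, ...) are names made up here, while the column names are the ones listed above.

```python
# Hypothetical grouping of the selected columns, following the four categories above.
iv_groups = {
    "eu_exposure": ["NONEU_2001Migrantshare", "NONEU_Migrantgrowth",
                    "eufundspercapitapayment13"],
    "fiscal_consolidation": ["TotalImpactFLWAAYR"],
    "demography": ["Region", "pensionergrowth20012011",
                   "ResidentAge60plusshare", "unqualifiedsharechange"],
    "economy": ["Total_EconomyEU_dependence", "median_hourly_pay2005",
                "median_hourly_pay_growth", "umemployment_rate_aps"],
}

# Flatten the groups into one list of columns, and build a reverse lookup.
selected = [col for cols in iv_groups.values() for col in cols]
column_to_group = {col: group for group, cols in iv_groups.items() for col in cols}

print(len(selected), "columns in", len(iv_groups), "groups")
print(column_to_group["TotalImpactFLWAAYR"])
```

The reverse lookup is handy later when reading the coefficient table, because each coefficient can be traced back to its hypothesis group.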
IVs = ['Region', 'pensionergrowth20012011', 'ResidentAge60plusshare', 'median_hourly_pay2005', 'median_hourly_pay_growth', 'NONEU_2001Migrantshare', 'NONEU_Migrantgrowth', 'unqualifiedsharechange', 'umemployment_rate_aps', 'Total_EconomyEU_dependence', 'TotalImpactFLWAAYR', 'eufundspercapitapayment13']
df1 = pd.read_stata("referendum data.dta")  # load the Stata data directly into a DataFrame (df1)
df1 = df1.set_index("id")
df1.shape
(382, 109)
We can see the data has many attributes. This causes a problem for our model: overfitting and poor generalization (to new, unseen cases). We will address this problem below.
drop_list = list(set(var) - set(IVs) - {"id", "Pct_Remain"})  # every column except the selected attributes
df = df1.drop(drop_list, axis=1)
df = df.dropna()
df.describe()  # the variables have different magnitudes (some below 1, some above 100), and "Region" holds strings
Encode the region
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
# Next, we use LabelEncoder().fit_transform to turn the nominal variable Region into integer codes
from sklearn.preprocessing import LabelEncoder
df['Region'] = LabelEncoder().fit_transform(df['Region'])
df.describe()
Y = df['Pct_Remain']  # the DV, kept as a 1-D Series so that coefficients and predictions stay 1-D
x = df.drop('Pct_Remain', axis=1)  # the IVs
x.head()  # view x
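As a rough picture of what the encoding step does, the sketch below maps each distinct label to an integer in sorted order, which is how `LabelEncoder` assigns its codes. The region names are made-up sample values and `label_encode` / `one_hot` are hypothetical helpers. Note that integer codes give a linear model an artificial ordering of the regions; one-hot columns are a common alternative, shown only for comparison.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


def one_hot(values):
    """One 0/1 column per distinct label, so no ordering is implied."""
    classes = sorted(set(values))
    return [[1 if v == c else 0 for c in classes] for v in values], classes


regions = ["Wales", "London", "Scotland", "London"]  # made-up sample values
codes, classes = label_encode(regions)
rows, _ = one_hot(regions)

print(classes)  # ['London', 'Scotland', 'Wales']
print(codes)    # [2, 0, 1, 0]
print(rows[0])  # [0, 0, 1]
```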
Standardise the X variables
# The magnitudes of our variables differ, so we need to standardise the X variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(x)
scaled_x = scaler.transform(x)
x_scaled_df = pd.DataFrame(scaled_x, columns=x.columns)
x_scaled_df.head()
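The transformation itself is simple: each column has its mean subtracted and is divided by its standard deviation, using the population standard deviation (dividing by n), as `StandardScaler` does. A minimal sketch with made-up columns on very different scales (the names `pay` and `share` are only for illustration):

```python
import statistics


def standardise(column):
    """Return (v - mean) / std for every value, with the population std."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std: divide by n, not n - 1
    return [(v - mean) / std for v in column]


pay = [8.0, 10.0, 12.0, 14.0]     # made-up values around 10
share = [0.01, 0.02, 0.03, 0.04]  # made-up values around 0.01

z_pay = standardise(pay)
z_share = standardise(share)
print([round(v, 3) for v in z_pay])
print([round(v, 3) for v in z_share])
```

Both columns end up with mean 0 and standard deviation 1, so a penalty on coefficient size treats them on an equal footing.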
Split data into train set and test set
from sklearn.model_selection import train_test_split
seed = 2019
test_size = 0.20
X_train, X_test, Y_train, Y_test = train_test_split(x_scaled_df, Y, test_size=test_size, random_state=seed)
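Conceptually, the split shuffles the row positions with a fixed seed and sets aside a share of them for testing. The sketch below shows that idea in plain Python; `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects, because the two use different random number generators.

```python
import math
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row positions reproducibly and reserve a share of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


# 367 is the number of complete observations mentioned in this post.
train_idx, test_idx = split_indices(n_rows=367, test_size=0.20, seed=2019)
print(len(train_idx), len(test_idx))  # 293 74
```

Fixing the seed matters here: the OLS and Lasso models are only comparable if they are evaluated on the same held-out rows.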
Train Model
# Import a range of sklearn functions
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, RidgeCV, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectFromModel
Fit an OLS linear model to x and Y
# Fit an OLS model to the data and view the coefficients
# (the features are already standardised, and the `normalize` argument has been removed from recent scikit-learn releases)
reg_OLS = LinearRegression().fit(X_train, Y_train)
coef = pd.Series(reg_OLS.coef_.ravel(), index=x_scaled_df.columns)
print(coef)
Fit a Lasso model - A Regularization Technique
For comparison, we also fit a Lasso model. Lasso adds an L1 penalty on the coefficients, weighted by the parameter alpha, to counter overfitting.
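The effect of alpha is easiest to see with a single feature. Assuming the scikit-learn form of the objective, (1 / (2n)) * sum((y - x*w)^2) + alpha * |w|, and a centred feature whose mean square is 1, the Lasso coefficient is the OLS coefficient shrunk towards zero by alpha, and set exactly to zero once alpha exceeds it. The sketch below uses made-up numbers; `soft_threshold` and `lasso_one_feature` are hypothetical helpers, not scikit-learn functions.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; return 0 if it does not survive the shrinkage."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_one_feature(x, y, alpha):
    """Lasso coefficient for one centred feature with mean square 1 and a centred y."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n  # this is the OLS coefficient here
    return soft_threshold(rho, alpha)


x = [-1.0, 1.0, -1.0, 1.0]  # mean 0, mean square 1
y = [-0.7, 0.9, -0.9, 0.7]  # centred response (made up)
for alpha in [0.0, 0.3, 1.0]:
    print("alpha =", alpha, "coefficient =", round(lasso_one_feature(x, y, alpha), 3))
```

With these numbers the OLS coefficient is about 0.8; alpha = 0.3 shrinks it to about 0.5, and alpha = 1.0 removes the feature altogether. This is why Lasso can also act as a variable selector.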
LassoCV chooses the alpha parameter by cross-validation. Try running this now and printing out the coefficients.
reg_LassoCV = LassoCV()  # the features are already standardised, so no `normalize` argument is needed
reg_LassoCV.fit(X_train, Y_train.values.ravel())
print("Best alpha using built-in LassoCV: %f" % reg_LassoCV.alpha_)
print("Training R^2 using built-in LassoCV: %f" % reg_LassoCV.score(X_train, Y_train))
coef = pd.Series(reg_LassoCV.coef_, index=x_scaled_df.columns)
print(coef)
%matplotlib inline
import matplotlib.pyplot as plt# let's see the coefficient in graph
imp_coef = coef.sort_values()
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
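How does cross-validation pick alpha? The sketch below is a simplified stand-in, not the actual LassoCV implementation: it splits the rows into k folds, fits a one-feature Lasso on the other folds for each candidate alpha, and keeps the alpha with the lowest average validation error. The data are made up, and all function names are hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; folds are contiguous blocks, no shuffling."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, val
        start += size


def fit_lasso_1d(x, y, alpha):
    """One-feature Lasso without intercept: a soft-thresholded least-squares slope."""
    n = len(x)
    rho = sum(a * b for a, b in zip(x, y)) / n
    scale = sum(a * a for a in x) / n
    shrunk = max(abs(rho) - alpha, 0.0)
    return (shrunk if rho > 0 else -shrunk) / scale


def cv_mse(x, y, alpha, k):
    """Average validation MSE of the one-feature Lasso over k folds."""
    errors = []
    for train, val in k_fold_indices(len(x), k):
        w = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], alpha)
        errors.append(sum((y[i] - w * x[i]) ** 2 for i in val) / len(val))
    return sum(errors) / len(errors)


# Made-up, roughly centred data with slope close to 1.
x = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5, -1.2, 0.3, 0.9, -0.4]
y = [-1.4, -1.1, -0.3, 0.6, 0.8, 1.6, -1.0, 0.2, 1.1, -0.5]
alphas = [0.0, 0.05, 0.2, 1.0]

scores = {alpha: cv_mse(x, y, alpha, k=5) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
for alpha in alphas:
    print("alpha =", alpha, "CV MSE =", round(scores[alpha], 4))
print("chosen alpha:", best_alpha)
```

Because the signal in this toy sample is strong, a heavy penalty such as alpha = 1.0 shrinks the slope almost to zero and scores poorly. On data with many weak predictors the balance tips the other way.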
View accuracy of the model
Here we assume that an area voted to leave when "Pct_Remain" (the percentage of voters who chose remain) is below 50, and to remain otherwise.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
def predict(model, X_test=X_test, Y_test=Y_test):
    """
    Input: trained model, test IVs (X) and test DV (Y)
    Output: accuracy of the model (under the assumption above)
    """
    # np.ravel gives a flat array of values whether Y_test is a Series or a one-column DataFrame
    Y_test_0_1 = [0 if i < 50 else 1 for i in np.ravel(Y_test)]
    y_pred = np.ravel(model.predict(X_test))
    predictions = [0 if i < 50 else 1 for i in y_pred]
    accuracy = accuracy_score(Y_test_0_1, predictions)
    confusion_matrix1 = confusion_matrix(Y_test_0_1, predictions)
    print(model)
    print("Confusion matrix:\n" + str(confusion_matrix1))
    return "Accuracy: %.2f%%" % (accuracy * 100.0)
predict(reg_OLS, X_test, Y_test)
predict(reg_LassoCV, X_test, Y_test)
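To see what the accuracy figure does and does not capture, here is a small worked example with made-up percentages (plain Python, hypothetical helper names). The confusion table follows the same layout as `confusion_matrix`: rows are actual classes, columns are predicted classes. The root mean squared error is printed alongside, as one possible complement, because thresholding at 50 hides how far off a predicted percentage is.

```python
import math


def to_class(pct_remain, threshold=50):
    """0 = leave (below the threshold), 1 = remain (at or above it)."""
    return 0 if pct_remain < threshold else 1


def confusion_counts(actual, predicted):
    """2x2 table: rows = actual class, columns = predicted class."""
    table = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table


actual_pct = [38.0, 49.0, 51.0, 62.0, 45.0]     # made-up remain percentages
predicted_pct = [41.0, 52.0, 50.5, 70.0, 30.0]  # made-up model outputs

actual = [to_class(v) for v in actual_pct]
predicted = [to_class(v) for v in predicted_pct]
table = confusion_counts(actual, predicted)
accuracy = (table[0][0] + table[1][1]) / len(actual)
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual_pct, predicted_pct)) / len(actual_pct))

print(table)               # [[2, 1], [0, 2]]
print(accuracy)            # 0.8
print(round(rmse, 3))
```

In this example the last area is predicted 15 points too low yet still counts as correct, while the second is only 3 points off and counts as wrong.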
We can see that Lasso gives the same result as the OLS model, mainly because we hand-picked only 12 IVs (far fewer than the 367 observations) based on observation and experience. Thus, the trained model does not suffer from overfitting.
What if we choose all variables?
df1_all = df1.dropna()
Y_all = df1_all["Pct_Remain"]
X_all = df1_all.drop("Pct_Remain", axis=1)
for i in X_all.columns:
    X_all[i] = LabelEncoder().fit_transform(X_all[i])  # label-encode every column (numeric columns become integer rank codes too)
scaler = StandardScaler().fit(X_all)
scaled_x = scaler.transform(X_all)
x_all_scaled_df = pd.DataFrame(scaled_x, columns=X_all.columns)  # standardise X
seed = 2019
test_size = 0.20
X_train_all, X_test_all, Y_train_all, Y_test_all = train_test_split(x_all_scaled_df, Y_all, test_size=test_size, random_state=seed)
# `normalize` has been removed from recent scikit-learn releases; the features are already standardised
reg_OLS_all = LinearRegression().fit(X_train_all, Y_train_all)
reg_LassoCV_all = LassoCV()
reg_LassoCV_all.fit(X_train_all, Y_train_all)
predict(reg_OLS_all, X_test_all, Y_test_all)
predict(reg_LassoCV_all, X_test_all, Y_test_all)
We can see that Lasso gives better prediction accuracy than the OLS model, mainly because we now use all 108 IVs (close to 140, the number of training observations). Thus, the trained OLS model suffers from overfitting, and Lasso handles it well.
Conclusion
- When there are many attributes, be on the alert: a traditional regression model may run into overfitting. (Other models such as Decision Tree, Random Forest, Gradient Boosting and so on can often cope with this more easily.)
- We have to strike a balance between variance and (inductive) bias: our model needs sufficient complexity to capture the relationship between the predictors and the response, but it must not fit the idiosyncrasies of our training data, which limit the model's ability to generalize (as measured by the testing error).
Furthermore
We can try various other models (such as XGBoost, Support Vector Machine, Random Forest, ...) on this dataset. Let's leave that as this blog's homework.