Python scikit-learn (metrics): difference between r2_score and explained_variance_score?
I noticed that 'r2_score' and 'explained_variance_score' are both built-in sklearn.metrics methods for regression problems.
I was always under the impression that r2_score is the percent variance explained by the model. How is it different from 'explained_variance_score'?
When would you choose one over the other?
Thanks!
OK, look at this example:
In [123]:
# imports and data
import numpy as np
from sklearn import metrics
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(metrics.explained_variance_score(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))
0.957173447537
0.948608137045
In [124]:
#what explained_variance_score really is
1-np.cov(np.array(y_true)-np.array(y_pred))/np.cov(y_true)
Out[124]:
0.95717344753747324
In [125]:
#what r^2 really is
1-((np.array(y_true)-np.array(y_pred))**2).sum()/(4*np.array(y_true).std()**2)
Out[125]:
0.94860813704496794
In [126]:
# Notice that the mean residual is not 0
(np.array(y_true)-np.array(y_pred)).mean()
Out[126]:
-0.25
In [127]:
# if the predicted values are changed such that the mean residual IS 0:
y_pred=[2.5, 0.0, 2, 7]
(np.array(y_true)-np.array(y_pred)).mean()
Out[127]:
0.0
In [128]:
# the two scores become identical
print(metrics.explained_variance_score(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))
0.982869379015
0.982869379015
So, when the mean residual is 0, the two scores are the same. Which one to choose depends on your needs, that is: is the mean residual supposed to be 0?
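To make the same point outside the transcript, here is a minimal, self-contained sketch (the constant bias values are illustrative and not from the original answer): adding a constant offset to the predictions leaves explained_variance_score unchanged, because the variance of the residuals does not move, while r2_score drops as the mean residual moves away from 0.

import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 7.0])    # predictions whose mean residual is 0

for bias in [0.0, 0.5, 1.0]:
    shifted = y_pred + bias                # shift every prediction by the same constant
    # the residual variance is unchanged, so explained_variance_score stays the same;
    # r2_score also penalizes the nonzero mean residual and goes down
    print(bias,
          explained_variance_score(y_true, shifted),
          r2_score(y_true, shifted))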
Most of the answers I found (including here) emphasize the difference between R2 and the Explained Variance Score, that is: the mean residual (i.e. the mean error).
However, there is an important question left behind: why on earth would I need to consider the mean error?
Refresher:
R2 is the coefficient of determination, which measures the amount of variation explained by the (least-squares) linear regression.
You can look at it from a different angle when evaluating the predicted values of y, like this:
Variance(y_actual) × R2 = Variance(y_predicted)
So intuitively, the closer R2 is to 1, the closer the variance (i.e. the spread) of y_predicted is to that of y_actual.
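That identity holds for an ordinary least-squares fit evaluated on its own training data. Here is a minimal sketch that checks it numerically; the data, noise level, and seed are made up for illustration and are not from the original answer:

import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy linear data

slope, intercept = np.polyfit(x, y, deg=1)                # ordinary least-squares fit
y_hat = slope * x + intercept

# for OLS on its own training data: Variance(y_hat) == R2 * Variance(y)
print(np.var(y_hat))
print(r2_score(y, y_hat) * np.var(y))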
As previously mentioned, the main difference is the Mean of Error; and if we look at the formulas, we find that's true:
R2 = 1 - [ (Sum of Squared Residuals / n) / Variance(y_actual) ]
Explained Variance Score = 1 - [ Variance(y_predicted - y_actual) / Variance(y_actual) ]
in which:
Variance(y_predicted - y_actual) = (Sum of Squared Residuals / n) - (Mean Error)²
So, obviously, the only difference is that the Explained Variance Score additionally subtracts the squared Mean Error! ... But why?
When we compare the R2 Score with the Explained Variance Score, we are basically checking the Mean Error; so if R2 = Explained Variance Score, that means: The Mean Error = Zero!
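Rearranging the two formulas above gives Explained Variance Score - R2 = (Mean Error)² / Variance(y_actual), which can be checked with the example data from the first answer (a quick sketch, not part of the original post):

import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

residual = y_true - y_pred
# the gap between the two scores is exactly the squared mean error,
# normalized by the (population) variance of y_true
print(explained_variance_score(y_true, y_pred) - r2_score(y_true, y_pred))
print(residual.mean() ** 2 / np.var(y_true))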
The Mean Error reflects the systematic tendency of our estimator, that is: biased vs. unbiased estimation.
In Summary:
If you want an unbiased estimator, so that your model is not systematically underestimating or overestimating, you may consider taking the Mean Error into account.