Jul 10, 2009; 10:46pm

predict.glm -> which class does it predict?

2 posts
Hi,

I have a question about logistic regression in R.

Suppose I have a small list of proteins P1, P2, P3 that predict a 
two-class target T, say cancer/noncancer. Lets further say I know that I 
can build a simple logistic regression model in R

model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
the Proteins).

This works fine. T is a factored vector with levels cancer, noncancer. 
Proteins are numeric.

Now, I want to use predict.glm to predict a new data.

predict(model, newdata=testsamples, type="response")    (testsamples is 
a small set of new samples).

The result is a vector of the probabilites for each sample in 
testsamples. But probabilty WHAT for? To belong to the first level in T? 
To belong to second level in T?

Is this fallowing expression 
factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
TRUE, when the new sample is classified to Cancer or when it's 
classified to Noncancer? And why not the other way around?

Thank you,

Peter

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 10, 2009; 11:37pm

Re: predict.glm -> which class does it predict?

1330 posts
On Jul 10, 2009, at 9:46 AM, Peter Schüffler wrote:

> Hi, 

> I have a question about logistic regression in R. 

> Suppose I have a small list of proteins P1, P2, P3 that predict a   
> two-class target T, say cancer/noncancer. Lets further say I know   
> that I can build a simple logistic regression model in R 

> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the   
> dataset of the Proteins). 

> This works fine. T is a factored vector with levels cancer,   
> noncancer. Proteins are numeric. 

> Now, I want to use predict.glm to predict a new data. 

> predict(model, newdata=testsamples, type="response")    (testsamples   
> is a small set of new samples). 

> The result is a vector of the probabilites for each sample in   
> testsamples. But probabilty WHAT for? To belong to the first level   
> in T? To belong to second level in T? 

> Is this fallowing expression 
> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
> TRUE, when the new sample is classified to Cancer or when it's   
> classified to Noncancer? And why not the other way around? 

> Thank you, 

> Peter

As per the Details section of ?glm:

A typical predictor has the form response ~ terms where response is   
the (numeric) response vector and terms is a series of terms which   
specifies a linear predictor forresponse. ***For binomial and   
quasibinomial families the response can also be specified as a factor   
(when the first level denotes failure and all others success)*** or as   
a two-column matrix with the columns giving the numbers of successes   
and failures. A terms specification of the form first + second   
indicates all the terms in first together with all the terms in second   
with any duplicates removed.

So, given your description above, you are predicting   
"noncancer"...that is, you are predicting the probability of the   
second level of the factor ("success"), given the covariates.

If you want to predict "cancer", alter the factor levels thusly:

T <- factor(T, levels = c("noncancer", "cancer"))

By default, R will alpha sort the factor levels, so "cancer" would be   
first.

Think of it in terms of using a 0,1 integer code for absence,presence,   
where you are predicting the probability of a '1', or the presence of   
the event or characteristic of interest.

BTW, using 'T' as the name of the response vector is not a good habit:

> T 
[1] TRUE

'T' is shorthand for the built in R constant TRUE. R is generally   
smart enough to know the difference, but it is better to avoid getting   
into trouble by not using it.

HTH,

Marc Schwartz

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 10, 2009; 11:48pm

Re: predict.glm -> which class does it predict?

2360 posts
In reply to this post by Peter Schüffler-2
Peter Schüffler wrote:

> Hi, 

> I have a question about logistic regression in R. 

> Suppose I have a small list of proteins P1, P2, P3 that predict a 
> two-class target T, say cancer/noncancer. Lets further say I know that I 
> can build a simple logistic regression model in R 

> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
> the Proteins). 

> This works fine. T is a factored vector with levels cancer, noncancer. 
> Proteins are numeric. 

> Now, I want to use predict.glm to predict a new data. 

> predict(model, newdata=testsamples, type="response")    (testsamples is 
> a small set of new samples). 

> The result is a vector of the probabilites for each sample in 
> testsamples. But probabilty WHAT for? To belong to the first level in T? 
> To belong to second level in T? 

> Is this fallowing expression 
> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
> TRUE, when the new sample is classified to Cancer or when it's 
> classified to Noncancer? And why not the other way around?

It's the probability of the 2nd level of a factor response (termed 
"success" in the documentation, even when your modeling the probability 
of disease or death...), just like when interpreting the logistic 
regression itself.

I find it easiest to sort ut this kind of issue by experimentation in 
simplified situations. E.g.

> x <- sample(c("A","B"),10,replace=TRUE) 
 > x 
  [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A" 
 > table(x) 

A B 
4 6

(notice that the relative frequency of B is 0.6)

> glm(x~1,binomial) 
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1 
In addition: Warning message: 
In model.matrix.default(mt, mf, contrasts) : 
   variable 'x' converted to a factor

(OK, so it won't go without conversion to factor. This is a good thing.)

> glm(factor(x)~1,binomial)

Call:  glm(formula = factor(x) ~ 1, family = binomial)

Coefficients: 
(Intercept) 
      0.4055

Degrees of Freedom: 9 Total (i.e. Null);  9 Residual 
Null Deviance:    13.46 
Residual Deviance: 13.46 AIC: 15.46

(The intercept is positive, corresponding to log odds for a probability 
 > 0.5 ; i.e.,  must be that "B": 0.4055==log(6/4))

> predict(glm(factor(x)~1,binomial)) 
         1         2         3         4         5         6         7 
        8 
0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 
0.4054651 
         9        10 
0.4054651 0.4054651 
 > predict(glm(factor(x)~1,binomial),type="response") 
   1   2   3   4   5   6   7   8   9  10 
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6

As for why it's not the other way around, well, if it had been, then you 
could have asked the same question....

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 11, 2009; 1:27am

Re: predict.glm -> which class does it predict?

7686 posts
2009/7/10 Peter Dalgaard <[hidden email]>:

> Peter Schüffler wrote: 
>> 
>> Hi, 
>> 
>> I have a question about logistic regression in R. 
>> 
>> Suppose I have a small list of proteins P1, P2, P3 that predict a 
>> two-class target T, say cancer/noncancer. Lets further say I know that I can 
>> build a simple logistic regression model in R 
>> 
>> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
>> the Proteins). 
>> 
>> This works fine. T is a factored vector with levels cancer, noncancer. 
>> Proteins are numeric. 
>> 
>> Now, I want to use predict.glm to predict a new data. 
>> 
>> predict(model, newdata=testsamples, type="response")    (testsamples is a 
>> small set of new samples). 
>> 
>> The result is a vector of the probabilites for each sample in testsamples. 
>> But probabilty WHAT for? To belong to the first level in T? To belong to 
>> second level in T? 
>> 
>> Is this fallowing expression 
>> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
>> TRUE, when the new sample is classified to Cancer or when it's classified 
>> to Noncancer? And why not the other way around? 

> It's the probability of the 2nd level of a factor response (termed "success" 
> in the documentation, even when your modeling the probability of disease or 
> death...), just like when interpreting the logistic regression itself. 

> I find it easiest to sort ut this kind of issue by experimentation in 
> simplified situations. E.g. 

>> x <- sample(c("A","B"),10,replace=TRUE) 
>> x 
>  [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A" 
>> table(x) 
> x 
> A B 
> 4 6 

> (notice that the relative frequency of B is 0.6) 

>> glm(x~1,binomial) 
> Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1 
> In addition: Warning message: 
> In model.matrix.default(mt, mf, contrasts) : 
>  variable 'x' converted to a factor 

> (OK, so it won't go without conversion to factor. This is a good thing.) 

>> glm(factor(x)~1,binomial) 

> Call:  glm(formula = factor(x) ~ 1, family = binomial) 

> Coefficients: 
> (Intercept) 
>     0.4055 

> Degrees of Freedom: 9 Total (i.e. Null);  9 Residual 
> Null Deviance:      13.46 
> Residual Deviance: 13.46        AIC: 15.46 

> (The intercept is positive, corresponding to log odds for a probability > 
> 0.5 ; i.e.,  must be that "B": 0.4055==log(6/4)) 

>> predict(glm(factor(x)~1,binomial)) 
>        1         2         3         4         5         6         7       8 
> 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 
> 0.4054651 
>        9        10 
> 0.4054651 0.4054651 
>> predict(glm(factor(x)~1,binomial),type="response") 
>  1   2   3   4   5   6   7   8   9  10 
> 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 

> As for why it's not the other way around, well, if it had been, then you 
> could have asked the same question.... 
>

Or more specifically:

> resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer")) 
> mod <- glm(resp ~ 1, family = binomial) 
> predict(mod, type = "response") 
   1    2    3    4 
0.75 0.75 0.75 0.75

and since noncancer occurs 75% of the time in the sample clearly 
its predicting the probability of noncancer.

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 11, 2009; 2:10am

Re: predict.glm -> which class does it predict?

2360 posts
In reply to this post by Peter Dalgaard
> As for why it's not the other way around, well, if it had been, then you 
> could have asked the same question....

...and come to think about it, it is rather convenient that it meshes 
with the default ordering of levels in factor(x) is x is 0/1 or FALSE/TRUE.

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

predict.glm -> which class does it predict?的更多相关文章

  1. CF451C Predict Outcome of the Game 水题

    Codeforces Round #258 (Div. 2) Predict Outcome of the Game C. Predict Outcome of the Game time limit ...

  2. tflearn tensorflow LSTM predict sin function

    from __future__ import division, print_function, absolute_import import tflearn import numpy as np i ...

  3. 如何在R语言中使用Logistic回归模型

    在日常学习或工作中经常会使用线性回归模型对某一事物进行预测,例如预测房价.身高.GDP.学生成绩等,发现这些被预测的变量都属于连续型变量.然而有些情况下,被预测变量可能是二元变量,即成功或失败.流失或 ...

  4. 简单介绍一下R中的几种统计分布及常用模型

    统计学上分布有很多,在R中基本都有描述.因能力有限,我们就挑选几个常用的.比较重要的简单介绍一下每种分布的定义,公式,以及在R中的展示. 统计分布每一种分布有四个函数:d――density(密度函数) ...

  5. Machine Learning for hackers读书笔记(六)正则化:文本回归

    data<-'F:\\learning\\ML_for_Hackers\\ML_for_Hackers-master\\06-Regularization\\data\\' ranks < ...

  6. 统计学习导论:基于R应用——第五章习题

    第五章习题 1. 我们主要用到下面三个公式: 根据上述公式,我们将式子化简为 对求导即可得到得到公式5-6. 2. (a) 1 - 1/n (b) 自助法是有有放回的,所以第二个的概率还是1 - 1/ ...

  7. 统计学习导论:基于R应用——第四章习题

    第四章习题,部分题目未给出答案 1. 这个题比较简单,有高中生推导水平的应该不难. 2~3证明题,略 4. (a) 这个问题问我略困惑,答案怎么直接写出来了,难道不是10%么 (b) 这个答案是(0. ...

  8. R与数据分析旧笔记(⑨)广义线性回归模型

    广义线性回归模型 广义线性回归模型 例题1 R.Norell实验 为研究高压电线对牲畜的影响,R.Norell研究小的电流对农场动物的影响.他在实验中,选择了7头,6种电击强度, 0,1,2,3,4, ...

  9. logistic回归和probit回归预测公司被ST的概率(应用)

    1.适合阅读人群: 知道以下知识点:盒状图.假设检验.逻辑回归的理论.probit的理论.看过回归分析,了解AIC和BIC判别准则.能自己跑R语言程序 2.本文目的:用R语言演示一个相对完整的逻辑回归 ...

随机推荐

  1. [原][译][osgearth]样式表style中参数总结(OE官方文档翻译)

    几何Geometry 高度Altitude 挤压Extrusion 图标Icon 模型Model 渲染Render 皮肤Skin 文本Text 覆盖Coverage 提示: 在SDK中,样式表的命名空 ...

  2. 这些HTML、CSS知识点,面试和平时开发都需要 No10-No11(知识点:表格操作、代码编写规则)

    系列知识点汇总 1.基础篇 这些HTML.CSS知识点,面试和平时开发都需要 No1-No4(知识点:HTML.CSS.盒子模型.内容布局) 这些HTML.CSS知识点,面试和平时开发都需要 No5- ...

  3. 转载:几种 hive join 类型简介

    作为数据分析中经常进行的join 操作,传统DBMS 数据库已经将各种算法优化到了极致,而对于hadoop 使用的mapreduce 所进行的join 操作,去年开始也是有各种不同的算法论文出现,讨论 ...

  4. checkbox选中的问题(Ajax.BeginForm)

    判断checkbox选中的方法方法一:if ($("#checkbox-id")get(0).checked) { // do something} 方法二:if($('#chec ...

  5. Double H2.0

    Double H2.0 https://www.cnblogs.com/wxh9494/p/9879442.html 选题报告 一.项目描述(Project Description) 本项目提供一个公 ...

  6. Ubuntu 16.04系统开机紫屏的解决办法

    具体症状为卡在开机界面,按任何键都无反应. 网上查看了几篇文章 ,如下: 解决:ubuntu16.04启动时长时间停留在紫屏或跳文本的黑屏界面 Ubuntu16.04显卡驱动 电源管理 里面提到的开机 ...

  7. STL标准库-容器-set与multiset

    技术在于交流.沟通,转载请注明出处并保持作品的完整性. set与multiset关联容器 结构如下 set是一种关联容器,key即value,value即key.它是自动排序,排序特点依据key se ...

  8. 记录一些js框架用途

    accounting.min.js 货币格式化alertify.min.js 提示信息库amd.loader.js 按需动态加载js文件angular-cookies.js 处理cookieangul ...

  9. C# 调用C++ DLL 的类型转换(转载版)

    最近在做视频监控相关的demo开发,实现语言是C#,但视频监控的SDK是C++开发的,所以涉及到C#调用C++的dll库.很多结构体.参数在使用时都要先进行转换,由非托管类型转换成托管类型后才能使用. ...

  10. js实现trim()方法

    在面向对象编程里面去除字符串左右空格是很容易的事,可以使用trim().ltrim() 或 rtrim(),在jquery里面使用$.trim()也可以轻松的实现.但是在js中却没有这个方法.下面的实 ...