Jul 10, 2009; 10:46pm

predict.glm -> which class does it predict?

2 posts
Hi,

I have a question about logistic regression in R.

Suppose I have a small list of proteins P1, P2, P3 that predict a 
two-class target T, say cancer/noncancer. Lets further say I know that I 
can build a simple logistic regression model in R

model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
the Proteins).

This works fine. T is a factored vector with levels cancer, noncancer. 
Proteins are numeric.

Now, I want to use predict.glm to predict a new data.

predict(model, newdata=testsamples, type="response")    (testsamples is 
a small set of new samples).

The result is a vector of the probabilites for each sample in 
testsamples. But probabilty WHAT for? To belong to the first level in T? 
To belong to second level in T?

Is this fallowing expression 
factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
TRUE, when the new sample is classified to Cancer or when it's 
classified to Noncancer? And why not the other way around?

Thank you,

Peter

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 10, 2009; 11:37pm

Re: predict.glm -> which class does it predict?

1330 posts
On Jul 10, 2009, at 9:46 AM, Peter Schüffler wrote:

> Hi, 

> I have a question about logistic regression in R. 

> Suppose I have a small list of proteins P1, P2, P3 that predict a   
> two-class target T, say cancer/noncancer. Lets further say I know   
> that I can build a simple logistic regression model in R 

> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the   
> dataset of the Proteins). 

> This works fine. T is a factored vector with levels cancer,   
> noncancer. Proteins are numeric. 

> Now, I want to use predict.glm to predict a new data. 

> predict(model, newdata=testsamples, type="response")    (testsamples   
> is a small set of new samples). 

> The result is a vector of the probabilites for each sample in   
> testsamples. But probabilty WHAT for? To belong to the first level   
> in T? To belong to second level in T? 

> Is this fallowing expression 
> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
> TRUE, when the new sample is classified to Cancer or when it's   
> classified to Noncancer? And why not the other way around? 

> Thank you, 

> Peter

As per the Details section of ?glm:

A typical predictor has the form response ~ terms where response is   
the (numeric) response vector and terms is a series of terms which   
specifies a linear predictor forresponse. ***For binomial and   
quasibinomial families the response can also be specified as a factor   
(when the first level denotes failure and all others success)*** or as   
a two-column matrix with the columns giving the numbers of successes   
and failures. A terms specification of the form first + second   
indicates all the terms in first together with all the terms in second   
with any duplicates removed.

So, given your description above, you are predicting   
"noncancer"...that is, you are predicting the probability of the   
second level of the factor ("success"), given the covariates.

If you want to predict "cancer", alter the factor levels thusly:

T <- factor(T, levels = c("noncancer", "cancer"))

By default, R will alpha sort the factor levels, so "cancer" would be   
first.

Think of it in terms of using a 0,1 integer code for absence,presence,   
where you are predicting the probability of a '1', or the presence of   
the event or characteristic of interest.

BTW, using 'T' as the name of the response vector is not a good habit:

> T 
[1] TRUE

'T' is shorthand for the built in R constant TRUE. R is generally   
smart enough to know the difference, but it is better to avoid getting   
into trouble by not using it.

HTH,

Marc Schwartz

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 10, 2009; 11:48pm

Re: predict.glm -> which class does it predict?

2360 posts
In reply to this post by Peter Schüffler-2
Peter Schüffler wrote:

> Hi, 

> I have a question about logistic regression in R. 

> Suppose I have a small list of proteins P1, P2, P3 that predict a 
> two-class target T, say cancer/noncancer. Lets further say I know that I 
> can build a simple logistic regression model in R 

> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
> the Proteins). 

> This works fine. T is a factored vector with levels cancer, noncancer. 
> Proteins are numeric. 

> Now, I want to use predict.glm to predict a new data. 

> predict(model, newdata=testsamples, type="response")    (testsamples is 
> a small set of new samples). 

> The result is a vector of the probabilites for each sample in 
> testsamples. But probabilty WHAT for? To belong to the first level in T? 
> To belong to second level in T? 

> Is this fallowing expression 
> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
> TRUE, when the new sample is classified to Cancer or when it's 
> classified to Noncancer? And why not the other way around?

It's the probability of the 2nd level of a factor response (termed 
"success" in the documentation, even when your modeling the probability 
of disease or death...), just like when interpreting the logistic 
regression itself.

I find it easiest to sort ut this kind of issue by experimentation in 
simplified situations. E.g.

> x <- sample(c("A","B"),10,replace=TRUE) 
 > x 
  [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A" 
 > table(x) 

A B 
4 6

(notice that the relative frequency of B is 0.6)

> glm(x~1,binomial) 
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1 
In addition: Warning message: 
In model.matrix.default(mt, mf, contrasts) : 
   variable 'x' converted to a factor

(OK, so it won't go without conversion to factor. This is a good thing.)

> glm(factor(x)~1,binomial)

Call:  glm(formula = factor(x) ~ 1, family = binomial)

Coefficients: 
(Intercept) 
      0.4055

Degrees of Freedom: 9 Total (i.e. Null);  9 Residual 
Null Deviance:    13.46 
Residual Deviance: 13.46 AIC: 15.46

(The intercept is positive, corresponding to log odds for a probability 
 > 0.5 ; i.e.,  must be that "B": 0.4055==log(6/4))

> predict(glm(factor(x)~1,binomial)) 
         1         2         3         4         5         6         7 
        8 
0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 
0.4054651 
         9        10 
0.4054651 0.4054651 
 > predict(glm(factor(x)~1,binomial),type="response") 
   1   2   3   4   5   6   7   8   9  10 
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6

As for why it's not the other way around, well, if it had been, then you 
could have asked the same question....

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 11, 2009; 1:27am

Re: predict.glm -> which class does it predict?

7686 posts
2009/7/10 Peter Dalgaard <[hidden email]>:

> Peter Schüffler wrote: 
>> 
>> Hi, 
>> 
>> I have a question about logistic regression in R. 
>> 
>> Suppose I have a small list of proteins P1, P2, P3 that predict a 
>> two-class target T, say cancer/noncancer. Lets further say I know that I can 
>> build a simple logistic regression model in R 
>> 
>> model <- glm(T ~ ., data=d.f(Y), family=binomial)   (Y is the dataset of 
>> the Proteins). 
>> 
>> This works fine. T is a factored vector with levels cancer, noncancer. 
>> Proteins are numeric. 
>> 
>> Now, I want to use predict.glm to predict a new data. 
>> 
>> predict(model, newdata=testsamples, type="response")    (testsamples is a 
>> small set of new samples). 
>> 
>> The result is a vector of the probabilites for each sample in testsamples. 
>> But probabilty WHAT for? To belong to the first level in T? To belong to 
>> second level in T? 
>> 
>> Is this fallowing expression 
>> factor(predict(model, newdata=testsamples, type="response") >= 0.5) 
>> TRUE, when the new sample is classified to Cancer or when it's classified 
>> to Noncancer? And why not the other way around? 

> It's the probability of the 2nd level of a factor response (termed "success" 
> in the documentation, even when your modeling the probability of disease or 
> death...), just like when interpreting the logistic regression itself. 

> I find it easiest to sort ut this kind of issue by experimentation in 
> simplified situations. E.g. 

>> x <- sample(c("A","B"),10,replace=TRUE) 
>> x 
>  [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A" 
>> table(x) 
> x 
> A B 
> 4 6 

> (notice that the relative frequency of B is 0.6) 

>> glm(x~1,binomial) 
> Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1 
> In addition: Warning message: 
> In model.matrix.default(mt, mf, contrasts) : 
>  variable 'x' converted to a factor 

> (OK, so it won't go without conversion to factor. This is a good thing.) 

>> glm(factor(x)~1,binomial) 

> Call:  glm(formula = factor(x) ~ 1, family = binomial) 

> Coefficients: 
> (Intercept) 
>     0.4055 

> Degrees of Freedom: 9 Total (i.e. Null);  9 Residual 
> Null Deviance:      13.46 
> Residual Deviance: 13.46        AIC: 15.46 

> (The intercept is positive, corresponding to log odds for a probability > 
> 0.5 ; i.e.,  must be that "B": 0.4055==log(6/4)) 

>> predict(glm(factor(x)~1,binomial)) 
>        1         2         3         4         5         6         7       8 
> 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 
> 0.4054651 
>        9        10 
> 0.4054651 0.4054651 
>> predict(glm(factor(x)~1,binomial),type="response") 
>  1   2   3   4   5   6   7   8   9  10 
> 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 

> As for why it's not the other way around, well, if it had been, then you 
> could have asked the same question.... 
>

Or more specifically:

> resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer")) 
> mod <- glm(resp ~ 1, family = binomial) 
> predict(mod, type = "response") 
   1    2    3    4 
0.75 0.75 0.75 0.75

and since noncancer occurs 75% of the time in the sample clearly 
its predicting the probability of noncancer.

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

Reply | Threaded | More 
Jul 11, 2009; 2:10am

Re: predict.glm -> which class does it predict?

2360 posts
In reply to this post by Peter Dalgaard
> As for why it's not the other way around, well, if it had been, then you 
> could have asked the same question....

...and come to think about it, it is rather convenient that it meshes 
with the default ordering of levels in factor(x) is x is 0/1 or FALSE/TRUE.

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B 
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K 
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918 
~~~~~~~~~~ - ([hidden email])              FAX: (+45) 35327907

______________________________________________ 
[hidden email] mailing list 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. 

predict.glm -> which class does it predict?的更多相关文章

  1. CF451C Predict Outcome of the Game 水题

    Codeforces Round #258 (Div. 2) Predict Outcome of the Game C. Predict Outcome of the Game time limit ...

  2. tflearn tensorflow LSTM predict sin function

    from __future__ import division, print_function, absolute_import import tflearn import numpy as np i ...

  3. 如何在R语言中使用Logistic回归模型

    在日常学习或工作中经常会使用线性回归模型对某一事物进行预测,例如预测房价.身高.GDP.学生成绩等,发现这些被预测的变量都属于连续型变量.然而有些情况下,被预测变量可能是二元变量,即成功或失败.流失或 ...

  4. 简单介绍一下R中的几种统计分布及常用模型

    统计学上分布有很多,在R中基本都有描述.因能力有限,我们就挑选几个常用的.比较重要的简单介绍一下每种分布的定义,公式,以及在R中的展示. 统计分布每一种分布有四个函数:d――density(密度函数) ...

  5. Machine Learning for hackers读书笔记(六)正则化:文本回归

    data<-'F:\\learning\\ML_for_Hackers\\ML_for_Hackers-master\\06-Regularization\\data\\' ranks < ...

  6. 统计学习导论:基于R应用——第五章习题

    第五章习题 1. 我们主要用到下面三个公式: 根据上述公式,我们将式子化简为 对求导即可得到得到公式5-6. 2. (a) 1 - 1/n (b) 自助法是有有放回的,所以第二个的概率还是1 - 1/ ...

  7. 统计学习导论:基于R应用——第四章习题

    第四章习题,部分题目未给出答案 1. 这个题比较简单,有高中生推导水平的应该不难. 2~3证明题,略 4. (a) 这个问题问我略困惑,答案怎么直接写出来了,难道不是10%么 (b) 这个答案是(0. ...

  8. R与数据分析旧笔记(⑨)广义线性回归模型

    广义线性回归模型 广义线性回归模型 例题1 R.Norell实验 为研究高压电线对牲畜的影响,R.Norell研究小的电流对农场动物的影响.他在实验中,选择了7头,6种电击强度, 0,1,2,3,4, ...

  9. logistic回归和probit回归预测公司被ST的概率(应用)

    1.适合阅读人群: 知道以下知识点:盒状图.假设检验.逻辑回归的理论.probit的理论.看过回归分析,了解AIC和BIC判别准则.能自己跑R语言程序 2.本文目的:用R语言演示一个相对完整的逻辑回归 ...

随机推荐

  1. DCOS(centos 7.4/7.6)

    https://dcos.io/releases/ https://downloads.dcos.io/dcos/stable/1.12.0/dcos_generate_config.sh https ...

  2. js 文件系统API操作示例

    最近有个需求是:自动抓取某网站登录页面的验证码图片并保存,抓取n次.使用chrome插件来实现,其中使用到了js操作文件系统的api,特将代码记录下来,以备查阅. PS:第一次使用js文件系统的api ...

  3. vue/webpack的一些小技巧

    都知道我比较懒,今天给大家分享的就是如何让自己省事. 一.vue修改打包后的结构(不知道怎么描述合理,看效果图) /config/index.js 默认的: 修改的:(顺手修改了打包后的文件名) 这样 ...

  4. UVA-11761-马尔可夫/记忆化搜索

    https://vjudge.net/problem/UVA-11762 给出一个整数n,每次随机挑选一个小于等于n的素数,如果是n的因子,n变为n/x ,否则不变,问n变为1的期望挑选次数. f[i ...

  5. 002——php字符串中的处理函数(一)

    <?php /** * 字符串处理函数: * 一.PHP处理字符串的空格: * strlen 显示字符串长度 * * trim 对字符串左右空格删除: * ltrim 对字符串左侧空格删除 * ...

  6. MongoDB驱动程序快速入门

    http://mongodb.github.io/mongo-java-driver/3.6/driver/getting-started/quick-start/

  7. CentOS7 Could not retrieve mirrorlist http://mirrorlist.centos.org/?...

    在执行命令 sudo yum clean expire-cache 清理完过期的缓存后,再执行yum install 或 update命令都失败了.原因是清理过期缓存结果不该被清理的也删掉了,可能是y ...

  8. Hadoop的概念、版本、发展史

    Hadoop是什么?Hadoop的起源Hadoop发展史Hadoop的四大特性(优点)Hadoop的版本如何选择Hadoop版本 Hadoop是什么? Hadoop: 适合大数据的分布式存储和计算平台 ...

  9. Microsoft Office 2010 Service Pack 2 发布更新

    Microsoft Office 2010 32 位版本 的 Service Pack 2 (SP2) 包含的新更新可以提高安全性.性能和稳定性.此外,SP 是以前发布的所有更新的汇总. 下载更新补丁 ...

  10. 用于调试的printf函数和自定义log函数

    1. 用宏定义调试用的DPRINT #define DEBUG_ENABLE #ifdef DEBUG_ENABLE #define DPRINT(fmt, args...) fprintf(stde ...