Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms
Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine learning. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta, and an entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.
I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to improve the Venn diagram.
RTCGA data
The data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present gene expressions (RNASeq) from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It is a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let's use the data for breast cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample of the collected readings came from tumor or from normal, healthy tissue.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
  substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).
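As a small illustration of the barcode step above, here is a sketch on two made-up barcodes written in the TCGA format (they are illustrative and not taken from BRCA.rnaseq). It assumes the usual TCGA convention that the two-digit sample type starts at position 14, so a leading "0" marks a tumor sample and a leading "1" marks normal tissue; the label names are my own.

```r
# Illustrative barcodes in the TCGA format (assumed, not from BRCA.rnaseq)
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-A7-A0CE-11A-21R-A089-07")

# 14th character = first digit of the sample type code
code <- substr(barcodes, 14, 14)

# assumed convention: "0" -> tumor, "1" -> normal tissue
label <- ifelse(code == "0", "tumor", "normal")

data.frame(barcode = barcodes, code = code, label = label)
```

Running table() on the recoded column of the real data is a quick way to check how unbalanced the two classes are before fitting anything.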
Check another RTCGA use case: TCGA and The Curse of BigData.
GLMnet
Logistic regression, a model from the generalized linear models (GLM) family and a common first-attempt model for class prediction, can be extended with a regularization (elastic net) penalty to provide prediction and variable selection at the same time. We can assume that features that are not valuable will appear with a coefficient equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.
library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract feature names that have
# non-zero coefficient
# (the first name is the intercept, hence [-1])
names(which(
  coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
The function coef extracts coefficients from the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the misclassification error is minimal. You may also try lambda.1se.
plot(cvfit)
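To see how the choice of s affects the selected set, here is a minimal sketch on simulated data (not the RTCGA data). The gene names, the effect sizes and the helper selected() are assumptions made up for this illustration; only cv.glmnet and coef come from glmnet.

```r
library(glmnet)

set.seed(1)  # simulated data, for illustration only
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", 1:p)))
# assumption: only the first three hypothetical genes drive the class
y <- factor(rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3])))

cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# hypothetical helper: names of non-zero coefficients, intercept dropped by name
selected <- function(fit, s) {
  cf <- coef(fit, s = s)[, 1]
  setdiff(names(cf)[cf != 0], "(Intercept)")
}

sel.min <- selected(cvfit, "lambda.min")
sel.1se <- selected(cvfit, "lambda.1se")

# lambda.1se is never smaller than lambda.min, so it penalizes at least as hard
c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se)
c(n.min = length(sel.min), n.1se = length(sel.1se))
```

Dropping the intercept by name rather than by position is a small design choice: it keeps working even in the unlikely case that the intercept itself is zero. The stronger penalty at lambda.1se typically gives a shorter feature list, which may be preferable when the list feeds a comparison such as a Venn diagram.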
Discussion about standardization for LASSO can be found here. I normally don't do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and remain an active field of research.
Reposted from: http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html