Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")

library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
   coef(cvfit, s = "lambda.min")[, 1] != 0)
   )[-1] -> glmnet.features
# first name is intercept

Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.

plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.

转自：http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章

[R] venn.diagram保存pdf格式文件？
vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...
VennDiagram 画文氏图/维恩图/Venn
install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...
R绘制韦恩图 | Venn图
解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔酷炫版,VennDiagram 特别注意: ...
sql的各种join连接
SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...
.NET 框架（转自wiki）
.NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...
Python画图笔记
matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示 h ...
哪些问题困扰着我们？DevOps 使用建议
[编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...
Transparency Tutorial with C# - Part 1
Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...
data mining，machine learning，AI，data science，data science，business analytics
数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...

随机推荐

WebGL 创建和初始化着色器过程
1.编译GLSL ES代码,创建和初始化着色器供WebGL使用.这些过程一般分为7个步骤: 创建着色器对象(gl.createBuffer()); 向着色器对象中填充着色器程序的源代码(gl.shad ...
STM32固件库文件分析
STM32固件库文件分析 1.汇编编写的启动文件 startup/stm32f10x.hd.s:设置堆栈指针,设置pc指针,初始化中断向量,配置系统时钟,对用c库函数_main最后去c语言世界里. 2 ...
腾讯云上PhantomJS用法示例
崔庆才前言大家有没有发现之前我们写的爬虫都有一个共性,就是只能爬取单纯的html代码,如果页面是JS渲染的该怎么办呢?如果我们单纯去分析一个个后台的请求,手动去摸索JS渲染的到的一些结果,那简直没 ...
LINUX ON AZURE 安全建议（全）
本文为个人原创,可以自由转载,转载请注明出处,多谢! 本文地址:http://www.cnblogs.com/taosha/p/6399554.html 1.网络与安全规划 Azure 虚拟网络 (V ...
TaintDroid简介
1.Information-Flow tracking,Realtime Privacy Monitoring.信息流动追踪,实时动态监控. 2.TaintDroid是一个全系统动态污点跟踪和分析系统 ...
手机QQ无法临时会话的解决方案
手机网页发起临时会话: <a href="mqqwpa://im/chat?chat_type=wpa&uin=3355135984&version=1& ...
java面试题（二）
21.描述一下JVM加载class文件的原理机制? 答:JVM中类的装载是由类加载器(ClassLoader)和它的子类来实现的,Java中的类加载器是一个重要的Java运行时系统组件,它负责在运行时 ...
HTML在网页中插入音频视频简单的滚动效果
每次上网,打开网页后大家都会看到在网页的标签栏会有个属于他们官网的logo,现在学了HTML了,怎么不会制作这个小logo呢,其实很简单,也不需要死记硬背,每当这行代码出现的时候能知道这是什么意思就o ...
java异常处理机制(try-catch-finally)
/* * 异常处理机制 * 1.分类:Error和Exception * Error错误是JVM自动报错的,程序员无法解决例如开数组过大int a[]=new int [1024*1024*1024] ...
Spring事务执行过程
先说一下启动过程中的几个点: 加载配置文件: AbstractAutowireCapableBeanFactory.doCreateBean --> initializeBean --> ...

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

RTCGA data

GLMnet

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章

随机推荐

热门专题