Feature selection is a process of extracting valuable features that have significant influence ondependent variable. This is still an active field of research and machine wandering. In this post I compare few feature selection algorithms: traditional GLM with regularization, computationally demanding Borutaand entropy based filter from FSelectorRcpp (free of Java/Weka) package. Check out the comparison onVenn Diagram carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I have a chance to use Boruta nad FSelectorRcpp in action. GLMnet is here only to improve Venn Diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes’ expressions (RNASeq) from human sequenced genome. Datasets with RNASeq are available viaRTCGA.rnaseq data package and originally were provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand of features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let’s use data for Breast Cancer (Breast invasive carcinoma / BRCA) where we will try to find valuable genes that have impact on dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information whether a sample of the collected readings came from tumor or normal, healthy tissue (14th character in the code).

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic Regression, a model from generalized linear models (GLM) family, a first attempt model for class prediction, can be extended with regularization net to provide prediction and variables selection at the same time. We can assume that not valuable features will appear with equal to zero coefficient in the final model with best regularization parameter. Broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
y = factor(BRCA.rnaseq[, 1]),
family = "binomial",
type.measure = "class",
parallel = TRUE) -> cvfit
# extract feature names that have
# non zero coefficiant
names(which(
coef(cvfit, s = "lambda.min")[, 1] != 0)
)[-1] -> glmnet.features
# first name is intercept

Function coef extracts coefficients for fitted model. Argument s specifies for which regularization parameter we would like to extract them - lamba.min is the parameter for which miss-classification error is minimal. You may also try to use lambda.1se.

plot(cvfit)

Discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization is problematic and is still a rapid field of research.

转自:http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms的更多相关文章

  1. [R] venn.diagram保存pdf格式文件?

    vennDiagram包中的主函数绘图时,好像不直接支持PDF格式文件: dat = list(a = group_out[[1]][,1],b = group_out[[2]][,1]) names ...

  2. VennDiagram 画文氏图/维恩图/Venn

    install.packages("VennDiagram")library(VennDiagram) A = 1:150B = c(121:170,300:320)C = c(2 ...

  3. R绘制韦恩图 | Venn图

    解决方案有好几种: 网页版,无脑绘图,就是麻烦,没有写代码方便 极简版,gplots::venn 文艺版,venneuler,不好安装rJava,参见Y叔 酷炫版,VennDiagram 特别注意: ...

  4. sql的各种join连接

    SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name id name id name -- ---- -- ---- ...

  5. .NET 框架(转自wiki)

    .NET Framework (pronounced dot net) is a software framework developed by Microsoft that runs primari ...

  6. Python画图笔记

    matplotlib的官方网址:http://matplotlib.org/ 问题 Python Matplotlib画图,在坐标轴.标题显示这五个字符 ⊥ + - ⊺ ⨁,并且保存后也能显示   h ...

  7. 哪些问题困扰着我们?DevOps 使用建议

    [编者按]随着 DevOps 被欲来越多机构采用,一些共性的问题也暴露出来.近日,Joe Yankel在「Devops Q&A: Frequently Asked Questions」一文中总 ...

  8. Transparency Tutorial with C# - Part 1

    Download demo project - 4 Kb Download source - 6 Kb Download demo project - 5 Kb Download source - 6 ...

  9. data mining,machine learning,AI,data science,data science,business analytics

    数据挖掘(data mining),机器学习(machine learning),和人工智能(AI)的区别是什么? 数据科学(data science)和商业分析(business analytics ...

随机推荐

  1. c#控制台实现post网站登录

    如题,首先我写了一个web页面来实现post登陆,只做演示,代码如下 public void ProcessRequest(HttpContext context) { context.Respons ...

  2. 表格组件神器:bootstrap table详细使用指南

    1.bootstrap-table简介 1.1.bootstrap table简介及特征: Bootstrap table是国人开发的一款基于 Bootstrap 的 jQuery 表格插件,通过简单 ...

  3. protected 学习

    virtual是把一个方法声明为虚方法,使派生类可重写此方法,一般建立的方法是不能够重写的,譬如类A中有个方法protected void method(){ 原代码....;}类B继承自类A,类B能 ...

  4. JS作用域相关知识(#精)

    在学习<你不知道的JS>一书中,特将作用域相关知识在此分享一下: #说到作用域,就不得不提到LHS查询和RHS查询: 1)如果查询目的是对变量进行赋值,则使用LHS查询 2)如果查询目的是 ...

  5. java 集合框架(List操作)

    /*list 基本操作 * * List a=new List(); * 增 * a.add(index,element);按指定位置添加,其余元素依次后移 * addAll(index,Collec ...

  6. rapidPHP 下载并安装

    安装 rapidPHP对运行环境的要求 php 5.4以上,包括5.4,支持php7,依赖包,php-curl,php-mysql,php-gd 官网下载 http://rapidphp.gx521. ...

  7. Centos 7 安装mysql后出现 ERROR 2002 (HY000)解决方案

    Centos 7 安装mysql后出现 ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib ...

  8. stl_config.h基本宏

    四.宏: (其实呢, 我们所有的宏都包含在了 "stl_config.h"头文件中.) //这些宏是怎么判断是否需要定义:是否有指定的宏,还有一些特定的编译器也可能支持. 4.1. ...

  9. 使用Java POI来选择提取Word文档中的表格信息

    通过使用Java POI来提取Word(1992)文档中的表格信息,其中POI支持不同的ms文档类型,在具体操作中需要注意.本文主要是通过POI来提取微软2003文档中的表格信息,具体code如下(事 ...

  10. Python中使用with语句同时打开多个文件

    下午小伙伴问了一个有趣的问题, 怎么用 Python 的 with 语句同时打开多个文件? 首先, Python 本身是支持同时在 with 中打开多个文件的 with open('a.txt', ' ...