The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (cvAlpha.glmnet) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.

The formula interface

The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:

  mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars)
  mtcarsMod

  ## Call:
  ## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars)
  ##
  ## Model fitting options:
  ##   Sparse model matrix: FALSE
  ##   Use model.frame: FALSE
  ##   Alpha: 1
  ##   Lambda summary:
  ##      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  ##   0.03326 0.11690 0.41000 1.02800 1.44100 5.05500

Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments that predict.glmnet needs.

  # least squares regression: get predictions for lambda=1
  predict(mtcarsMod, newdata=mtcars, s=1)

Building the model matrix

You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on wide datasets and/or those with many categorical (factor) variables.

The standard R method for creating a model matrix out of a data frame uses the model.frame function, which has a major disadvantage when it comes to wide data. It generates a terms object, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size p × p, where p is the number of variables in the model. When p > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.
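As a rough illustration of this cost (base R on synthetic data, not glmnetUtils code), you can build a terms object for a wide data frame and inspect its "factors" matrix, which has one row and column per variable:

  # Synthetic wide data: 10 rows, p predictors
  p <- 2000
  wide <- as.data.frame(matrix(rnorm(10 * p), nrow=10))
  wide$y <- rnorm(10)

  # The terms object stores a (roughly) p x p matrix, so it grows quadratically with p
  tt <- terms(y ~ ., data=wide)
  dim(attr(tt, "factors"))
  print(object.size(tt), units="Mb")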

Another issue with the standard approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N−1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.

To deal with these problems, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term by term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for every level of a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them towards zero is (usually) meaningful. Machine learners may also recognise this as one-hot encoding.
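To see the difference between the two encodings, here is a small base-R illustration (this is not glmnetUtils' internal code, just a comparison of the two contrasts):

  # Treatment contrasts vs one-hot encoding for a 3-level factor
  f <- factor(c("a", "b", "c", "a"))
  model.matrix(~ f)      # intercept plus 2 indicators; level "a" is the baseline
  model.matrix(~ 0 + f)  # 3 indicators, one per level (one-hot encoding)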

glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. This is typically the case when many of the predictors are factors, each with a large number of levels.
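As a rough sketch of the potential savings (synthetic data; the exact sizes will vary), compare the dense and sparse model matrices for a single factor with many levels:

  library(Matrix)
  # One factor with 26 levels: the model matrix is mostly zeroes,
  # so the sparse representation is far smaller than the dense one
  df <- data.frame(f = factor(sample(letters, 10000, replace=TRUE)))
  dense  <- model.matrix(~ 0 + f, data=df)
  sparse <- sparse.model.matrix(~ 0 + f, data=df)
  print(object.size(dense), units="Kb")
  print(object.size(sparse), units="Kb")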

Crossvalidation for α

One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how cv.glmnet chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the cvAlpha.glmnet function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:

  # Leukemia dataset from Trevor Hastie's website:
  # http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
  load("~/Leukemia.RData")
  leuk <- do.call(data.frame, Leukemia)

  cvAlpha.glmnet(y ~ ., data=leuk, family="binomial")

  ## Call:
  ## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
  ##
  ## Model fitting options:
  ##   Sparse model matrix: FALSE
  ##   Use model.frame: FALSE
  ##   Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
  ##   Number of crossvalidation folds for lambda: 10

cvAlpha.glmnet uses the algorithm described in the help for cv.glmnet, which is to fix the distribution of observations across folds and then call cv.glmnet in a loop with different values of α. Optionally, you can parallelise this outer loop by setting the outerParallel argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:

  • Via parLapply in the parallel package. To use this, set outerParallel to a valid cluster object created by makeCluster (see the sketch after this list).
  • Via rxExec as supplied by Microsoft R Server's RevoScaleR package. To use this, set outerParallel to a valid compute context created by RxComputeContext, or a character string specifying such a context.
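For example, a minimal sketch using a local cluster (assuming the leuk data frame from the example above):

  library(parallel)
  # Run the outer loop over alpha values on 4 worker processes
  cl <- makeCluster(4)
  cvAlpha.glmnet(y ~ ., data=leuk, family="binomial", outerParallel=cl)
  stopCluster(cl)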

Conclusion

The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:

  library(devtools)
  install_github("hong-revo/glmnetUtils")

A more detailed version of this post can also be found in the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi@microsoft.com.

Reposted from: http://blog.revolutionanalytics.com/2016/11/glmnetutils.html
