State of Hyperparameter Selection
Historically, hyperparameter selection has been a woefully neglected aspect of machine learning. With the rise of neural nets - which require more hyperparameters, tuned more precisely, than many other models - there has been a recent surge of interest in intelligent methods for selection; however, the average practitioner still commonly relies on default hyperparameters, grid search, random search, or (believe it or not) manual search.
For readers who don’t know, hyperparameter selection boils down to a conceptually simple problem: you have a set of variables (your hyperparameters) and an objective function (a measure of how good your model is). As you add hyperparameters, the search space of this problem explodes.
Grid Search is (often) Stupid
One method for finding optimal hyperparameters is grid search: divide the space into even increments and test every combination exhaustively.
When presented with the above plot, a human would instinctively detect the pattern and, if looking for the lowest point, could make an intelligent guess about where to begin. Most would not choose to evenly divide the space into a grid and test every point, yet this is precisely what grid search does. Looking at that image, humans have a fundamental intuition that there are areas where the minimum is more likely to be. By exhaustively searching the space, you waste time on areas where the function is obviously (barring an improbable fluke) not going to be at its minimum, and you ignore the information from the points you have already evaluated.
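To make the cost concrete, here is a minimal grid-search sketch in Python (the toy dataset, model, and grid are illustrative stand-ins, not anything from the plot above): every cell of the grid gets trained and scored, no matter what earlier evaluations already told us.

```python
import itertools

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data and model standing in for an expensive training run.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = {
    "C": np.logspace(-3, 3, 7),   # 7 evenly spaced values on a log scale
    "penalty": ["l1", "l2"],
}

best_score, best_params = -np.inf, None
for C, penalty in itertools.product(grid["C"], grid["penalty"]):
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    score = cross_val_score(model, X, y, cv=3).mean()   # every cell gets evaluated
    if score > best_score:
        best_score, best_params = score, {"C": C, "penalty": penalty}

print(best_params, best_score)
```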
Random Search Isn’t Much Better
The next most common method is random search, which is exactly what it sounds like. Given the same plot, I doubt anybody would decide to pick points at random. Random search is not entirely stupid - anybody who has studied statistics knows the power of randomness from techniques like bootstrapping or Monte Carlo methods. In fact, randomly picking parameters often outperforms grid search. Random hyperparameters can find points that grid search would either skip over (if the grid is too coarse) or cost a tremendous amount to reach (as the grid becomes finer). It can similarly outperform manual search in some situations: a human will generally focus on the domain where they have seen the lowest points, whereas random search finds new domains neglected by intuition.
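For comparison, here is the same toy setup as the grid-search sketch above with the hyperparameters drawn at random under the same evaluation budget (again, the dataset and search ranges are just illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best_score, best_params = -np.inf, None
for _ in range(14):                          # same budget as the 7 x 2 grid above
    C = 10 ** rng.uniform(-3, 3)             # draw C uniformly on a log scale
    penalty = ("l1", "l2")[rng.integers(2)]  # draw the penalty at random
    model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, {"C": C, "penalty": penalty}

print(best_params, best_score)
```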
A Different Kind of Optimization
Outside of cases where finding the absolute global minimum is required and an exhaustive search is necessary, grid search is really only reasonable when evaluating the objective function is so cheap that its cost is effectively non-existent - and even then, the only excuse for using it is laziness (i.e. it costs more to implement and run a sophisticated method than to perform the grid search). In those cases grid search is preferred to random search because, with a fine enough grid, you are guaranteed to find a near-optimal point, whereas random search offers no such guarantee.
But what if evaluating every point is too costly? Training models is often expensive, both in time and computational power, and this expense skyrockets with increased data and model complexity. What we need is an intelligent way to traverse our parameter space while searching for a minimum.
Upon first impression, this might seem like an easy problem. Finding the minimum of an objective function is pretty much all we ever do in machine learning, right? Well, here’s the rub: you don’t know this function. Fitting a regression, you choose your cost function. Training a neural net, you choose your activation function. If you know what these functions are, then you know their derivatives (assuming you picked differentiable functions); this means you know which direction points “down”. This knowledge is fundamental to most optimization techniques, like stochastic gradient descent or momentum.
There are other techniques, like binary search or golden-section search, which don’t require the derivative directly but do require the knowledge that your objective is unimodal - any local, non-global minimum can make such a search entirely ineffective. Yet other optimization methods (simulated annealing, coordinate descent) do not depend on any knowledge of the underlying function, but they require a large number of samples from the objective function to find an approximate minimum.
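As a quick illustration of that unimodality assumption, here is a minimal derivative-free golden-section search over a made-up one-dimensional objective; with a second, non-global dip inside the interval it can happily converge to the wrong basin.

```python
import math

def golden_section_minimize(f, a, b, tol=1e-5):
    """Find the minimizer of a *unimodal* f on [a, b] without derivatives."""
    inv_phi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b, d = d, c                        # minimum lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                        # minimum lies in [c, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Works on a unimodal toy objective...
print(golden_section_minimize(lambda x: (x - 2) ** 2, 0, 5))   # ~2.0
# ...but add a second, non-global dip and it may lock onto the wrong basin.
```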
So the question is: what do you do when the cost of evaluation is high? How do we intelligently guess where the minima are likely to be in a high-dimensional space? We need a method which doesn’t waste time on points with no expected return, but also won’t get caught in local minima.
Being Smart with Hyperparameters
So now that we know how bad grid search and random search are, the question is: how can we do better? Here we discuss one technique for intelligent hyperparameter search, known as Bayesian optimization. We will now cover the concept of how this technique can be used to traverse hyperparameter space; there is an associated IPython Notebook which further illustrates the practice.
Bayesian Optimization on a Conceptual Level
The basic idea is this: you assume there is a smooth response surface (i.e. the curve of your objective function) connecting all of your known hyperparameter-objective function points, and then you model all of the possible surfaces. Averaging over these surfaces, we get an expected objective function value for any given point, together with an associated variance (in more mathematical terms, we are modeling our response as a Gaussian Process). The point where the lower edge of this variance envelope dips lowest is the point of highest ‘expected improvement’. This is a black-box model; we need no actual knowledge of the underlying process producing the response curve.
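To make that concrete, here is a minimal sketch of the loop using scikit-learn’s Gaussian process regressor and a hand-rolled expected-improvement rule. The toy objective, kernel choice, and candidate grid are all illustrative assumptions - this is not MOE’s or Spearmint’s actual machinery.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def objective(x):
    # Stand-in for an expensive model-training run (unknown to the optimizer).
    return np.sin(3 * x) + 0.3 * x ** 2

def expected_improvement(candidates, gp, best_y):
    # EI for minimization: how much each candidate is expected to improve on best_y.
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X_seen = rng.uniform(-2, 2, size=3)           # a few initial (random) evaluations
y_seen = objective(X_seen)

kernel = ConstantKernel(1.0) * RBF(length_scale=0.5)
for _ in range(10):                           # 10 more expensive evaluations, chosen greedily
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(
        X_seen.reshape(-1, 1), y_seen)
    candidates = np.linspace(-2, 2, 400)
    ei = expected_improvement(candidates, gp, y_seen.min())
    x_next = candidates[np.argmax(ei)]        # point of maximum expected improvement
    X_seen = np.append(X_seen, x_next)
    y_seen = np.append(y_seen, objective(x_next))

print("best x:", X_seen[np.argmin(y_seen)], "best value:", y_seen.min())
```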
This concept is illustrated clearly in Yelp’s MOE documentation for the case of a single hyperparameter. On top, you see our modeled objective function response surface when we have no previous data points. It is flat, as we don’t yet know anything, and the variance (the shaded region) is flat as well. In the bottom plot we see the Expected Improvement across the space (driven by how far the variance dips). It is also flat, so our point of highest Expected Improvement is effectively random.
Next we acquire a single data point, effectively pinning our expectation and collapsing its variance around that point. Everywhere else the objective function remains unknown, but we are modeling the response surface as smooth. Additionally, you can see how the measurement variance of our sample point can easily be incorporated into this model - the variance simply isn’t pinched quite as tightly.
We can see that the objective function value for our acquired point is high (which for this example we will say is undesirable). We pick our next point to sample as far from here as possible.
We’ve now ‘pinched’ our response curve on both sides, and begin to explore the middle. However, since the lower hyperparameter value produced the lower objective value, we favor the lower end of the range. The red line above shows the point of maximum expected improvement, i.e. our next point to sample.
Now that we’ve pinched the middle of the curve, we have a choice to make - exploration or exploitation. You can see that this trade-off is made automatically in our model - the point where the modeled variance dips lowest is our point of highest expected improvement (the one-dimensional example isn’t ideal for illustrating this, but you can imagine, in more dimensions, having large unexplored domains and needing to balance exploiting better points near the low points you already have against exploring those unknown domains).
If you have the capability to carry out multiple evaluations of the response curve in parallel (i.e. can train multiple models at once), a simple approach for sampling multiple points is to assume each pending point takes on its expected response value, fold that assumption into the model, and choose the next point accordingly. When the actual values come back, you replace the assumed values, update your model, and keep sampling.
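A rough sketch of that idea - sometimes called the ‘constant liar’ or ‘kriging believer’ heuristic - reusing the expected_improvement function and GP setup from the sketch above (the structure here is an illustrative assumption, not how MOE parallelizes internally):

```python
def propose_batch(gp, X_seen, y_seen, candidates, kernel, batch_size=3):
    """Pick several points to evaluate in parallel by temporarily 'believing'
    the GP's own prediction at each proposed point."""
    X_fake, y_fake = X_seen.copy(), y_seen.copy()
    batch = []
    for _ in range(batch_size):
        ei = expected_improvement(candidates, gp, y_fake.min())
        x_next = candidates[np.argmax(ei)]
        batch.append(x_next)
        # Pretend the outcome equals the GP's mean prediction, then refit.
        y_pred = gp.predict(np.array([[x_next]]))[0]
        X_fake = np.append(X_fake, x_next)
        y_fake = np.append(y_fake, y_pred)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(
            X_fake.reshape(-1, 1), y_fake)
    return batch
```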
Our hyper-hyperparameter, the variance of the Gaussian process, is actually very important in determining exploration vs. exploitation. Below you can see two examples of identical expected response surfaces that differ only in the variance magnitude (1 on the left, 5 on the right), which give different next points to sample (note that the scale of the y-axis has changed). The greater the variance is set, the more the model favors exploration; the lower it is set, the more the model favors exploitation.
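In terms of the earlier scikit-learn sketch, this knob roughly corresponds to the signal-variance term of the kernel (an assumption about how the idea maps onto that library, not MOE’s own parameterization):

```python
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Larger signal variance -> wider uncertainty away from sampled points -> more exploration.
exploit_kernel = ConstantKernel(1.0, constant_value_bounds="fixed") * RBF(length_scale=0.5)
explore_kernel = ConstantKernel(5.0, constant_value_bounds="fixed") * RBF(length_scale=0.5)
```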
With Bayesian optimization, in the worst case (when you have no history) you get random search. As you gain information, it becomes less random and more intelligent, picking the points where the greatest improvement is expected and trading off between refining minima around previously sampled points and exploring new domains.
There are two prominent open-source packages which implement Bayesian optimization: the above-mentioned MOE (Metric Optimization Engine, produced by Yelp and the source of all of the pretty pictures featured above) and Spearmint (from the Harvard research group HIPS). These packages are so easy to use (see the attached IPython Notebook) that there’s practically no reason not to use them for every hyperparameter search you perform (the argument that they take computing power to run is valid; however, that cost is usually negligible compared to the cost of training almost any non-toy model).
So don’t waste your time looking in places which won’t yield results. And don’t search randomly when you can search intelligently.
A Note on Overfitting
As always, when you tweak based on a function of your data, there is a danger of overfitting. The two easiest ways to avoid overfitting your hyperparameters are to tune them against an in-sample metric or to keep a third data split for final validation. Additionally, regularization can always be used to bound the potential complexity of your model.
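For example, a minimal three-way split might look like this (split sizes and variable names are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% train / 20% validation / 20% test: tune hyperparameters against the
# validation split, then report performance once on the untouched test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```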
Footnote: Animated plots of MOE exploring various objective functions to find the minimum
- "Learning to Interact" by John Langford. Excellent summary of contextual bandits and exploration http://t.co/9vupPzvM82 #machinelearning
- State of Hyperparameter Selection by @danielsaltiel http://t.co/dR4gNQ2RIJ #machinelearning http://t.co/oxgRfIZKZu
- Intro to Deep Learning with Theano and OpenDeep by @mbeissinger https://t.co/NNxytB4niK #deeplearning #machinelearning @OpenDeep
- Great introduction to Convolutional Neural Networks by Aysegul Dundar https://t.co/knO3eEOidS #deeplearning #machinelearning
- RT @Orange_SV: Join us & our panel of experts on 6/10 as we explore #blockchain & its uses. http://t.co/OYwQkPiAaX #cryptocurrencyhttp://t.co/td48brTICt
- Over 92% of Stack Overflow questions about expert topics are answered — in a median time of 11 minutes http://t.co/PQYjDiht2K #datascience
- Excited to participate in "Putting the Blockchain to Work" event by @Orange_SV on 6/10 https://t.co/o2ksOK0V1i #blockchain#machinelearning
- Come to our first-ever Open House Jun 18. Meet our fellows & join #MachineLearning discussion http://t.co/hSmEeq5IJi http://t.co/aYEDCn3dF0
- My @Quora answer to What are applications of data science and machine learning in the media & entertainment industry?http://t.co/QMjNe4bB1H
State of Hyperparameter Selection的更多相关文章
- ICLR 2013 International Conference on Learning Representations深度学习论文papers
ICLR 2013 International Conference on Learning Representations May 02 - 04, 2013, Scottsdale, Arizon ...
- [amazonaccess 1]logistic.py 特征提取
---恢复内容开始--- 本文件对应logistic.py amazonaccess介绍: 根据入职员工的定位(员工角色代码.角色所属家族代码等特征)判断员工是否有访问某资源的权限 logistic. ...
- ViewPageAsImage
var ViewPageAsImage = function(target, label) { var setting = { min_height: 4, min_width: 4 ...
- BAT 前端开发面试 —— 吐血总结
更好阅读,请移步这里 聊之前 最近暑期实习招聘已经开始,个人目前参加了腾讯和阿里的内推及百度的实习生招聘,在此总结一下 一是备忘.总结提升,二是希望给大家一些参考 其他面试及基础相关可以参考其他博文: ...
- Why validation set ?
Let's assume that you are training a model whose performance depends on a set of hyperparameters. In ...
- AutoML相关论文
本文为Awesome-AutoML-Papers的译文. 1.AutoML简介 Machine Learning几年来取得的不少可观的成绩,越来越多的学科都依赖于它.然而,这些成果都很大程度上取决于人 ...
- Extjs学习笔记--(一vs增加extjs智能感知)
1,编写class.js var classList=[ "Ext.layout.container.Absolute", "Ext.layout.container.A ...
- BAT 前端开发面经 —— 吐血总结
更好阅读,请移步这里 聊之前 最近暑期实习招聘已经开始,个人目前参加了阿里的内推及腾讯和百度的实习生招聘,在此总结一下 一是备忘.总结提升,二是希望给大家一些参考 其他面试及基础相关可以参考其他博文: ...
- BAT 前端开发面经 —— 吐血总结 前端相关片段整理——持续更新 前端基础精简总结 Web Storage You don't know js
BAT 前端开发面经 —— 吐血总结 目录 1. Tencent 2. 阿里 3. 百度 更好阅读,请移步这里 聊之前 最近暑期实习招聘已经开始,个人目前参加了阿里的内推及腾讯和百度的实习生招聘, ...
随机推荐
- HTML特殊字符编码大全
HTML特殊字符编码大全:往网页中输入特殊字符,需在html代码中加入以&开头的字母组合或以&#开头的数字.下面就是以字母或数字表示的 特殊符号大全 ´ © © > > µ ...
- (转)国内外三个不同领域巨头分享的Redis实战经验及使用场景
随着应用对高性能需求的增加,NoSQL逐渐在各大名企的系统架构中生根发芽.这里我们将为大家分享社交巨头新浪微博.传媒巨头Viacom及图片分享领域佼佼者Pinterest带来的Redis实践,首先我们 ...
- 关于document.write
document.write的用处 document.write是JavaScript中对document.open所开启的文档流(document stream操作的API方法,它能够直接在文档流中 ...
- think in java 第四版读书笔记 第一章对象导论
很久没有碰过java了,为了项目需要以及以后找工作,还是有必要将think in java通读一遍.欢迎大家一起讨论学习 1.1抽象过程 面向对象语言的5个特性: 1.万物皆对象 任何事物都可以抽象为 ...
- 郑州轻工业OJ1400--这不可能是情书吧
地址:http://acm.zzuli.edu.cn/problem.php?id=1400 #include<stdio.h> #include<string.h> #inc ...
- 洛谷 P1209 修理牛棚== Codevs 2079 修理牛棚
时间限制: 1 s 空间限制: 128000 KB 题目等级 : 黄金 Gold 题目描述 Description 在一个夜黑风高,下着暴风雨的夜晚,farmer John的牛棚的屋顶.门被吹 ...
- Windows Phone 7 中拷贝文件到独立存储
private void CopyToIsolatedStorage(){ using (IsolatedStorageFile storage = IsolatedStorageFile.Ge ...
- background-clip 背景图片做适当的裁剪
background-clip 用来将背景图片做适当的裁剪以适应实际需要. 语法: background-clip : border-box | padding-box | content-box | ...
- Javascript this 解析
Javascript中,this是一个非常有用的关键字, this是在运行时基于函数的运行环境绑定的,但是,如果使用的时候不注意,很容易就出错了. ECMAScript Standard对this的定 ...
- Shell脚本升级CentOS php版本v
#! /bin/sh #1.关闭selinuxcp -rp /etc/selinux/config /etc/selinux/config.baksetenforce 0sed -i '7s/enfo ...