Stochastic Optimization Techniques

Neural networks are often trained stochastically, i.e. using a method where the objective function changes at each iteration. This stochastic variation is due to the model being trained on different data during each iteration. This is motivated by (at least) two factors: First, the dataset used as training data is often too large to fit in memory and/or be optimized over efficiently. Second, the objective function is typically nonconvex, so using different data at each iteration can help prevent the model from settling in a local minimum. Furthermore, training neural networks is usually done using only the first-order gradient of the loss function with respect to the parameters. This is due to the large number of parameters present in a neural network, which for practical purposes prevents the computation of the Hessian matrix. Because vanilla gradient descent can diverge or converge incredibly slowly if its learning rate hyperparameter is set inappropriately, many alternative methods have been proposed which are intended to produce desirable convergence with less dependence on hyperparameter settings. These methods often effectively compute and utilize a preconditioner on the gradient, adaptively change the learning rate over time, or approximate the Hessian matrix.

In the following, we will use $\theta_t$ to denote some generic parameter of the model at iteration $t$, to be optimized according to some loss function $\mathcal{L}$ which is to be minimized.

Stochastic Gradient Descent

Stochastic gradient descent (SGD) simply updates each parameter by subtracting the gradient of the loss with respect to the parameter, scaled by the learning rate $\eta$, a hyperparameter. If $\eta$ is too large, SGD will diverge; if it's too small, it will converge slowly. The update rule is simply $$ \theta_{t + 1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) $$
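As a minimal sketch in NumPy-style Python (the function name `sgd_step` and the `grad_fn` callback returning $\nabla \mathcal{L}(\theta_t)$ on the current minibatch are illustrative assumptions, not part of the original text or any particular library):

```python
def sgd_step(theta, grad_fn, eta=0.01):
    """One SGD update: theta_{t+1} = theta_t - eta * grad_L(theta_t)."""
    return theta - eta * grad_fn(theta)
```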

Momentum

In SGD, the gradient $\nabla \mathcal{L}(\theta_t)$ often changes rapidly at each iteration $t$ due to the fact that the loss is being computed over different data. This is often partially mitigated by re-using the update step from the previous iteration, scaled by a momentum hyperparameter $\mu$, as follows:

\begin{align*} v_{t + 1} &= \mu v_t - \eta \nabla \mathcal{L}(\theta_t) \\ \theta_{t + 1} &= \theta_t + v_{t+1} \end{align*}

It has been argued that including the previous gradient step has the effect of approximating some second-order information about the gradient.
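A minimal sketch of one momentum update (again with an assumed `grad_fn`; the velocity `v` is state carried between iterations, and the default values are illustrative):

```python
def momentum_step(theta, v, grad_fn, eta=0.01, mu=0.9):
    """One momentum update: v is a decaying accumulation of past gradient steps."""
    v = mu * v - eta * grad_fn(theta)
    return theta + v, v
```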

Nesterov's Accelerated Gradient

In Nesterov's Accelerated Gradient (NAG), the gradient of the loss at each step is computed at $\theta_t + \mu v_t$ instead of $\theta_t$. In momentum, the parameter update could be written $\theta_{t + 1} = \theta_t + \mu v_t - \eta \nabla \mathcal{L}(\theta_t)$, so NAG effectively computes the gradient at the new parameter location but without considering the gradient term. In practice, this causes NAG to behave more stably than regular momentum in many situations. A more thorough analysis can be found in 1). The update rules are then as follows:

\begin{align*} v_{t + 1} &= \mu v_t - \eta \nabla\mathcal{L}(\theta_t + \mu v_t) \\ \theta_{t + 1} &= \theta_t + v_{t+1} \end{align*}
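The only change relative to the momentum sketch above is where the gradient is evaluated:

```python
def nag_step(theta, v, grad_fn, eta=0.01, mu=0.9):
    """One NAG update: the gradient is evaluated at the look-ahead point theta + mu * v."""
    v = mu * v - eta * grad_fn(theta + mu * v)
    return theta + v, v
```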

Adagrad

Adagrad effectively rescales the learning rate for each parameter according to the history of the gradients for that parameter. This is done by dividing each term in $\nabla \mathcal{L}$ by the square root of the sum of squares of its historical gradients. Rescaling in this way effectively lowers the learning rate for parameters which consistently have large gradient values. It also effectively decreases the learning rate over time, because the sum of squares continues to grow as training proceeds. After initializing the rescaling term $g_0 = 0$, the updates are as follows: \begin{align*} g_{t + 1} &= g_t + \nabla \mathcal{L}(\theta_t)^2 \\ \theta_{t + 1} &= \theta_t - \frac{\eta\nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon} \end{align*} where division is elementwise and $\epsilon$ is a small constant included for numerical stability. Adagrad has nice theoretical guarantees and empirical results 2) 3).
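A sketch of one Adagrad step (operating elementwise on NumPy arrays; `grad_fn` and the default `eta` and `eps` values are illustrative assumptions):

```python
import numpy as np

def adagrad_step(theta, g, grad_fn, eta=0.01, eps=1e-8):
    """One Adagrad update: g accumulates squared gradients, so parameters with
    consistently large gradients get an ever-smaller effective learning rate."""
    grad = grad_fn(theta)
    g = g + grad ** 2
    theta = theta - eta * grad / (np.sqrt(g) + eps)
    return theta, g
```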

RMSProp

In its originally proposed form 4), RMSProp is very similar to Adagrad. The only difference is that the $g_t$ term is computed as an exponentially decaying average instead of an accumulated sum. This makes $g_t$ an estimate of the second moment of $\nabla \mathcal{L}$ and avoids Adagrad's effective learning rate shrinking continually over time. The name “RMSProp” comes from the fact that the update step is normalized by a decaying RMS of recent gradients. The update is as follows:

\begin{align*} g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ \theta_{t + 1} &= \theta_t - \frac{\eta\nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t + 1}} + \epsilon} \end{align*}
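A sketch, differing from the Adagrad sketch above only in how `g` is accumulated (defaults are again illustrative):

```python
import numpy as np

def rmsprop_step(theta, g, grad_fn, eta=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: g is an exponential moving average of squared gradients."""
    grad = grad_fn(theta)
    g = gamma * g + (1 - gamma) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(g) + eps)
    return theta, g
```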

In the original lecture slides where it was proposed, $\gamma$ is set to $.9$. In 5), it is shown that the $\sqrt{g_{t + 1}}$ term approximates (in expectation) the diagonal of the absolute value of the Hessian matrix (assuming the update steps are $\mathcal{N}(0, 1)$ distributed). It is also argued that the absolute value of the Hessian is better to use for non-convex problems which may have many saddle points.

Alternatively, in 6), a first-order moment estimate $m_t$ is added. It is included in the denominator of the preconditioner so that the update is effectively normalized by the standard deviation of $\nabla \mathcal{L}$. There is also a $v_t$ term included for momentum. This gives

\begin{align*} m_{t + 1} &= \gamma m_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t) \\ g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ v_{t + 1} &= \mu v_t - \frac{\eta \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} - m_{t+1}^2 + \epsilon}} \\ \theta_{t + 1} &= \theta_t + v_{t + 1} \end{align*}
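A sketch of this variant (the hyperparameter defaults below are placeholders, not values taken from the paper):

```python
import numpy as np

def rmsprop_graves_step(theta, m, g, v, grad_fn, eta=1e-4, gamma=0.95, mu=0.9, eps=1e-4):
    """Graves-style RMSProp: normalize by an estimate of the gradient's standard
    deviation (sqrt of the variance estimate g - m^2), with momentum via v."""
    grad = grad_fn(theta)
    m = gamma * m + (1 - gamma) * grad       # first-moment (mean) estimate
    g = gamma * g + (1 - gamma) * grad ** 2  # second-moment estimate
    v = mu * v - eta * grad / np.sqrt(g - m ** 2 + eps)
    return theta + v, m, g, v
```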

Adadelta

Adadelta 7) uses the same exponentially decaying moving average estimate of the gradient second moment $g_t$ as RMSProp. It also computes a moving average $x_t$ of the updates $v_t$ similar to momentum, but when updating this quantity it squares the current step, which I don't have any intuition for.

\begin{align*} g_{t + 1} &= \gamma g_t + (1 - \gamma) \nabla \mathcal{L}(\theta_t)^2 \\ v_{t + 1} &= -\frac{\sqrt{x_t + \epsilon} \nabla \mathcal{L}(\theta_t)}{\sqrt{g_{t+1} + \epsilon}} \\ x_{t + 1} &= \gamma x_t + (1 - \gamma) v_{t + 1}^2 \\ \theta_{t + 1} &= \theta_t + v_{t + 1} \end{align*}
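A sketch of one Adadelta step (note that no learning rate $\eta$ appears; `gamma` and `eps` defaults are illustrative):

```python
import numpy as np

def adadelta_step(theta, g, x, grad_fn, gamma=0.95, eps=1e-6):
    """One Adadelta update: the step is scaled by the ratio of the RMS of recent
    updates (x) to the RMS of recent gradients (g), so no learning rate is needed."""
    grad = grad_fn(theta)
    g = gamma * g + (1 - gamma) * grad ** 2
    v = -np.sqrt(x + eps) * grad / np.sqrt(g + eps)
    x = gamma * x + (1 - gamma) * v ** 2
    return theta + v, g, x
```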

Adam

Adam is somewhat similar to Adagrad/Adadelta/RMSProp in that it computes a decayed moving average of the gradient and squared gradient (first and second moment estimates) at each time step. It differs mainly in two ways: First, the first-order moment moving average coefficient is decayed over time. Second, because the first and second order moment estimates are initialized to zero, some bias correction is used to counteract the resulting bias towards zero. The use of the first and second order moment estimates ensures that, in most cases, the magnitude of each update step is bounded by approximately $\eta$. However, as $\theta_t$ approaches a true minimum, the signal-to-noise ratio of the gradient estimate typically decreases, and the effective step size shrinks accordingly. The update is also invariant to the scale of the gradients. Given hyperparameters $\gamma_1$, $\gamma_2$, $\lambda$, and $\eta$, and setting $m_0 = 0$ and $g_0 = 0$ (note that the paper denotes $\gamma_1$ as $\beta_1$, $\gamma_2$ as $\beta_2$, $\eta$ as $\alpha$ and $g_t$ as $v_t$), the update rule is as follows: 8)

\begin{align*} m_{t + 1} &= \gamma_1 m_t + (1 - \gamma_1) \nabla \mathcal{L}(\theta_t) \\ g_{t + 1} &= \gamma_2 g_t + (1 - \gamma_2) \nabla \mathcal{L}(\theta_t)^2 \\ \hat{m}_{t + 1} &= \frac{m_{t + 1}}{1 - \gamma_1^{t + 1}} \\ \hat{g}_{t + 1} &= \frac{g_{t + 1}}{1 - \gamma_2^{t + 1}} \\ \theta_{t + 1} &= \theta_t - \frac{\eta \hat{m}_{t + 1}}{\sqrt{\hat{g}_{t + 1}} + \epsilon} \end{align*}
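A sketch of a single Adam step (the $\lambda$-based decay of $\gamma_1$ over time is omitted for simplicity; `grad_fn` and the defaults are illustrative assumptions):

```python
import numpy as np

def adam_step(theta, m, g, t, grad_fn, eta=0.001, gamma1=0.9, gamma2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed iteration count used for bias correction."""
    grad = grad_fn(theta)
    m = gamma1 * m + (1 - gamma1) * grad       # first-moment estimate
    g = gamma2 * g + (1 - gamma2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - gamma1 ** t)              # bias-corrected first moment
    g_hat = g / (1 - gamma2 ** t)              # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(g_hat) + eps)
    return theta, m, g
```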

ESGD

9)

Adasecant

10)

vSGD

11)

Rprop

12)

1) Sutskever, Martens, Dahl, and Hinton, “On the importance of initialization and momentum in deep learning” (ICML 2013)
2) Dyer, “Notes on AdaGrad”
3) Duchi, Hazan, and Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization” (COLT 2010)
4) Hinton, Srivastava, and Swersky, “rmsprop: Divide the gradient by a running average of its recent magnitude”
5), 9) Dauphin, de Vries, Chung, and Bengio, “RMSProp and equilibrated adaptive learning rates for non-convex optimization”
6) Graves, “Generating Sequences with Recurrent Neural Networks”
7) Zeiler, “Adadelta: An Adaptive Learning Rate Method”
8) Kingma and Ba, “Adam: A Method for Stochastic Optimization”
10) Gulcehre and Bengio, “Adasecant: Robust Adaptive Secant Method for Stochastic Gradient”
11) Schaul, Zhang, LeCun, “No More Pesky Learning Rates”
12) Riedmiller and Braun, “A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm”
