【论文翻译】An overview of gradient descent optimization algorithms
这篇论文最早是2016年1月16日发表在Sebastian Ruder博客上的一篇博文。本文的主要工作是对这篇论文中与李宏毅课程相关的核心部分进行翻译。
论文翻译:
An overview of gradient descent optimization algorithms
梯度下降优化算法概述
0. Abstract 摘要:
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
梯度下降优化算法虽然越来越流行,但常常被当作黑盒优化器来使用,因为很难找到关于它们优缺点的实际解释。
This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.
这篇论文旨在帮助读者建立对于不同算法性能表现的直觉,以便更好地使用这些算法。
In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
在这篇综述中,我们首先考察梯度下降的不同变体,然后总结其面临的挑战,介绍最常用的优化算法,回顾并行和分布式环境下的架构,并探讨有助于优化梯度下降的其他策略。
1. Introduction 引言:
Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
梯度下降是最流行的优化算法之一,也是迄今为止最常用的神经网络优化方法。
At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).
同时,各种最先进的深度学习库(如 lasagne、caffe 和 keras 的文档)都实现了多种优化梯度下降的算法。
These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
然而,这些算法通常用作黑盒优化,很难对于它们的优缺点作出实际解释。
This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.
这篇论文旨在帮助读者建立对于不同梯度下降优化算法的性能表现的直觉,以便更好地使用这些算法。
In section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.
在第二章,我们首先看一下不同的梯度下降算法,然后在第三章,简要总结一下算法训练过程中面临的挑战。
Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
接下来,在第四章介绍了最常见的优化算法,说明它们解决这些挑战的动机,以及如何由此推导出各自的更新规则。
Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
然后,在第五章简要介绍了在并行及分布式架构下梯度下降的优化算法及框架。
Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.
最后,第六章介绍了一些其他有用的梯度下降优化策略。
Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in R^d\) by updating the parameters in the opposite direction of the gradient of the objective function \({\nabla}_{\theta} J({\theta})\) w.r.t. to the parameters.
梯度下降是一种最小化目标函数 \(J(\theta)\) 的方法,其中 \(\theta \in R^d\) 是模型参数;它通过沿目标函数关于参数的梯度 \({\nabla}_{\theta} J({\theta})\) 的反方向更新参数来实现。
The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.
学习率 \(\eta\) 确定了我们逼近(局部)最小值的步长。
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
换而言之,就是我们沿着目标函数所构成的曲面的斜坡方向向下走,直到到达谷底。
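As a concrete illustration of this update rule, the following minimal sketch (not part of the original paper; the objective, starting point and learning rate are chosen purely for illustration) minimizes the one-dimensional objective \(J(\theta) = \theta^2\) with plain gradient descent.
作为对上述更新规则的一个直观说明,下面给出一个极简示例(非论文原文,目标函数、初始值和学习率均为示例而设),用普通梯度下降最小化一维目标函数 \(J(\theta) = \theta^2\)。
def grad_J(theta):
  # gradient of J(theta) = theta^2 is dJ/dtheta = 2 * theta
  return 2 * theta

theta = 5.0   # initial parameter value
eta = 0.1     # learning rate (step size)
for step in range(100):
  # move in the opposite direction of the gradient
  theta = theta - eta * grad_J(theta)
print(theta)  # ends up very close to 0, the (global) minimum
With a larger learning rate the steps overshoot the valley, while a smaller one makes convergence slower; this is exactly the trade-off the learning rate controls.
学习率过大时,步长会越过谷底;学习率过小时,收敛又会变慢,这正是学习率所控制的权衡。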
2. Gradient descent variants 梯度下降的变体
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
梯度下降有三种变体,它们的不同之处在于用来计算目标函数梯度的数据量不同。
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
根据数据量的不同,我们在参数更新的精度和更新时间之间作出权衡。
2.1 Batch gradient descent 批量梯度下降
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters \(\theta\) for the entire training dataset:
Vanilla梯度下降,也叫作批量梯度下降,通过整个训练数据集,计算损失函数关于参数 \(\theta\) 的梯度:
\(\theta = \theta - \eta · {\nabla}_{\theta} J ({\theta})\) ---- (1)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
由于每次更新都需要通过整个数据集来计算梯度,所以批量梯度下降的计算速度很慢,而且对于超出内存限制的数据量很难处理。
Batch gradient descent also does not allow us to update our model online, i.e. with new examples on-the-fly.
批量梯度下降也不允许在线更新模型,也就是在运行中不能添加新的样本数据。
In code, batch gradient descent looks something like this:
批量梯度下降的代码如下:
for i in range(nb_epochs):
  # compute the gradient of the loss over the entire training set
  params_grad = evaluate_gradient(loss_function, data, params)
  # take a step in the opposite direction of the gradient
  params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params.
对于预先给定的迭代次数 nb_epochs,我们首先在整个数据集上计算损失函数关于参数向量 params 的梯度向量 params_grad。
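The pseudocode leaves evaluate_gradient unspecified. Purely as an illustration (not the paper's definition), for a linear model with a mean-squared-error loss it could look like the sketch below, where data is assumed to be a (features, targets) pair of NumPy arrays and the loss_function argument is ignored because the MSE gradient is hard-coded.
伪代码中没有给出 evaluate_gradient 的定义。仅作说明(并非论文的定义),对于使用均方误差损失的线性模型,它可以写成下面的草图;其中假设 data 是 (特征, 标签) 的 NumPy 数组对,而 loss_function 参数在这个简化草图中被忽略,因为 MSE 的梯度被直接写在函数里。
import numpy as np

def evaluate_gradient(loss_function, data, params):
  # illustrative sketch only: gradient of the mean squared error
  # 1/m * sum((X.dot(params) - y)^2) w.r.t. params for a linear model
  X, y = data                  # assumed layout: features X, targets y
  errors = X.dot(params) - y   # prediction errors of the linear model
  return 2.0 / len(y) * X.T.dot(errors)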
Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters.
注意,很多最新的深度学习库都提供了自动求导的功能,可以高效地计算关于参数的梯度。
If you derive the gradients yourself, then gradient checking is a good idea.
如果你是自己推导梯度,那么进行梯度检查(gradient checking)是个好主意。
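One common way to do such a gradient check is to compare the analytic gradient with a central finite-difference estimate. The sketch below is an illustration only; it assumes a loss_function(data, params) call signature and the evaluate_gradient helper from the pseudocode above, neither of which is specified by the paper.
一种常见的梯度检查方法是把解析梯度与中心差分的数值估计进行比较。下面的草图仅作说明;它假设 loss_function(data, params) 这样的调用形式,并沿用上面伪代码中的 evaluate_gradient,这两者在论文中都没有具体给出。
import numpy as np

def gradient_check(loss_function, data, params, eps=1e-5):
  # analytic gradient from the (assumed) evaluate_gradient helper
  analytic = evaluate_gradient(loss_function, data, params)
  numeric = np.zeros_like(params)
  for j in range(len(params)):
    shift = np.zeros_like(params)
    shift[j] = eps
    # central finite difference along dimension j
    numeric[j] = (loss_function(data, params + shift)
                  - loss_function(data, params - shift)) / (2 * eps)
  # relative error; small values (roughly 1e-7 to 1e-5) suggest a correct gradient
  return np.max(np.abs(analytic - numeric)
                / (np.abs(analytic) + np.abs(numeric) + 1e-12))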
We then update our parameters in the opposite direction of the gradients with the learning rate determining how big of an update we perform.
接下来我们沿着负梯度方向更新参数,更新参数的步长由学习率决定。
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
批量梯度下降保证最终将收敛到凸函数的全局最小值,或者非凸函数的局部最小值。
2.2 Stochastic gradient descent 随机梯度下降
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\) :
相对而言,随机梯度下降算法(SGD)是对其中一个训练样本(\(x^{(i)}, y^{(i)}\))求梯度并更新参数:
\(\theta = \theta - \eta · {\nabla}_{\theta} J (\theta; x^{(i)}; y^{(i)})\) ---- (2)
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
批量梯度下降在大数据集上会有很多冗余计算,因为它在每次更新参数时重复计算相似样本的梯度。
SGD does away with this redundancy by performing one update at a time.
随机梯度下降(SGD)每次通过单个样本更新参数以消除冗余。
It is therefore usually much faster and can also be used to learn online.
因此它通常速度更快且可以在线学习。
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Figure 1.
SGD 以高方差进行频繁的参数更新,导致目标函数如图 1 所示剧烈波动。
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.
批量梯度下降会收敛到参数所处盆地(basin)的最小值,而 SGD 的这种波动,一方面使它有可能跳到新的、潜在更好的局部最小值。
On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
另一方面,由于 SGD 会不断越过(overshoot)最小值,这最终使收敛到确切的最小值变得更加复杂。
However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
然而,已有研究表明,当我们缓慢减小学习率时,SGD 表现出与批量梯度下降相同的收敛行为:对于非凸优化和凸优化,分别几乎必然收敛到局部最小值或全局最小值。
Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example.
SGD 的代码片段只是增加了一层对训练样本的循环,并针对每个样本计算梯度。
Note that we shuffle the training data at every epoch as explained in Section 6.1.
注意,如 6.1 节所述,我们在每个 epoch 都会先将训练数据打乱(shuffle)。
for i in range(nb_epochs):
  # shuffle the training data at every epoch (see Section 6.1)
  np.random.shuffle(data)
  for example in data:
    # gradient w.r.t. a single training example
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
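As noted above, slowly decreasing the learning rate helps SGD converge. A minimal sketch of such an annealing schedule, layered on top of the SGD loop, might look as follows; the 1/t decay form and the constants are illustrative assumptions, not a prescription from the paper.
如上所述,缓慢减小学习率有助于 SGD 收敛。把这样的退火(annealing)策略加到上面的 SGD 循环中,可以写成下面的极简草图;其中 1/t 形式的衰减以及各常数都只是示例性的假设,并非论文给出的做法。
import numpy as np

initial_lr = 0.1   # illustrative starting learning rate
decay = 1e-4       # illustrative decay constant
step = 0
for i in range(nb_epochs):
  np.random.shuffle(data)
  for example in data:
    # anneal the learning rate as training progresses (1/t schedule)
    learning_rate = initial_lr / (1.0 + decay * step)
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad
    step += 1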
2.3 Mini-batch gradient descent 小批量梯度下降
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:
小批量梯度下降集两者之长,每次利用由 n 个训练样本组成的小批量(mini-batch)进行一次参数更新:
\(\theta = \theta - \eta · {\nabla}_{\theta} J ({\theta}; x^{(i:i+n)}; y^{(i:i+n)})\) ---- (3)
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence;
and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
这种方式,一方面减少了参数更新的方差,从而使收敛更加平稳;另一方面,可以利用最先进的深度学习库中高度优化的矩阵运算,非常高效地计算小批量上的梯度。
Common mini-batch sizes range between 50 and 256, but can vary for different applications.
小批量的大小一般在50~256之间,也可以根据具体应用来调整。
Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used.
小批量梯度下降通常是训练神经网络时的首选算法,而且即使使用了小批量,通常也仍然沿用 SGD 这一术语。
Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.
注意:为了简便起见,下文对于SGD的改进中我们省略了\(x^{(i:i+n)}; y^{(i:i+n)}\)参数。
In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
代码如下。不同于之前遍历每个单一样本,我们现在迭代的是每个大小为50个样本的小批量:
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    # gradient w.r.t. a mini-batch of 50 examples
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad
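The helper get_batches is not defined in the paper. Assuming data is a list or array that supports slicing, a minimal sketch of such a helper could be:
论文中并没有定义 get_batches 这个辅助函数。假设 data 是支持切片的列表或数组,这样一个辅助函数的极简草图可以写成:
def get_batches(data, batch_size=50):
  # yield successive slices of `data`, each containing up to batch_size examples
  for start in range(0, len(data), batch_size):
    yield data[start:start + batch_size]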