【论文翻译】An overview of gradient descent optimization algorithms
这篇论文最早是2016年1月16日发表在Sebastian Ruder博客上的一篇博文。本文主要工作是对这篇论文中与李宏毅课程相关的核心部分进行翻译。
论文翻译:
An overview of gradient descent optimization algorithms
梯度下降优化算法概述
0. Abstract 摘要:
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
梯度下降优化算法虽然越来越流行,但通常被当作黑盒优化器使用,因为很难找到对其优缺点的实际解释。
This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.
这篇论文旨在帮助读者建立对于不同算法性能表现的直觉,以便更好地使用这些算法。
In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
这篇论文介绍了梯度下降的几种变体,总结了训练中面临的挑战,介绍了最常用的优化算法,回顾了并行和分布式环境下的架构,并探讨了优化梯度下降的其他策略。
1. Introduction 引言:
Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
梯度下降是最流行的优化算法之一,也是目前最常用的神经网络优化方法。
At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).
同时,各种最新的深度学习库(如:lasagne,caffe,keras)都实现了很多种梯度下降的优化算法。
These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
然而,这些算法通常被当作黑盒优化器使用,因为很难找到对它们优缺点的实际解释。
This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.
这篇论文旨在帮助读者建立对于不同梯度下降优化算法的性能表现的直觉,以便更好地使用这些算法。
In section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.
在第二章,我们首先看一下不同的梯度下降算法,然后在第三章,简要总结一下算法训练过程中面临的挑战。
Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
接下来,第四章介绍了最常见的优化算法,说明它们解决这些挑战的动机,以及如何由此推导出各自的更新规则。
Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
然后,在第五章简要介绍了在并行及分布式架构下梯度下降的优化算法及框架。
Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.
最后,第六章介绍了一些其他有用的梯度下降优化策略。
Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in R^d\) by updating the parameters in the opposite direction of the gradient of the objective function \({\nabla}_{\theta} J({\theta})\) w.r.t. to the parameters.
梯度下降是一种最小化目标函数 \(J(\theta)\) 的方法,其中 \(\theta \in R^d\) 为模型参数;它通过沿目标函数关于参数的梯度 \({\nabla}_{\theta} J({\theta})\) 的反方向更新参数来实现最小化。
The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.
学习率 \(\eta\) 确定了我们逼近(局部)最小值的步长。
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
换而言之,就是我们沿着目标函数所构成的曲面的斜坡方向向下走,直到到达谷底。
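译者注:下面给出一个极简的示例草图,用一个假设的一维目标函数 \(J(\theta) = \theta^2\) 来演示上面的更新规则(函数、初始值与学习率都是为演示而假设的,并非论文内容):
# 目标函数 J(theta) = theta^2,其梯度为 dJ/dtheta = 2 * theta
def grad_J(theta):
    return 2 * theta

theta = 5.0          # 假设的初始参数
learning_rate = 0.1  # 学习率 eta,演示用取值

for step in range(100):
    # 沿梯度的反方向更新参数,步长由学习率决定
    theta = theta - learning_rate * grad_J(theta)

print(theta)  # 结果应非常接近最小值点 theta = 0
可以看到,学习率越小,每一步走得越短,收敛越慢;学习率过大则可能在最小值附近来回震荡。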
2. Gradient descent variants 梯度下降的变体
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
梯度下降有三种变体,它们的不同之处在于用来计算目标函数梯度的数据量不同。
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
根据数据量的不同,我们在参数更新的精度和更新时间之间作出权衡。
2.1 Batch gradient descent 批量梯度下降
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. to the parameters \(\theta\) for the entire training dataset:
Vanilla梯度下降,也叫作批量梯度下降,通过整个训练数据集,计算损失函数关于参数 \(\theta\) 的梯度:
\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta})\) ---- (1)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
由于每次更新都需要通过整个数据集来计算梯度,所以批量梯度下降的计算速度很慢,而且对于超出内存限制的数据量很难处理。
Batch gradient descent also does not allow us to update our model online, i.e. with new examples on-the-fly.
批量梯度下降也不允许在线更新模型,也就是在运行中不能添加新的样本数据。
In code, batch gradient descent looks something like this:
批量梯度下降的代码如下:
for i in range(nb_epochs):
    # 在整个训练集上计算损失函数关于参数 params 的梯度
    params_grad = evaluate_gradient(loss_function, data, params)
    # 沿梯度反方向更新参数,步长由学习率决定
    params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params.
对于预先设定的迭代轮数 epochs,我们首先在整个数据集上计算损失函数关于参数向量 params 的梯度向量 params_grad。
Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters.
注意,很多最新的深度学习库都提供了自动求导的功能,可以高效地计算关于参数的梯度。
If you derive the gradients yourself, then gradient checking is a good idea.
如果你是自己推导梯度,那么最好做一下梯度检查。
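译者注:下面是一个梯度检查的示意性草图(并非论文代码),用中心差分得到的数值梯度来验证手写的解析梯度;其中的损失函数和参数都是为演示而假设的:
import numpy as np

def loss(params):
    # 演示用的简单损失函数:J(theta) = sum(theta_i^2)
    return np.sum(params ** 2)

def analytic_grad(params):
    # 手动推导的解析梯度:dJ/dtheta_i = 2 * theta_i
    return 2 * params

def numerical_grad(f, params, eps=1e-5):
    # 用中心差分逐维近似梯度:(f(theta+eps) - f(theta-eps)) / (2*eps)
    grad = np.zeros_like(params)
    for i in range(params.size):
        e = np.zeros_like(params)
        e[i] = eps
        grad[i] = (f(params + e) - f(params - e)) / (2 * eps)
    return grad

params = np.random.randn(5)
diff = np.linalg.norm(analytic_grad(params) - numerical_grad(loss, params))
print(diff)  # 若解析梯度正确,该差值应非常接近 0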
We then update our parameters in the opposite direction of the gradients with the learning rate determining how big of an update we perform.
接下来我们沿着负梯度方向更新参数,更新参数的步长由学习率决定。
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
批量梯度下降保证最终将收敛到凸函数的全局最小值,或者非凸函数的局部最小值。
2.2 Stochastic gradient descent 随机梯度下降
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\) :
相对而言,随机梯度下降算法(SGD)是对其中一个训练样本(\(x^{(i)}, y^{(i)}\))求梯度并更新参数:
\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta}; x^{(i)}; y^{(i)})\) ---- (2)
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
批量梯度下降在大数据集上会有很多冗余计算,因为它在每次更新参数时重复计算相似样本的梯度。
SGD does away with this redundancy by performing one update at a time.
随机梯度下降(SGD)每次通过单个样本更新参数以消除冗余。
It is therefore usually much faster and can also be used to learn online.
因此它通常速度更快且可以在线学习。
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Figure 1.
SGD以高方差进行频繁的参数更新,导致目标函数如图1所示剧烈波动。
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.
批量梯度下降会收敛到参数所在波谷的最小值,而SGD的波动一方面使它有可能跳到一个新的、可能更好的局部最小值。
On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
另一方面,这种波动最终也让收敛到确切的最小值变得更加复杂,因为SGD会不断地越过(overshoot)最小值。
However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
然而,事实证明,当我们逐渐地减小学习率,SGD表现出和批量梯度下降一样的收敛效果,同样收敛到了局部最小值(非凸)或全局最小值(凸优化)。
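译者注:下面给出一个"逐渐减小学习率"的示意性草图(非论文代码),沿用上文伪代码中的 nb_epochs、data、evaluate_gradient、loss_function、params 等记号;所用的衰减公式和超参数取值只是常见做法之一,均为演示而假设:
initial_lr = 0.1   # 初始学习率,演示用取值
decay = 0.01       # 衰减系数,演示用取值

for epoch in range(nb_epochs):
    # 学习率随 epoch 逐渐减小:eta_t = eta_0 / (1 + decay * t)
    learning_rate = initial_lr / (1 + decay * epoch)
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad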
Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example.
SGD的代码片段只是增加了一层对训练样本的循环,并对每个样本计算梯度。
Note that we shuffle the training data at every epoch as explained in Section 6.1.
注意,如6.1节所述,我们在每个epoch都要先对训练数据进行"洗牌"。
for i in range(nb_epochs):
    np.random.shuffle(data)  # 每个 epoch 开始前打乱训练数据
    for example in data:
        # 针对单个样本计算梯度并立即更新参数
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
2.3 Mini-batch gradient descent 小批量梯度下降
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:
小批量梯度下降结合了上面两种方法的优点,每次用 \(n\) 个训练样本组成的小批量进行一次参数更新:
\(\theta = \theta - \eta \cdot {\nabla}_{\theta} J ({\theta}; x^{(i:i+n)}; y^{(i:i+n)})\) ---- (3)
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence;
and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
这样一来,一方面降低了参数更新的方差,使收敛更平稳;另一方面,可以利用最新深度学习库中高度优化的矩阵运算,非常高效地计算小批量的梯度。
Common mini-batch sizes range between 50 and 256, but can vary for different applications.
小批量的大小一般在50~256之间,也可以根据具体应用来调整。
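译者注:下面用一个线性回归的小例子(非论文内容,模型、数据和学习率均为演示而假设)示意矩阵运算如何一次性算出整个小批量的梯度:
import numpy as np

np.random.seed(0)
X_batch = np.random.randn(50, 10)   # 一个大小为50的小批量,每个样本有10维特征
y_batch = np.random.randn(50)
theta = np.zeros(10)

# 损失函数取 (1/2n) * ||X·theta - y||^2,其梯度为 X^T (X·theta - y) / n
preds = X_batch @ theta
grad = X_batch.T @ (preds - y_batch) / len(y_batch)

theta = theta - 0.01 * grad  # 一次矩阵乘法即可完成整个小批量的更新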
Mini-batch gradient descent is typically the algorithm of choice when training a neural network and the term SGD usually is employed also when mini-batches are used.
训练神经网络时,小批量梯度下降通常是首选算法;而且即使使用的是小批量,人们通常也把它称作SGD。
Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.
注意:为了简便起见,下文对于SGD的改进中我们省略了\(x^{(i:i+n)}; y^{(i:i+n)}\)参数。
In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
代码如下。不同于之前遍历每个单一样本,我们现在迭代的是每个大小为50个样本的小批量:
for i in range(nb_epochs):
    np.random.shuffle(data)  # 每个 epoch 开始前打乱训练数据
    for batch in get_batches(data, batch_size=50):
        # 在每个大小为50的小批量上计算梯度并更新参数
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
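译者注:上面伪代码中的 get_batches 并没有给出实现,下面是一个假设性的简单实现草图,按固定的 batch_size 顺序切分已打乱的数据:
def get_batches(data, batch_size=50):
    # 依次产出大小为 batch_size 的小批量(最后一批可能不足 batch_size)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]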