1. Basic Concepts (Momentum vs SGD)

Momentum is used to accelerate SGD (stochastic gradient descent) along directions of consistent descent and to dampen oscillations.

  • GD(gradient descent)

    θ = θ - η·∇_θ J(θ)

    for i in range(num_epochs):
        # gradient over the full dataset each epoch
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad
  • SGD(stochastic gradient descent)

    θ = θ - η·∇_θ J(θ; x^(i), y^(i))

    for i in range(num_epochs):
        np.random.shuffle(data)
        for example in data:
            # gradient from a single training example
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad
  • Momentum

    v_t = γ·v_{t-1} + η·∇_θ J(θ)
    θ = θ - v_t

    v = 0  # velocity, accumulated over epochs
    for i in range(num_epochs):
        params_grad = evaluate_gradient(loss_function, data, params)
        v = gamma*v + learning_rate*params_grad
        params = params - v

    Here γ is the momentum coefficient, which must satisfy γ < 1; typically γ = 0.9 or a smaller value is used. As shown in Section 2, γ can also be varied over the course of training.
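The updates above can be sketched as a minimal runnable example. This is an illustrative comparison (not code from the original post): both optimizers are run on a simple ill-conditioned quadratic, where the function names `gd` and `momentum` and the test objective are assumptions chosen for the demo.

```python
import numpy as np

def gd(grad, theta0, lr=0.1, steps=100):
    # Plain gradient descent: theta = theta - lr * grad(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum(grad, theta0, lr=0.1, gamma=0.9, steps=100):
    # Momentum update: v = gamma*v + lr*grad(theta); theta = theta - v
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta = theta - v
    return theta

# Ill-conditioned quadratic J(θ) = 0.5*(10*x² + y²), minimum at the origin.
grad = lambda th: np.array([10.0 * th[0], 1.0 * th[1]])

theta_gd = gd(grad, [1.0, 1.0], lr=0.05, steps=200)
theta_mom = momentum(grad, [1.0, 1.0], lr=0.05, steps=200)
print(theta_gd, theta_mom)  # both should be close to the origin
```

On curved valleys like this one, momentum damps the back-and-forth oscillation along the steep axis while building up speed along the shallow axis, which is exactly the behavior the prose above describes.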

2. Variable Momentum Schedule

maxepoch = 50;
initialmomentum = .5;
finalmomentum = .9;

for i = 1:maxepoch
    ...
    if i < maxepoch/2
        momentum = initialmomentum;
    else
        momentum = finalmomentum;
    end
    ...
end
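The same schedule can be sketched in Python (the variable names mirror the MATLAB snippet; the loop just records which momentum value each epoch would use):

```python
# Momentum schedule: start at 0.5, switch to 0.9 halfway through training.
max_epoch = 50
initial_momentum = 0.5
final_momentum = 0.9

schedule = []
for i in range(1, max_epoch + 1):  # epochs 1..max_epoch, as in the MATLAB loop
    momentum = initial_momentum if i < max_epoch / 2 else final_momentum
    schedule.append(momentum)

print(schedule[0], schedule[-1])  # 0.5 at the start, 0.9 at the end
```

The idea behind the switch: early in training the gradients change direction frequently, so a small momentum avoids overshooting; once the search settles into a consistent descent direction, a larger momentum speeds up convergence.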
posted on 2017-04-02 10:37 by 未雨愁眸