Gradient Descent and Variants - Convergence Rate Summary
Credits
- Ben Recht - Berkeley EE227C Convex Optimization Spring 2015
- Moritz Hardt - The Zen of Gradient Descent
- Yu. Nesterov - Efficiency of coordinate descent methods on huge-scale optimization problems
- Peter Richtarik, Martin Takac - Iteration Complexity of Randomized Block-Coordinate Descent Methods for Minimizing a Composite Function
Goals
Summarize the convergence rates of various gradient descent variants:
- Gradient Descent
- Gradient Descent with Momentum
- Stochastic Gradient Descent
- Coordinate Gradient Descent
with a focus on the last one.
1. Gradient Descent
1.1. Defining Algorithm
The goal here is to minimize a convex function $f:\mathbb{R}^n \to \mathbb{R}$ without constraints.
Definition [Convex Function] A function $f:\mathbb{R}^n \to \mathbb{R}$ is convex if the domain of $f$ is a convex set and, for all $x, y \in \operatorname{dom}(f)$ and $\lambda \in [0,1]$, we have
$$f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y).$$
Graphically, it means that if we connect two points on the graph of the function with a line segment, that segment lies above the function (for the points in between). When $f$ is differentiable we often work with a nicer characterization of convexity, as in
Theorem [First Order Condition] Suppose $f$ is differentiable. Then $f$ is convex iff its domain is convex and
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \qquad \forall x, y \in \operatorname{dom}(f).$$
Graphically, it means the tangent line (the first-order approximation) at any point lies below the function. Finally, we state the second-order condition for completeness.
Theorem [Second Order Condition] Assume that $f$ is twice differentiable, that is, its Hessian exists at each point in the domain of $f$, which is open. Then $f$ is convex iff its Hessian is positive semidefinite at every point, i.e. $\nabla^2 f(x) \succeq 0$ for all $x \in \operatorname{dom}(f)$.
Working with general convex functions turns out to be very hard, so we instead need the following condition.
Definition [L-Lipschitz Gradient] $f$ is said to have L-Lipschitz gradient iff
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \qquad \forall x, y.$$
Graphically, it means the function is not too curved: around any point it is upper bounded by a quadratic, $f(y) \le f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\|y-x\|^2$. This condition is needed in most convergence results for gradient descent. An additional condition makes life even easier; it is stated in
Definition [m-Strongly Convex] $f$ is strongly convex with constant $m$ iff
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\,\|y - x\|^2 \qquad \forall x, y.$$
Basically, it is the opposite of the L-Lipschitz gradient condition: it means the function is not too flat, being lower bounded by a quadratic. We know that at the minimum a function $f$ has gradient equal to $0$. As such, the L-Lipschitz gradient condition can be thought of as establishing an upper bound on the change in function values in terms of the change in input values, while strong convexity establishes the corresponding lower bound.
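To make this "sandwiched between two quadratics" picture concrete, combining the two conditions at a minimizer $x^\star$ (where $\nabla f(x^\star) = 0$) gives
$$\frac{m}{2}\,\|x - x^\star\|^2 \;\le\; f(x) - f(x^\star) \;\le\; \frac{L}{2}\,\|x - x^\star\|^2 \qquad \forall x.$$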
We are now ready to define the Gradient Descent algorithm:
Algorithm [Gradient Descent] For a step size $\alpha$ chosen beforehand:
- Initialize $x_0$
- For $k = 0, 1, 2, \ldots$, compute $x_{k+1} = x_k - \alpha \nabla f(x_k)$
Basically, it adjusts $x_k$ a little bit in the direction where $f$ decreases the most (the negative gradient direction). In practice, one often chooses a variable step size $\alpha_k$ instead of a constant $\alpha$.
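To make the algorithm concrete, here is a minimal sketch in Python on a made-up two-dimensional quadratic (the matrix A, the vector b, and the number of iterations are just for illustration):

```python
import numpy as np

# Toy objective: f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite, so f is strongly convex
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient (largest eigenvalue)
alpha = 1.0 / L                          # step size used in the theorems below

x = np.zeros(2)                          # x_0
for k in range(200):
    x = x - alpha * grad_f(x)            # x_{k+1} = x_k - alpha * grad f(x_k)

print(x, np.linalg.solve(A, b))          # iterate vs. the exact minimizer A^{-1} b
```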
1.2. Convergence Rate
Theorem [Rate for L-Lipschitz Gradient and m-Strongly Convex] If $f$ has L-Lipschitz gradient and is strongly convex with constant $m$, then the Gradient Descent algorithm converges to the right solution, and picking the step size $\alpha = 1/L$ we have
$$f(x_k) - f(x^\star) \le \frac{L}{2}\left(1 - \frac{m}{L}\right)^{k} \|x_0 - x^\star\|^2.$$
We say the function values converge linearly to the optimal value. Also, since we have the relation between function values and input values, $\|x_k - x^\star\|$ also converges linearly to $0$. Here $x^\star$ denotes the solution to the optimization problem. For an error threshold of $\epsilon$, we would need a number of iterations on the order of $\log\frac{1}{\epsilon}$ to find a solution within that error threshold.
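The $\log\frac{1}{\epsilon}$ iteration count follows from the linear rate in one line: since $1 - \frac{m}{L} \le e^{-m/L}$,
$$\left(1 - \frac{m}{L}\right)^{k} \le e^{-km/L} \le \epsilon \quad \text{as soon as} \quad k \ge \frac{L}{m}\log\frac{1}{\epsilon},$$
so, ignoring the constant in front of the bound, on the order of $\frac{L}{m}\log\frac{1}{\epsilon}$ iterations suffice.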
Theorem [Rate for L-Lipschitz Gradient] If $f$ has L-Lipschitz gradient, then with step size $\alpha = 1/L$,
$$f(x_k) - f(x^\star) \le \frac{L\,\|x_0 - x^\star\|^2}{2k}.$$
The convergence rate is not as good, since we are in a more general case. We say the function values converge sublinearly, at a $O(1/k)$ rate. For an error threshold of $\epsilon$, we now need on the order of $\frac{1}{\epsilon}$ iterations to find a solution within that error threshold. Note that this is much worse than the previous result.
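Again, the iteration count is read off directly from the bound:
$$\frac{L\,\|x_0 - x^\star\|^2}{2k} \le \epsilon \quad \Longleftrightarrow \quad k \ge \frac{L\,\|x_0 - x^\star\|^2}{2\epsilon}.$$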
We quickly mention the (Nesterov) momentum method here. Basically, at each iteration, instead of updating $x_k$ along the direction of the current gradient, it updates along a weighted combination of the gradients computed so far, with more weight on the recent gradients. I don't think it's exactly like that, but that is the idea: reuse the previously computed gradients.
Algorithm [Nesterov Momentum] The update rule for Nesterov's method, for a constant step size $\alpha$ and momentum rate $\beta$, can be written as
$$y_k = x_k + \beta\,(x_k - x_{k-1}), \qquad x_{k+1} = y_k - \alpha \nabla f(y_k).$$
If we were careful with the analysis before, for a function with L-Lipschitz gradient that is strongly convex with parameter $m$, the number of iterations needed is $O\!\left(\frac{L}{m}\log\frac{1}{\epsilon}\right)$. With the Nesterov method, we get an improvement to $O\!\left(\sqrt{\frac{L}{m}}\log\frac{1}{\epsilon}\right)$. Similarly, for L-Lipschitz gradient only, the iteration count before was $O\!\left(\frac{L}{\epsilon}\right)$; with the Nesterov momentum method it becomes $O\!\left(\sqrt{\frac{L}{\epsilon}}\right)$. So the Nesterov momentum method gives a noticeably better rate for very little extra computational cost.
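For completeness, here is a minimal Python sketch of the momentum update on the same kind of toy quadratic; the particular choice of $\beta$ below is one common setting for strongly convex problems, not the only one:

```python
import numpy as np

# Same toy quadratic as before: f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_f = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()                               # smoothness constant
m = np.linalg.eigvalsh(A).min()                               # strong convexity constant
alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))  # a common momentum choice

x_prev = np.zeros(2)
x = np.zeros(2)
for k in range(200):
    y = x + beta * (x - x_prev)           # extrapolate using the previous iterate
    x_prev, x = x, y - alpha * grad_f(y)  # gradient step taken at the extrapolated point y

print(x, np.linalg.solve(A, b))           # iterate vs. the exact minimizer
```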
2. Coordinate Descent
2.1. Defining Algorithm
In Machine Learning, we sometimes work with cases where the dimension is too big, or there are too many data points. Consider a data matrix $X \in \mathbb{R}^{m \times n}$. If $m$ is too big, one can do Stochastic (mini-batch) Gradient Descent, which, instead of calculating the gradient on all $m$ data points, approximates the gradient with only $b$ data points, where $b$ is the batch size (for example $b = 128$ while $m \approx 1{,}000{,}000$). On the other hand, if $n$ is big, we can update a few coordinates per iteration instead of updating all $n$ dimensions. This is Coordinate Descent.
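To make the mini-batch idea concrete, here is a rough Python sketch of the gradient estimate for a least-squares objective; the data X, y, the batch size, and the step size are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m_data, n_dim = 10_000, 50
X = rng.normal(size=(m_data, n_dim))      # data matrix, one row per data point
y = rng.normal(size=m_data)

def minibatch_grad(w, batch_size=128):
    """Estimate the gradient of the least-squares loss using only batch_size rows of X."""
    idx = rng.choice(m_data, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(n_dim)
for k in range(1000):
    w = w - 0.01 * minibatch_grad(w)      # SGD step with a small constant step size
```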
For those problems where calculating a coordinate gradient (i.e. a partial derivative) is cheap, it turns out that the rate for coordinate descent is as good as for typical gradient descent. First let's define the Lipschitz-gradient condition coordinate-wise.
Definition [Coordinate-wise Lipschitz Gradient] $f$ has coordinate-wise Lipschitz gradient with constant $L_i$ at coordinate $i$ iff
$$\left|\frac{\partial f}{\partial x_i}(x + h_i) - \frac{\partial f}{\partial x_i}(x)\right| \le L_i \,\|h_i\| \qquad \forall x,$$
where $h_i \in \mathbb{R}^n$ is zero everywhere except at coordinate $i$.
We assume our function $f$ has coordinate-wise Lipschitz gradient with constants $L_i,\ i = 1, 2, \ldots, n$. Then the Randomized Coordinate Descent Method is defined as follows:
Algorithm [Randomized Coordinate Descent]
- Pick an initial point $x_0$
- For $k = 0, 1, 2, \ldots$
- pick a coordinate $i$ uniformly at random
- compute $x_{k+1} = x_k - \frac{1}{L_i}\,\nabla f(x_k)[i]\, e_i$, where $e_i$ is the $i$-th standard basis vector (so only coordinate $i$ is updated).
Here we borrow the notation $\nabla f(x)[i]$ from array indexing to mean the $i$-th element of the vector $\nabla f(x)$. A lot of things can be relaxed from this: for example, the probability distribution over coordinates can be general, not uniform. Instead of single coordinates, one can update blocks of coordinates. One can also add a regularization term such as $\ell_1$ (Lasso) or $\ell_2$ (Ridge); see the paper by Peter Richtarik and Martin Takac for details. One can also work with more general norms (in the Lipschitz condition). We state only this simple case for simplicity.
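As a concrete illustration, here is a minimal Python sketch of Randomized Coordinate Descent on the quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$, where the partial derivatives are cheap and the coordinate-wise constants are simply $L_i = A_{ii}$ (the matrix and vector are made up for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
M = rng.normal(size=(n, n))
A = M.T @ M / n + np.eye(n)               # positive definite, so f is strongly convex
b = rng.normal(size=n)

L_coord = np.diag(A).copy()               # L_i = A_ii: coordinate-wise Lipschitz constants

x = np.zeros(n)
for k in range(20_000):
    i = rng.integers(n)                   # pick a coordinate uniformly at random
    g_i = A[i] @ x - b[i]                 # grad f(x)[i], a single partial derivative
    x[i] -= g_i / L_coord[i]              # update only coordinate i

print(np.linalg.norm(A @ x - b))          # gradient norm at the final iterate (should be small)
```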
2.2. Convergence in Expectation
Theorem [Rate for Coordinate Descent with Coordinate-wise Lipschitz Gradient] If we run the above algorithm on a function with coordinate-wise Lipschitz gradient, we have a bound of the form
$$\mathbb{E}\big[f(x_k)\big] - f(x^\star) \le \frac{2n}{k+4}\, R^2(x_0),$$
for $R(x_0)$ a measure of the size of the initial level set in a norm weighted by the constants $L_i$ (see Nesterov's paper for the precise definition).
So basically, we have the same sublinear $O(1/k)$ rate in expectation, very similar to Gradient Descent. Analogously, the result for strongly convex functions (globally, not coordinate-wise) is stated in
Theorem [Rate for Coordinate Descent with Coordinate-wise Lipschitz Gradient and Strong Convexity m] If we run the above algorithm and $f$ is in addition strongly convex with parameter $m$, we have a bound of the form
$$\mathbb{E}\big[f(x_k)\big] - f(x^\star) \le \left(1 - \frac{m}{n}\right)^{k} \big(f(x_0) - f(x^\star)\big),$$
where $m$ is measured in the same $L_i$-weighted norm (again, see Nesterov's paper for the exact statement).
Note that here $m$ is the strong convexity parameter, not the number of observations as we used it before. For those of you who are curious, this result and the previous theorem are in Nesterov's paper (his Theorem 1 and Theorem 2), applied to the case $\alpha = 0$, which implies $S_\alpha(f) = n$.
So basically, for strongly convex functions with coordinate-wise Lipschitz gradient, we also get a linear convergence rate in expectation for Coordinate Descent.
2.3. High Probability Statement
One might also wonder: maybe it works on average, but we only run it once, so what is the probability that the result from that one run is good? It turns out that the result is good with high probability, as shown in the Peter Richtarik and Martin Takac paper. The idea is to use Markov's inequality to convert a statement in expectation into a high-probability statement. To summarize, for a fixed confidence level $\rho \in (0,1)$, if we pick the number of iterations $k$ on the order of $\frac{n}{\epsilon}$ (up to logarithmic factors; see the paper for the exact expression),
then $\mathbb{P}\big[f(x_k) - f(x^\star) \le \epsilon\big] \ge 1 - \rho$, provided the function has coordinate-wise Lipschitz gradient.
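The conversion is essentially Markov's inequality applied to the nonnegative random variable $f(x_k) - f(x^\star)$:
$$\mathbb{P}\big[f(x_k) - f(x^\star) \ge \epsilon\big] \le \frac{\mathbb{E}\big[f(x_k) - f(x^\star)\big]}{\epsilon},$$
so once the expectation is driven below $\epsilon\rho$, the failure probability is at most $\rho$ (the argument in the paper is a bit more refined, but this is the idea).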
If, in addition, the function is strongly convex, then the number of iterations needed is only on the order of $\log\frac{1}{\epsilon}$ (again with an extra factor of $n$ compared to Gradient Descent).
Staring at these high-probability results, we see that the number of iterations needed is almost identical to the case of vanilla Gradient Descent: a $\frac{1}{\epsilon}$ dependence for Lipschitz gradient, and $\log\frac{1}{\epsilon}$ if we have strong convexity in addition. The iteration count is, however, $n$ times larger, which is fine because each iteration of Coordinate Descent is approximately $n$ times cheaper than an iteration of Gradient Descent (computing the gradient along one coordinate vs. along all coordinates). A minor difference is an extra $\log\frac{1}{\epsilon}$ factor in the case of only Lipschitz gradient, which can in fact be removed; it is only there when we are optimizing an objective with a regularization term ($\ell_1$ or $\ell_2$).
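Schematically, the total work of the two methods comes out roughly the same; for the smooth (non-strongly-convex) case, assuming one partial derivative costs about $1/n$ of a full gradient,
$$\underbrace{O\!\Big(\frac{n}{\epsilon}\Big)}_{\text{CD iterations}} \times \underbrace{O(1)}_{\text{one partial derivative}} \;\approx\; \underbrace{O\!\Big(\frac{1}{\epsilon}\Big)}_{\text{GD iterations}} \times \underbrace{O(n)}_{\text{one full gradient}}.$$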
Finally, a note about momentum for Coordinate Descent: it seems Nesterov recommends against using it, because of the computational cost of carrying out the momentum step.
3. Stochastic Gradient Descent
It is quite surprising to me that analyzing Stochastic Gradient Descent is much harder than analyzing Coordinate Descent. The two algorithms sound very similar; it is just that the former samples the data matrix vertically (rows, i.e. data points), while the latter samples it horizontally (columns, i.e. coordinates). SGD in fact works very well in practice; it is just that proving convergence results is harder. For strongly convex functions, it seems we only get a sublinear rate (as compared to linear for Gradient Descent), as seen in SGD for Machine Learning. For non-strongly convex functions, we get half that rate. Why??? What is the rate of SGD? To be discovered and written. If you have some ideas please comment.