Chapter 2: Optimization

Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".

What is optimization? Using algorithms to minimize a function. In machine learning, minimizing a loss function is central.

GD: Gradient descent

To minimize a function $f(x)$, we use the following update rule ($\eta$ is the step size / learning rate):

$$x_{t+1} = x_t - \eta \nabla f(x_t)$$
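A minimal sketch of this update in Python (the toy quadratic objective, step size, and iteration count are illustrative assumptions, not part of the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=100):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2 has gradient x; GD should drive x toward 0.
print(gradient_descent(lambda x: x, x0=[3.0, -4.0]))
```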

Smoothness assumption

(a) $\|\nabla^2 f(x)\| \le L$

(b) $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$, which gives $\langle \nabla f(x) - \nabla f(y), x - y\rangle \le L\|x - y\|^2$

(c) $|f(y) - f(x) - \langle \nabla f(x), y - x\rangle| \le \frac{L}{2}\|y - x\|^2$

These three are equivalent. Proof:

(a) $\Rightarrow$ (b): again Taylor-expand into integral form, since we want to derive a property of the first derivative from a property of the second derivative.

$$\begin{aligned}
\|\nabla f(w) - \nabla f(w')\| &= \left\|\int_0^1 \nabla^2 f(w' + t(w - w'))\,dt\,(w - w')\right\| \quad \text{(Taylor, integral form)}\\
&\le \int_0^1 \left\|\nabla^2 f(w' + t(w - w'))\right\|dt\;\|w - w'\| \quad \text{(triangle inequality)}\\
&\le L\|w - w'\| \quad \text{(a)}
\end{aligned}$$

(b) $\Rightarrow$ (c): same trick; Taylor-expand in integral form to drop down one order of derivatives.

$$\begin{aligned}
|f(w) - f(w') - \langle\nabla f(w'), w - w'\rangle| &= \left|\int_0^1 \langle\nabla f(w' + t(w - w')), w - w'\rangle\, dt - \langle\nabla f(w'), w - w'\rangle\right| \quad \text{(Taylor, integral form)}\\
&= \left|\int_0^1 \langle\nabla f(w' + t(w - w')) - \nabla f(w'), w - w'\rangle\, dt\right|\\
&\le \int_0^1 \|\nabla f(w' + t(w - w')) - \nabla f(w')\|\,\|w - w'\|\,dt \quad \text{(Cauchy–Schwarz)}\\
&\le \int_0^1 L t\|w - w'\|\,\|w - w'\|\,dt \quad \text{(b)}\\
&= L\|w - w'\|^2 \int_0^1 t\,dt = \frac{L}{2}\|w - w'\|^2
\end{aligned}$$

(c) $\Rightarrow$ (a):

Taylor-expand the left-hand side of (c) with the Lagrange remainder: for some $\theta \in (0, 1)$,

$$\frac{1}{2}(w - w')^T \nabla^2 f(w' + \theta(w - w'))(w - w') \le \frac{L}{2}\|w - w'\|^2$$

Assume $w = w' + tv$ with $\|v\| = 1$. Then

$$v^T \nabla^2 f(w' + \theta t v)\, v \le L$$

Let $t \to 0$ (so $w \to w'$); we get $v^T \nabla^2 f(w')\, v \le L$. Hence

$$\|\nabla^2 f(w')\| = \lambda_{\max}(\nabla^2 f(w')) = \max_{\|v\|=1} v^T \nabla^2 f(w')\, v \le L$$

Convexity assumption

(a) $\nabla^2 f(w) \succeq 0$

(b) $f(w') \ge f(w) + \langle\nabla f(w), w' - w\rangle$

(c) $f(tx + (1 - t)y) \le t f(x) + (1 - t) f(y)$

GD convergence guarantee

Using these two assumptions and the update rule, we can derive a guarantee for GD.

Lemma: combining L-smoothness with the update rule, we get

$$f(w_{i+1}) \le f(w_i) - \frac{\eta}{2}\|\nabla f(w_i)\|^2$$

Proof: use L-smoothness condition (c) and the update rule:

$$f(w_{i+1}) \le f(w_i) + \langle\nabla f(w_i), -\eta\nabla f(w_i)\rangle + \frac{L\eta^2}{2}\|\nabla f(w_i)\|^2 = f(w_i) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(w_i)\|^2$$

Set learning rate $\eta = \frac{1}{L}$; then we get:

$$\frac{1}{2L}\|\nabla f(w_i)\|^2 = \frac{\eta}{2}\|\nabla f(w_i)\|^2 \le f(w_i) - f(w_{i+1}) \le f(w_i) - f(w^*)$$

Begin with the lemma:

$$\begin{aligned}
f(w_{i+1}) &\le f(w_i) - \frac{\eta}{2}\|\nabla f(w_i)\|^2 \quad \text{($\nabla f(w_i)$ cannot telescope, so rewrite it)}\\
&= f(w_i) - \frac{1}{2\eta}\|w_i - w_{i+1}\|^2 \quad \text{(update rule)}\\
&\le f(w^*) - \langle \nabla f(w_i), w^* - w_i\rangle - \frac{1}{2\eta}\|w_i - w_{i+1}\|^2 \quad \text{(convex)}\\
&= f(w^*) + \frac{1}{\eta}\langle w_i - w_{i+1}, w_i - w^*\rangle - \frac{1}{2\eta}\|w_i - w_{i+1}\|^2 \quad \text{(update rule)}\\
&= f(w^*) - \frac{1}{2\eta}\left(\|w_i - w_{i+1}\|^2 - 2\langle w_i - w_{i+1}, w_i - w^*\rangle\right)\\
&= f(w^*) - \frac{1}{2\eta}\left(\|w_i - w_{i+1}\|^2 - 2\langle w_i - w_{i+1}, w_i - w^*\rangle + \|w_i - w^*\|^2\right) + \frac{1}{2\eta}\|w_i - w^*\|^2 \quad \text{(complete the square)}\\
&= f(w^*) + \frac{1}{2\eta}\|w_i - w^*\|^2 - \frac{1}{2\eta}\|w_{i+1} - w^*\|^2
\end{aligned}$$

Now telescope. The first inequality below holds because $f(w_i)$ is non-increasing:

$$T\left(f(w_T) - f(w^*)\right) \le \sum_{i=0}^{T-1} f(w_{i+1}) - T f(w^*) \le \frac{\|w_0 - w^*\|^2 - \|w_T - w^*\|^2}{2\eta} \le \frac{\|w_0 - w^*\|^2}{2\eta}$$
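A quick numerical sanity check of this bound on a toy quadratic (the matrix, horizon, and starting point are assumptions for illustration; here $w^* = 0$):

```python
import numpy as np

A = np.diag([0.5, 10.0])   # f(w) = 0.5 * w^T A w: convex, L-smooth with L = 10
L, T = 10.0, 100
eta = 1.0 / L

w0 = np.array([2.0, -1.0])
w = w0.copy()
for _ in range(T):
    w = w - eta * (A @ w)   # gradient of f is A w

f = lambda v: 0.5 * v @ A @ v
# f(w_T) - f(w*) should sit below ||w_0 - w*||^2 / (2 eta T).
print(f(w), w0 @ w0 / (2 * eta * T))
```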

μ-strongly convex assumption

(a) $\nabla^2 f(x) \succeq \mu I$

(b) $\langle\nabla f(x) - \nabla f(y), x - y\rangle \ge \mu\|x - y\|^2$, which gives $\|\nabla f(x) - \nabla f(y)\| \ge \mu\|x - y\|$

(c) $f(y) - f(x) - \langle\nabla f(x), y - x\rangle \ge \frac{\mu}{2}\|y - x\|^2$

As with L-smoothness above, there is one property each for the Hessian, the gradient, and the function itself. The proofs are similar and omitted here.

GD achieves a linear convergence rate on an L-smooth, μ-strongly convex function.

Now let us prove this claim. Set learning rate $\eta = \frac{1}{L}$.

$$\begin{aligned}
\|w_{T+1} - w^*\|^2 &= \|w_T - \eta\nabla f(w_T) - w^*\|^2 = \|w_T - w^*\|^2 - 2\eta\langle\nabla f(w_T), w_T - w^*\rangle + \eta^2\|\nabla f(w_T)\|^2\\
&\le \|w_T - w^*\|^2 + 2\eta\langle\nabla f(w_T), w^* - w_T\rangle + 2\eta\left(f(w_T) - f(w^*)\right) \quad \text{(lemma)}\\
&\le \|w_T - w^*\|^2 + 2\eta\left(f(w^*) - f(w_T) - \frac{\mu}{2}\|w_T - w^*\|^2\right) + 2\eta\left(f(w_T) - f(w^*)\right) \quad \text{(strongly convex)}\\
&= \left(1 - \frac{\mu}{L}\right)\|w_T - w^*\|^2
\end{aligned}$$

$$\|w_T - w^*\|^2 \le \left(1 - \frac{\mu}{L}\right)^T \|w_0 - w^*\|^2 \quad \text{(telescoping)}$$

$$f(w_T) - f(w^*) \le \langle\nabla f(w^*), w_T - w^*\rangle + \frac{L}{2}\|w_T - w^*\|^2 = \frac{L}{2}\|w_T - w^*\|^2 \quad (\nabla f(w^*) = 0)\ \le \frac{L}{2}\left(1 - \frac{\mu}{L}\right)^T\|w_0 - w^*\|^2$$
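And a check of the linear rate on an assumed toy strongly convex quadratic (again with $w^* = 0$): the squared distance should contract at least as fast as $(1 - \mu/L)^T$.

```python
import numpy as np

A = np.diag([1.0, 10.0])   # eigenvalues: mu = 1, L = 10
mu, L, T = 1.0, 10.0, 50
eta = 1.0 / L

w = np.array([1.0, 1.0])   # ||w_0 - w*||^2 = 2
for _ in range(T):
    w = w - eta * (A @ w)

# ||w_T - w*||^2 versus the guaranteed (1 - mu/L)^T * ||w_0 - w*||^2
print(np.linalg.norm(w) ** 2, (1 - mu / L) ** T * 2.0)
```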

SGD: Stochastic gradient descent

But gradient descent has two limitations:

  • Computing full gradient is slow for big data
  • Could get stuck at stationary points (saddle points & local minima)

So we add some randomness to the GD algorithm

$$x_{t+1} = x_t - \eta G_t,\qquad \frac{1}{\eta}\mathbb{E}[x_t - x_{t+1}] = \mathbb{E}[G_t] = \nabla L(w_t) = \frac{1}{N}\sum_i \nabla L(w, x_i, y_i)$$

$$V(G_t) = \mathbb{E}[\|G_t\|^2] - \|\nabla L(w_t)\|^2 \quad \text{(definition of variance)}$$

$$G_t = \frac{1}{|S|}\sum_{i\in S}\nabla L(w, x_i, y_i) \quad \text{for a randomly selected subset } S$$

$|S|$ is called the mini-batch size, or simply batch size. (A minimal sketch of this estimator follows the list below.)

  • If $|S|$ is too small, say $|S| = 1$, $G_t$ still has the right expectation, but with very large variance
  • If $|S|$ is too big, say $|S| = n$, $G_t$ is slow to compute
  • A small-but-not-too-small $|S|$ has at least two benefits:
    • It has some noise, which helps escape saddle points or bad local minima
    • It helps get the right mini-batch statistics for batch normalization (I don't fully understand this point)
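A sketch of the mini-batch estimator $G_t$ inside an SGD loop (the squared-loss model, batch size, and step size are assumed for illustration):

```python
import numpy as np

def minibatch_grad(w, X, y, batch_size, rng):
    """Unbiased estimate of the gradient of L(w) = 1/(2N) sum_i (x_i.w - y_i)^2,
    averaging over a randomly selected subset S of the data."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true

w = np.zeros(5)
for t in range(500):
    w -= 0.1 * minibatch_grad(w, X, y, batch_size=32, rng=rng)
print(np.linalg.norm(w - w_true))  # should be small: SGD approaches w*
```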

We can also view the SGD update rule as $x_{t+1} = x_t - \eta(\nabla L(w_t) + \xi_t)$: full batch plus white noise is also SGD.

We run SGD with bounded variance: $\mathrm{Var}(G_t) \le \sigma^2$.

convergence analysis

Because the update rule is different, we cannot start from the earlier lemma; we need a new one.

Lemma: combining L-smoothness with the update rule (and $\eta = \frac{1}{L}$), we get

$$\mathbb{E}[f(w_{i+1})] \le f(w_i) - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2}$$

Proof: use L-smoothness condition (c) and the update rule:

$$\begin{aligned}
\mathbb{E}[f(w_{i+1})] &\le \mathbb{E}\left[f(w_i) + \langle\nabla f(w_i), w_{i+1} - w_i\rangle + \frac{L}{2}\|w_{i+1} - w_i\|^2\right]\\
&= f(w_i) + \langle\nabla f(w_i), \mathbb{E}[w_{i+1} - w_i]\rangle + \frac{L}{2}\mathbb{E}[\|w_{i+1} - w_i\|^2] \quad \text{(linearity of expectation)}\\
&= f(w_i) - \eta\|\nabla f(w_i)\|^2 + \frac{\eta}{2}\mathbb{E}[\|G_i\|^2] \quad \text{(update rule, $\eta = \frac1L$)}\\
&= f(w_i) - \eta\|\nabla f(w_i)\|^2 + \frac{\eta}{2}\left(V(G_i) + \|\nabla f(w_i)\|^2\right) \quad \text{(definition of variance)}\\
&\le f(w_i) - \frac{\eta}{2}\|\nabla f(w_i)\|^2 + \frac{\eta\sigma^2}{2} \quad \text{(bounded-variance condition)}
\end{aligned}$$

SGD convergence guarantee

Begin with $\mathbb{E}[f(w_{i+1})] \le f(w_i) - \eta\|\nabla f(w_i)\|^2 + \frac{\eta}{2}\mathbb{E}[\|G_i\|^2]$ from the proof of the lemma:

$$\begin{aligned}
\mathbb{E}[f(w_{i+1})] &\le f(w_i) - \eta\|\nabla f(w_i)\|^2 + \frac{\eta}{2}\mathbb{E}[\|G_i\|^2]\\
&\le f(w_i) - \frac{\eta}{2}\mathbb{E}[\|G_i\|^2] + \eta\sigma^2 \quad \text{(definition of variance)}\\
&\le f(w^*) - \langle\nabla f(w_i), w^* - w_i\rangle - \frac{\eta}{2}\mathbb{E}[\|G_i\|^2] + \eta\sigma^2 \quad \text{(convex)}\\
&= f(w^*) + \frac{1}{\eta}\langle\mathbb{E}[w_i - w_{i+1}], w_i - w^*\rangle - \frac{1}{2\eta}\mathbb{E}[\|w_i - w_{i+1}\|^2] + \eta\sigma^2 \quad \text{(update rule)}\\
&= f(w^*) - \frac{1}{2\eta}\mathbb{E}\left[\|w_i - w_{i+1}\|^2 - 2\langle w_i - w_{i+1}, w_i - w^*\rangle\right] + \eta\sigma^2\\
&= f(w^*) - \frac{1}{2\eta}\mathbb{E}\left[\|w_i - w_{i+1}\|^2 - 2\langle w_i - w_{i+1}, w_i - w^*\rangle + \|w_i - w^*\|^2 - \|w_i - w^*\|^2\right] + \eta\sigma^2 \quad \text{(complete the square)}\\
&= f(w^*) + \frac{1}{2\eta}\mathbb{E}\left[\|w_i - w^*\|^2 - \|w_{i+1} - w^*\|^2\right] + \eta\sigma^2
\end{aligned}$$

Now telescope:

$$\sum_{i=0}^{T-1}\mathbb{E}[f(w_{i+1})] - T f(w^*) \le \frac{1}{2\eta}\mathbb{E}\left[\|w_0 - w^*\|^2 - \|w_T - w^*\|^2\right] + T\eta\sigma^2 \le \frac{\|w_0 - w^*\|^2}{2\eta} + T\eta\sigma^2$$

$$\mathbb{E}[f(\bar w)] - f(w^*) \le \mathbb{E}\left[\frac{1}{T}\sum_{i=0}^{T-1} f(w_{i+1})\right] - f(w^*) \le \frac{\|w_0 - w^*\|^2}{2\eta T} + \eta\sigma^2 \quad \text{(Jensen's inequality)}$$

Convergence rate $\frac{1}{\sqrt{T}}$

Proof: set $\eta = \frac{\epsilon}{2\sigma^2}$.

$$\mathbb{E}[f(\bar w)] - f(w^*) \le \frac{\|w_0 - w^*\|^2\sigma^2}{\epsilon T} + \frac{\epsilon}{2} \quad \text{(goal: make the right-hand side} = \epsilon)$$

So $T = \frac{2\|w_0 - w^*\|^2\sigma^2}{\epsilon^2}$ suffices, and the convergence rate is $\frac{1}{\sqrt{T}}$.

Thinking: it seems SGD is much slower than GD; why do we still use it?

  • The noise may not be that large in practice
  • Sometimes noise helps, especially in non-convex settings (we will see this later)
  • $\epsilon$ does not need to be that small: we do not care about training loss, but about population loss. Even with $\epsilon = 0.01$, the population loss is not guaranteed to be better than with $\epsilon = 0.1$
  • The variance of the noise can be reduced gradually along the process (making SGD as fast as GD)

If the variance $\sigma^2$ is the problem, can we make it smaller to get a faster convergence rate?

The answer is yes: we can reduce the variance and get a $\frac{1}{T}$ rate for convex functions, and linear convergence for strongly convex functions.

Unfortunately, it seems to work only for convex functions, and is not (yet) useful for deep learning.

There are many such algorithms, e.g., SAG, SVRG, SAGA.

SVRG

Assumption: L-smooth and μ-strongly convex.

Update rule

For $s = 1, 2, \ldots, T, \ldots$

  $\bar w = \bar w_{s-1}$ (init the start point of this step)

  $\nabla f(\bar w) = \frac{1}{N}\sum_{i=1}^N \nabla l_i(\bar w)$ (calculate the full gradient at the start point)

  $w_0 = \bar w$ (init the start point of the inner for-loop below)

  For $t = 1, 2, \ldots, m$

    Randomly pick $i \in [N]$ (randomly pick one data point's $l_i$ to compute a gradient)

    $w_t = w_{t-1} - \eta\left(\nabla l_i(w_{t-1}) - \nabla l_i(\bar w) + \nabla f(\bar w)\right)$ (update with this single gradient)

  $\bar w_s = w_m$ (after the inner loop ends, its endpoint becomes the start point of the next step)

Note: in the proof below, $\bar w_s = w_{\text{rand}(t)}$, a uniformly random inner-loop iterate. In practice, the version above ($\bar w_s = w_m$) is used.
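A runnable sketch of the loop above on least squares (the data, step size $\eta$, and epoch length $m$ are illustrative assumptions; it uses the practical choice $\bar w_s = w_m$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true                       # l_i(w) = 0.5 * (x_i.w - y_i)^2

def grad_i(w, i):                    # single-example gradient
    return X[i] * (X[i] @ w - y[i])

w_bar = np.zeros(d)
eta, m = 0.05, 2 * N
for s in range(30):                              # outer loop (epochs)
    full_grad = X.T @ (X @ w_bar - y) / N        # full gradient at the snapshot
    w = w_bar.copy()
    for t in range(m):                           # inner loop
        i = rng.integers(N)
        w -= eta * (grad_i(w, i) - grad_i(w_bar, i) + full_grad)
    w_bar = w                                    # practical choice: last iterate
print(np.linalg.norm(w_bar - w_true))            # should be tiny
```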

thinking

We compute the full gradient only once per outer step (an epoch of $m$ inner updates). So although a full gradient is expensive, the amortized cost is small.

We rely on the intuition $\nabla f(w_t) - \nabla f(\bar w) \approx \nabla l_i(w_t) - \nabla l_i(\bar w)$.

Convergence analysis

First, let us study one inner-loop step and its effect.

$$\mathbb{E}\|w_t - w^*\|^2 = \mathbb{E}\|w_{t-1} - \eta u_t - w^*\|^2 \quad \text{(update rule)} = \|w_{t-1} - w^*\|^2 - 2\langle w_{t-1} - w^*, \eta\mathbb{E}[u_t]\rangle + \eta^2\mathbb{E}[\|u_t\|^2]$$

where $u_t = \nabla l_i(w_{t-1}) - \nabla l_i(\bar w) + \nabla f(\bar w)$ is the variance-reduced gradient.

The index $i$ in the inner loop is random; to remove this randomness, we take expectations. Once the $\mathbb{E}$ is in place, we have much more room to maneuver.

$$\mathbb{E}[u_t] = \mathbb{E}\left[\nabla l_i(w_{t-1}) - \nabla l_i(\bar w) + \nabla f(\bar w)\right] = \mathbb{E}[\nabla l_i(w_{t-1})] - \mathbb{E}[\nabla l_i(\bar w)] + \nabla f(\bar w) = \nabla f(w_{t-1})$$

As for $\mathbb{E}[\|u_t\|^2]$, we compute it separately. Goal: find an upper bound that contains no expectation.

$$\begin{aligned}
\mathbb{E}[\|u_t\|^2] &= \mathbb{E}\left[\left\|\nabla l_i(w_{t-1}) - \nabla l_i(\bar w) + \nabla f(\bar w)\right\|^2\right]\\
&\le 2\,\mathbb{E}\left[\left\|\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\right\|^2\right] + 2\,\mathbb{E}\left[\left\|\nabla l_i(\bar w) - \nabla l_i(w^*) - \nabla f(\bar w)\right\|^2\right] \quad \text{(insert $w^*$, then $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$)}\\
&= 2\,\mathbb{E}\left[\left\|\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\right\|^2\right] + 2\,\mathbb{E}\left[\left\|\nabla l_i(\bar w) - \nabla l_i(w^*) - \mathbb{E}[\nabla l_i(\bar w)]\right\|^2\right] \quad \text{(rewrite the full gradient as an expectation of single gradients)}\\
&= 2\,\mathbb{E}\left[\left\|\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\right\|^2\right] + 2\,\mathbb{E}\left[\left\|\nabla l_i(\bar w) - \nabla l_i(w^*) - \mathbb{E}[\nabla l_i(\bar w) - \nabla l_i(w^*)]\right\|^2\right] \quad \text{(since $\mathbb{E}[\nabla l_i(w^*)] = \nabla f(w^*) = 0$)}\\
&\le 2\,\mathbb{E}\left[\left\|\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\right\|^2\right] + 2\,\mathbb{E}\left[\left\|\nabla l_i(\bar w) - \nabla l_i(w^*)\right\|^2\right] \quad \text{(since $V(X) = \mathbb{E}[\|X - \mathbb{E}X\|^2] = \mathbb{E}[\|X\|^2] - \|\mathbb{E}X\|^2 \le \mathbb{E}[\|X\|^2]$)}\\
&\le 4L\left(f(w_{t-1}) - f(w^*) + f(\bar w) - f(w^*)\right)
\end{aligned}$$

(The last step uses the result $\mathbb{E}[\|\nabla l_i(w) - \nabla l_i(w^*)\|^2] \le 2L(f(w) - f(w^*))$; see shiys22's blog post "[机器学习] 2. 随机方差缩减梯度下降 SVRG" on cnblogs.com. Proving this result requires L-smoothness.)

Apart from that cited last step, none of the bounds above use smoothness or convexity.

Now substitute back into the earlier expression:

$$\begin{aligned}
\mathbb{E}\|w_t - w^*\|^2 &= \|w_{t-1} - w^*\|^2 - 2\langle w_{t-1} - w^*, \eta\mathbb{E}[u_t]\rangle + \eta^2\mathbb{E}[\|u_t\|^2]\\
&\le \|w_{t-1} - w^*\|^2 - 2\eta\langle w_{t-1} - w^*, \nabla f(w_{t-1})\rangle + 4\eta^2 L\left(f(w_{t-1}) - f(w^*) + f(\bar w) - f(w^*)\right)\\
&\le \|w_{t-1} - w^*\|^2 - 2\eta\left(f(w_{t-1}) - f(w^*)\right) + 4\eta^2 L\left(f(w_{t-1}) - f(w^*) + f(\bar w) - f(w^*)\right) \quad \text{(convex)}\\
&= \|w_{t-1} - w^*\|^2 + 2\eta(2\eta L - 1)\left(f(w_{t-1}) - f(w^*)\right) + 4\eta^2 L\left(f(\bar w) - f(w^*)\right)
\end{aligned}$$

Rearranging:

$$2\eta(1 - 2\eta L)\left(f(w_{t-1}) - f(w^*)\right) \le 4\eta^2 L\left(f(\bar w) - f(w^*)\right) + \|w_{t-1} - w^*\|^2 - \mathbb{E}\|w_t - w^*\|^2$$

That is, $\mathbb{E}\|w_t - w^*\|^2 \le \|w_{t-1} - w^*\|^2 + \cdots$, so clearly we can telescope, over the whole inner loop of course.

There are some subtleties about where the $\mathbb{E}$'s get added; let me explain briefly.

The proof above considered a single inner-loop step $w_{t-1} \to w_t$, so $\|w_{t-1} - w^*\|^2$ carried no $\mathbb{E}$. Now that we telescope over the whole inner loop, we must take expectations everywhere.

However, over one whole inner loop, $f(\bar w)$ is fixed at initialization.

Summing the left-hand side:

$$\sum_{t=1}^m \mathbb{E}\left[f(w_{t-1}) - f(w^*)\right] = m\left(\mathbb{E}[f(\bar w_s)] - f(w^*)\right) = m\,\mathbb{E}\left[f(\bar w_s) - f(w^*)\right] \quad (\bar w_s = w_{\text{rand}(t)})$$

Summing the right-hand side:

$$\begin{aligned}
\sum_{t=1}^m &\left[4\eta^2 L\left(f(\bar w) - f(w^*)\right) + \mathbb{E}\|w_{t-1} - w^*\|^2 - \mathbb{E}\|w_t - w^*\|^2\right]\\
&= 4\eta^2 L m\left(f(\bar w) - f(w^*)\right) + \mathbb{E}\|w_0 - w^*\|^2 - \mathbb{E}\|w_m - w^*\|^2\\
&\le 4\eta^2 L m\left(f(\bar w) - f(w^*)\right) + \|w_0 - w^*\|^2\\
&= 4\eta^2 L m\left(f(\bar w) - f(w^*)\right) + \|\bar w - w^*\|^2 \quad (w_0 = \bar w)\\
&\le 4\eta^2 L m\left(f(\bar w) - f(w^*)\right) + \frac{2}{\mu}\left(f(\bar w) - f(w^*)\right) \quad \text{(strongly convex)}\\
&= \left(4\eta^2 L m + \frac{2}{\mu}\right)\left(f(\bar w) - f(w^*)\right)
\end{aligned}$$

Next we telescope over the outer loop, from $\bar w$ to $\bar w_s$, so $f(\bar w)$ also needs an $\mathbb{E}$:

$$2\eta(1 - 2\eta L)\,m\,\mathbb{E}\left[f(\bar w_s) - f(w^*)\right] \le \left(4\eta^2 L m + \frac{2}{\mu}\right)\mathbb{E}\left[f(\bar w) - f(w^*)\right]$$

Choosing, e.g., $\eta = \Theta(1/L)$ and $m = \Theta(L/\mu)$ makes the ratio of the two sides a constant less than 1, so this is linear convergence, faster than GD when $\frac{L}{\mu}$ (the condition number) is large.

Linear coupling of GD & MD

What did we find in GD? $f(w_t) - f(w_{t+1}) \ge \frac{1}{2L}\|\nabla f(w_t)\|^2$. Therefore, if the gradient is big, GD makes progress. What if the gradient is small?

We want to have another optimization method to make use of a small gradient and still converge to the optimal point.

$$\forall x,\quad f(x) \ge f(x_0) + \langle\nabla f(x_0), x - x_0\rangle \quad \text{(convex)}$$

If we have a sequence of queried points, then:

$$\forall x,\quad f(x) \ge \frac{1}{t}\sum_{i=0}^{t-1}\left(f(x_i) + \langle\nabla f(x_i), x - x_i\rangle\right) \ge f(\bar x) + \frac{1}{t}\sum_{i=0}^{t-1}\langle\nabla f(x_i), x - x_i\rangle \quad \text{(convex)}$$

$$\Rightarrow\quad f(\bar x) - f(x^*) \le \frac{1}{t}\sum_{i=0}^{t-1}\langle\nabla f(x_i), x_i - x^*\rangle$$

The quantity on the right-hand side is called the regret. Observe that the smaller the gradients, the smaller the regret and the tighter this bound, which is exactly our goal: make use of small gradients and still converge to the optimal point.

The optimization method is then obvious: every point we pick must keep the regret small.

MD: mirror descent

As a first-order method, we can still only search along the gradient direction. But the lower-bounding hyperplane given by the gradient extends to negative infinity, which is certainly not what we want; we want to take a small step. How does GD solve this problem?

$$x_{t+1} = \arg\min_x\left(f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \frac{1}{2\eta}\|x - x_t\|^2\right) \quad \text{(add a regularizer)} = \arg\min_x\left(\eta\langle\nabla f(x_t), x - x_t\rangle + \frac{1}{2}\|x - x_t\|^2\right)$$

With the regularizer $\frac{1}{2}\|x - x_t\|^2$, setting the gradient of the objective to zero gives $\eta\nabla f(x_t) + (x - x_t) = 0$, i.e. exactly $x_{t+1} = x_t - \eta\nabla f(x_t)$.

Now consider generalizing this regularizer; this leads to the Bregman divergence:

$$V_x(y) = w(y) - \langle\nabla w(x), y - x\rangle - w(x)$$

Here $w(x)$ is a 1-strongly convex function, so $V_x(y) \ge \frac{1}{2}\|y - x\|^2$. (If $w(x)$ is the simplest 1-strongly convex function, $\frac{\|x\|^2}{2}$, the regularizer is exactly GD's $\frac{1}{2}\|x - x_t\|^2$.)

The update rule of MD is:

$$x_{k+1} = \arg\min_x\left(\alpha\langle\nabla f(x_k), x - x_k\rangle + V_{x_k}(x)\right)$$
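A sketch for one classic instantiation (an assumption for illustration, not the only choice): take $w(x)$ to be the negative entropy over the probability simplex, for which the Bregman divergence is the KL divergence and the argmin has a closed form (exponentiated gradient):

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha=0.1, steps=200):
    """Mirror descent with w(x) = sum_j x_j log x_j over the simplex.
    V_x(y) is then KL(y || x), and the update
    argmin_x alpha*<grad, x - x_k> + V_{x_k}(x)
    has the closed form x_{k+1} proportional to x_k * exp(-alpha * grad)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-alpha * grad_f(x))
        x /= x.sum()                 # renormalize onto the simplex
    return x

# Example: minimize the linear function <c, x> over the simplex;
# the optimum puts all mass on the smallest coordinate of c.
c = np.array([3.0, 1.0, 2.0])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3))
```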

convergence

Lemma (triangle equality, i.e. the three-point identity):

$$\begin{aligned}
\langle -\nabla V_{x_1}(x_2), x_2 - x_3\rangle &= \langle\nabla w(x_1) - \nabla w(x_2), x_2 - x_3\rangle \quad \text{(since $\nabla V_{x_1}(x_2) = \nabla w(x_2) - \nabla w(x_1)$)}\\
&= -\langle\nabla w(x_1), x_3 - x_1\rangle + \langle\nabla w(x_1), x_2 - x_1\rangle + \langle\nabla w(x_2), x_3 - x_2\rangle\\
&= \left(w(x_3) - \langle\nabla w(x_1), x_3 - x_1\rangle - w(x_1)\right)\\
&\quad - \left(w(x_2) - \langle\nabla w(x_1), x_2 - x_1\rangle - w(x_1)\right)\\
&\quad - \left(w(x_3) - \langle\nabla w(x_2), x_3 - x_2\rangle - w(x_2)\right) \quad \text{(add and subtract the $w(\cdot)$ terms)}\\
&= V_{x_1}(x_3) - V_{x_1}(x_2) - V_{x_2}(x_3)
\end{aligned}$$

$$\begin{aligned}
\alpha\left(f(x_k) - f(x^*)\right) &\le \alpha\langle\nabla f(x_k), x_k - x^*\rangle \quad \text{(convex)}\\
&= \alpha\langle\nabla f(x_k), x_k - x_{k+1}\rangle + \alpha\langle\nabla f(x_k), x_{k+1} - x^*\rangle \quad \text{(split the term)}\\
&= \alpha\langle\nabla f(x_k), x_k - x_{k+1}\rangle + \langle -\nabla V_{x_k}(x_{k+1}), x_{k+1} - x^*\rangle \quad \text{(at $x_{k+1}$, $\nabla\left(\alpha\langle\nabla f(x_k), x - x_k\rangle + V_{x_k}(x)\right) = 0$)}\\
&= \alpha\langle\nabla f(x_k), x_k - x_{k+1}\rangle + V_{x_k}(x^*) - V_{x_{k+1}}(x^*) - V_{x_k}(x_{k+1}) \quad \text{(Lemma: triangle equality)}\\
&\le \alpha\langle\nabla f(x_k), x_k - x_{k+1}\rangle - \frac{1}{2}\|x_k - x_{k+1}\|^2 + V_{x_k}(x^*) - V_{x_{k+1}}(x^*) \quad \text{(1-strongly convex)}\\
&\le \frac{\alpha^2}{2}\|\nabla f(x_k)\|^2 + V_{x_k}(x^*) - V_{x_{k+1}}(x^*) \quad \text{(complete the square; for inner products, try completing the square first, convexity second)}
\end{aligned}$$

The last two terms telescope:

$$\alpha T\left(f(\bar x) - f(x^*)\right) \le \alpha\left(\sum_{k=0}^{T-1} f(x_k) - T f(x^*)\right) \le \frac{\alpha^2}{2}\sum_{k=0}^{T-1}\|\nabla f(x_k)\|^2 + V_{x_0}(x^*) - V_{x_T}(x^*) \le \frac{\alpha^2}{2}\sum_{k=0}^{T-1}\|\nabla f(x_k)\|^2 + V_{x_0}(x^*)$$

$$f(\bar x) - f(x^*) \le \frac{\alpha\rho^2}{2} + \frac{V_{x_0}(x^*)}{\alpha T} \quad \text{(assume bounded gradients: } \|\nabla f(x_k)\| \le \rho)$$

$$f(\bar x) - f(x^*) \le \frac{\alpha\rho^2}{2} + \frac{\theta}{\alpha T} \quad \text{(bound } V_{x_0}(x^*) \le \theta)$$

Let $\alpha = \sqrt{\frac{2\theta}{\rho^2 T}}$; then:

$$f(\bar x) - f(x^*) \le \rho\sqrt{\frac{2\theta}{T}} = \epsilon \quad\Rightarrow\quad T = \Omega\left(\frac{\rho^2}{\epsilon^2}\right)$$

So MD has a $\frac{1}{\sqrt{T}}$ convergence rate.

Linear coupling

Can we combine GD and MD?

Wishful thinking: assume $\|\nabla f(x)\|$ is either always larger than a threshold $\rho$, or always smaller than $\rho$.

Suppose we need to go from $f(x_0) - f(x^*) \le 2\epsilon$ down to $f(x_T) - f(x^*) \le \epsilon$.

  • Run MD: it needs $\Omega(\rho^2/\epsilon^2)$ steps
  • Run GD: it decreases $f$ by $\|\nabla f\|^2/2L \ge \rho^2/2L$ per step, so it needs $2L\epsilon/\rho^2$ steps

Combining the two, $T \ge \Omega\left(\max\{\rho^2/\epsilon^2,\ 2L\epsilon/\rho^2\}\right)$. Tuning $\rho^2 = \epsilon\sqrt{L\epsilon}$, we get $T \ge \Omega(\sqrt{L/\epsilon})$, i.e. a convergence rate of $\frac{1}{T^2}$.

But this is an idealized situation; we rarely meet this kind of function. For an arbitrary function, if GD and MD would go to two different points, how shall we choose?

This is why we use linear coupling: in every iteration, we run GD and MD simultaneously to get $y_k$ and $z_k$ respectively.

$$x_0 = y_0 = z_0$$

$$y_k = \text{GD}(x_k)$$

$$z_k = \text{Mirr}_{z_{k-1}}\left(\alpha\nabla f(x_k)\right) = \arg\min_y\left(V_{z_{k-1}}(y) + \alpha\langle\nabla f(x_k), y - z_{k-1}\rangle\right)$$

$$x_{k+1} = \tau z_k + (1 - \tau)y_k$$

α and τ are hyperparameters to be decided later.
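A minimal Euclidean sketch of the coupled iteration (with $w(x) = \|x\|^2/2$, the mirror step is just $z - \alpha\nabla f$; the fixed $\alpha$, the toy quadratic, and the step count are assumptions for illustration):

```python
import numpy as np

def linear_coupling(grad_f, x0, L, alpha, steps=100):
    """Run a GD step and a (Euclidean) mirror step side by side,
    and couple them linearly into the next query point."""
    tau = 1.0 / (alpha * L + 1)          # chosen so (1 - tau)/tau = alpha * L
    x = y = z = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = tau * z + (1 - tau) * y      # linear coupling
        g = grad_f(x)
        y = x - g / L                    # GD step from x
        z = z - alpha * g                # mirror step from z
    return y

# Example: f(x) = 0.5 * x^T A x with L = 10.
A = np.diag([1.0, 10.0])
print(linear_coupling(lambda x: A @ x, [1.0, 1.0], L=10.0, alpha=0.3))
```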

$$\begin{aligned}
\alpha\left(f(x_{k+1}) - f(x^*)\right) &\le \alpha\langle\nabla f(x_{k+1}), x_{k+1} - x^*\rangle \quad \text{(convex)}\\
&= \alpha\langle\nabla f(x_{k+1}), x_{k+1} - z_k\rangle + \alpha\langle\nabla f(x_{k+1}), z_k - x^*\rangle \quad \text{(split the term; for the second term: zero gradient at the minimizer, triangle equality, strong convexity, complete the square)}\\
&\le \alpha\langle\nabla f(x_{k+1}), x_{k+1} - z_k\rangle + \left(\frac{\alpha^2}{2}\|\nabla f(x_{k+1})\|^2 + V_{z_k}(x^*) - V_{z_{k+1}}(x^*)\right) \quad \text{(MD)}\\
&\le \alpha\langle\nabla f(x_{k+1}), x_{k+1} - z_k\rangle + \left(\alpha^2 L\left(f(x_{k+1}) - f(y_{k+1})\right) + V_{z_k}(x^*) - V_{z_{k+1}}(x^*)\right) \quad \text{(GD)}\\
&= \frac{(1 - \tau)\alpha}{\tau}\langle\nabla f(x_{k+1}), y_k - x_{k+1}\rangle + (\ldots) \quad \text{(linear coupling: } x_{k+1} - z_k = \tfrac{1-\tau}{\tau}(y_k - x_{k+1}))\\
&\le \frac{(1 - \tau)\alpha}{\tau}\left(f(y_k) - f(x_{k+1})\right) + (\ldots) \quad \text{(convex)}
\end{aligned}$$

We can choose $\tau$ such that $\frac{1 - \tau}{\tau} = \alpha L$; then (with $\tau$ fixed, the only remaining freedom is $\alpha$):

$$\begin{aligned}
\alpha\left(f(x_{k+1}) - f(x^*)\right) &\le \alpha^2 L\left(f(y_k) - f(x_{k+1})\right) + \left(\alpha^2 L\left(f(x_{k+1}) - f(y_{k+1})\right) + V_{z_k}(x^*) - V_{z_{k+1}}(x^*)\right)\\
&= \alpha^2 L\left(f(y_k) - f(y_{k+1})\right) + V_{z_k}(x^*) - V_{z_{k+1}}(x^*)
\end{aligned}$$

Then this telescopes!

$$\begin{aligned}
\alpha T\left(f(\bar x) - f(x^*)\right) &\le \alpha\sum_{k=0}^{T-1}\left(f(x_{k+1}) - f(x^*)\right) \quad \text{(Jensen's inequality)}\\
&\le \alpha^2 L\left(f(y_0) - f(y_T)\right) + V_{z_0}(x^*) - V_{z_T}(x^*)\\
&\le \alpha^2 L d + \theta \quad \text{(assume } f(y_0) - f(y_T) \le d,\ V_{z_0}(x^*) \le \theta)\\
f(\bar x) - f(x^*) &\le \frac{\alpha^2 L d + \theta}{\alpha T} = \frac{1}{T}\left(L d\alpha + \frac{\theta}{\alpha}\right)
\end{aligned}$$

We can choose $\alpha = \sqrt{\theta/(Ld)}$; then:

$$f(\bar x) - f(x^*) \le \frac{2\sqrt{L\theta d}}{T}$$

It seems the convergence rate is $\frac{1}{T}$, but this rate depends on $d$. Restart repeatedly, halving the gap each stage: $d \to \frac{d}{2} \to \cdots \to 2\epsilon \to \epsilon$. The total number of steps is

$$O\left(\sum_{i=0}^{\log(d/\epsilon)}\sqrt{\frac{4L\theta}{2^i\epsilon}}\right) = O\left(\sqrt{\frac{4L\theta}{\epsilon}}\sum_{i=0}^{\log(d/\epsilon)}\left(\frac{1}{\sqrt{2}}\right)^i\right) = O\left(\sqrt{\frac{L\theta}{\epsilon}}\right)$$

Matrix completion

For a matrix $A$, some entries are known while others are unknown, and we want to predict the unknown entries. Namely, we want to find a low-rank $\hat A$ that matches $A$ best on the known entries.

Assumption

  • low rank

  • The known entries are uniformly distributed in their positions.

  • Incoherence (like the smoothness assumption in unsupervised learning)

    • We hope all coordinates are comparable, rather than only a few of them being too large or too small.
    • Intuitively, the "gradient" between entries cannot be too sharp.
    • $\frac{1}{\sqrt{n}}(1, 1, \ldots, 1)$ is better than $(1, 0, \ldots, 0)$.
    • Mathematically (a small measuring sketch follows this list): $A = U\Sigma V^T$ with $U \in \mathbb{R}^{n\times r}$, $V \in \mathbb{R}^{m\times r}$, and $\|e_i^T U\| \le u\sqrt{\frac{r}{n}}\ \forall i \in [n]$, $\|e_j^T V\| \le u\sqrt{\frac{r}{m}}\ \forall j \in [m]$, where $1 \le u \le \sqrt{\min\{n, m\}/r}$.

    • Here $U \in \mathbb{R}^{n\times r}$, $V \in \mathbb{R}^{m\times r}$ (note: we do not need to complete $U$ and $V$ into square matrices).
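A small sketch that measures the incoherence parameter $u$ of a given matrix via its rank-$r$ SVD (the function name and the all-ones test matrix are mine, for illustration):

```python
import numpy as np

def incoherence(A, r):
    """Smallest u with ||e_i^T U|| <= u*sqrt(r/n) and ||e_j^T V|| <= u*sqrt(r/m),
    where U, V come from the rank-r SVD of A."""
    n, m = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    u_left = np.linalg.norm(U, axis=1).max() / np.sqrt(r / n)
    u_right = np.linalg.norm(V, axis=1).max() / np.sqrt(r / m)
    return max(u_left, u_right)

# The all-ones matrix spreads its mass evenly, so u = 1 (best case).
print(incoherence(np.ones((8, 4)), r=1))
```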

algorithm

Convex relaxation:

  • How would we compute $\text{rank}(A)$ naively? $A = U\Sigma V^T$, and $\text{rank}(A) = \|\Sigma\|_0$.
  • But minimizing $\text{rank}(A)$ this way is hard, because the zero norm is not a good function (not convex).
  • So relax the zero norm to an L1 norm: minimize the nuclear norm $\|A\|_* = \|\Sigma\|_1$.

This is equivalent to a convex SDP problem.

However! SDP solvers cannot handle large matrices, say larger than 100×100.

Alternating least squares minimization

Use $\Omega$ to denote the mask of known entries; $P_\Omega(A)$ keeps the known entries of $A$ and fills the unknown positions with 0.
Assume the solution is $A = UV^T$.

$$U, V = \arg\min_{U \in \mathbb{R}^{n\times r},\ V \in \mathbb{R}^{m\times r}} \left\|P_\Omega(UV^T) - P_\Omega(A)\right\|^2$$

The loss is nonconvex; one can easily find multiple minimizers, e.g. $(U, V) \to (\frac{U}{2}, 2V)$ leaves $UV^T$ unchanged.

$$V_{t+1} = \arg\min_V \left\|P_\Omega(U_t V^T) - P_\Omega(A)\right\|^2$$

$$U_{t+1} = \arg\min_U \left\|P_\Omega(U V_{t+1}^T) - P_\Omega(A)\right\|^2$$

Every subproblem is convex and can be solved efficiently.
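A minimal ALS sketch (the random rank-2 ground truth, the 50% observation mask, and the tiny ridge term lam are illustrative assumptions; lam only keeps each small solve well posed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 20, 2
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))  # rank-2 ground truth
mask = rng.random((n, m)) < 0.5                        # Omega: observed entries

U = rng.normal(size=(n, r))
V = rng.normal(size=(m, r))
lam = 1e-6

for it in range(50):
    # Fix U: each row of V solves a small least-squares problem over
    # the observed entries of the corresponding column of A.
    for j in range(m):
        rows = mask[:, j]
        Uj = U[rows]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(r), Uj.T @ A[rows, j])
    # Fix V: symmetric solve for each row of U.
    for i in range(n):
        cols = mask[i]
        Vi = V[cols]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(r), Vi.T @ A[i, cols])

print(np.abs(U @ V.T - A).max())  # should be small: unknown entries recovered
```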

Non-convex Optimization

Is a convex landscape the only good case for GD? No.

Another important case: no spurious local minima.

  • No spurious local minima: if all the local minima are equally good, it suffices to find any one of them (even though the problem is nonconvex)!

Escape saddle point

For a stationary point, $\nabla L(w) = 0$:

  • If $\nabla^2 L(w) \succeq 0$, it's a local minimum or a flat saddle point. The latter means we have no good direction to get out of the saddle point, so all first-order algorithms will fail.
  • If $\nabla^2 L(w)$ has both positive and negative eigenvalues, it's a strict saddle point, which means there is at least one direction to escape along.
  • We call a function "strict saddle" if it does not contain any flat saddle points.

An informal Theorem

SGD escapes all saddle points and local maxima, and converges to a local minimum after a polynomial number of steps, under the following assumptions:

  • $L(w)$ is smooth, bounded, and strict saddle
  • $\nabla^2 L(w)$ is smooth
  • The SGD noise has non-negligible variance in every direction with constant probability

Significance: by this theorem, if all minima are equally good and the function satisfies the conditions above, then SGD converges to a global minimum.

proof

For the current point $w_0$, we discuss what happens with SGD.

Case 1: $\|\nabla f(w_0)\|$ is big

Since the gradient is big, the noise will not affect the decreasing trend of the function value.

Case 2: $\|\nabla f(w_0)\|$ is small

Subcase 1: we are close to a local minimum

A local minimum is a trap: although the gradient is noisy, the probability of escaping is very small. This guarantees we do not leave the minimum after converging to it.

Subcase 2: we are close to a saddle point

By the strict saddle property there must be a direction of negative curvature. A random perturbation then has a positive projection onto this negative direction; following it, we escape the saddle.

Escape local minimum

If the local minimum's basin is too wide, then the noise makes no difference.

But if the basin is sharp and narrow, the noise helps.

Informal Proof

Update rule:

$$w_{t+1} = w_t - \eta\left(\nabla L(w_t) - \omega_t\right),\qquad y_t = w_t - \eta\nabla L(w_t) \quad (y_t\text{ is what plain GD would give})$$

So we can get

$$y_{t+1} = w_{t+1} - \eta\nabla L(w_{t+1}) = y_t + \eta\omega_t - \eta\nabla L(y_t + \eta\omega_t)$$

$$\mathbb{E}[y_{t+1}] = \mathbb{E}[y_t] - \eta\,\mathbb{E}\left[\nabla L(y_t + \eta\omega_t)\right] \quad (\mathbb{E}[\omega_t] = 0)$$

In expectation, this is like doing GD with the gradient averaged over a neighborhood, like performing a convolution on the gradient.
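A tiny 1-D illustration of this smoothing view (the oscillatory objective and the noise scale are assumptions): averaging the gradient over the noise washes out sharp, narrow wiggles and leaves the smooth trend.

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = x^2 - cos(50 x): a smooth bowl plus sharp narrow traps.
    return 2 * x + 50 * np.sin(50 * x)

def smoothed_grad(x, sigma, rng, samples=100000):
    """Monte Carlo estimate of E[grad(x + sigma * xi)], xi ~ N(0, 1):
    the convolution-like gradient that SGD follows in expectation."""
    return grad(x + sigma * rng.normal(size=samples)).mean()

rng = np.random.default_rng(0)
x = 0.3
# The raw gradient oscillates wildly; the smoothed one stays close to
# the smooth component 2x = 0.6.
print(grad(x), smoothed_grad(x, sigma=0.2, rng=rng))
```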