Chapter 2: Optimization

Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".

What is optimization? Using algorithms to minimize a function. In machine learning, what we care about most is minimizing a loss function.

GD: Gradient descent

To minimize a function \(f(x)\), we use the following update rule (\(\eta\) is the step size / learning rate):

\[x_{t+1} = x_{t} -\eta \nabla f(x_t) \]
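As a concrete illustration, here is a minimal gradient-descent sketch in Python; the quadratic objective, step size, and iteration count are illustrative choices, not part of the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_steps=100):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
    return x

# Illustrative example: f(x) = 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)

x_hat = gradient_descent(grad, x0=np.zeros(2), eta=0.05, num_steps=500)
print(x_hat, np.linalg.solve(A, b))  # GD iterate vs. the exact minimizer
```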

Smoothness assumption

(a) \(\Vert \nabla ^2f(x)\Vert \le L\)

(b) \(\Vert \nabla f(x) - \nabla f(y) \Vert \le L\Vert x - y \Vert \xrightarrow{} \langle \nabla f(x) - \nabla f(y), x-y \rangle\le L\Vert x - y \Vert^2\)

(c) \(|f(y)-f(x)-\langle \nabla f(x), y-x\rangle| \le \frac{L}{2}\Vert y -x \Vert^2\)

These three conditions are equivalent. Proof:

(a) \(\xrightarrow{}\) (b): Again write the Taylor expansion in integral form, since we want to turn a property of the second derivative into a property of the first derivative.

\[\begin{align*} &\Vert \nabla f(w') - \nabla f(w)\Vert \\ = ~&\Vert \int_{0}^{1}\nabla^2f(w+t(w'-w))dt \cdot (w'-w)\Vert~~(\text{Taylor expansion in integral form})\\ \le ~&\Vert \int_{0}^{1}\nabla^2f(w+t(w'-w))dt\Vert \Vert(w'-w)\Vert~~(\text{operator-norm inequality})\\ \le ~&L\Vert w'-w \Vert~~(\text{condition (a)})\\ \end{align*} \]

(b) \(\xrightarrow{}\) (c): Same pattern: write the Taylor expansion in integral form to go down one order.

\[\begin{align*} &|f(w') - f(w) - \langle \nabla f(w), w'-w \rangle | \\ = ~&|\langle \int_{0}^{1}\nabla f(w+t(w'-w)) dt, w'-w \rangle - \langle \nabla f(w), w'-w \rangle |~~(\text{Taylor expansion in integral form})\\ = ~&|\int_{0}^{1}\langle \nabla f(w+t(w'-w))-\nabla f(w), w'-w \rangle dt|\\ \le ~&\int_{0}^{1} \Vert \nabla f(w+t(w'-w))-\nabla f(w) \Vert\Vert w'-w \Vert dt~~(\text{Cauchy–Schwarz})\\ \le ~&\int_{0}^{1} L\Vert t(w'-w)\Vert \Vert w'-w \Vert dt~~(\text{condition (b)})\\ = ~& L\Vert w'-w \Vert^2 \int_{0}^{1} t dt ~=\frac{L}{2}\Vert w'-w \Vert^2 \end{align*} \]

(c) \(\xrightarrow{}\) (a):

By Taylor's theorem with the Lagrange remainder, (c) gives (for some \(\theta \in (0,1)\)):

\[\begin{equation*} \frac{1}{2}(w'-w)^T\nabla^2 f(w+\theta(w'-w))(w'-w) \le \frac{L}{2}\Vert w'-w \Vert ^2 \end{equation*} \]

Let \(w'=w +tv\) with \(\Vert v \Vert = 1\); dividing both sides by \(\frac{t^2}{2}\) gives

\[\begin{equation*} v^T\nabla^2 f(w + \theta tv)v \le L \end{equation*} \]

Let \(t \xrightarrow{} 0\) (so \(w'\xrightarrow{} w\)); we get \(v^T\nabla^2 f(w)v \le L\).

\[\begin{equation*} \Vert\nabla^2 f(w)\Vert = \lambda_{max}(\nabla^2 f(w)) = \mathop{\max}_{\Vert v \Vert =1}v^T\nabla^2 f(w)v\le L \end{equation*} \]

Convexity assumption

(a) \(\nabla^2 f(w) \ge 0\)

(b) \(f(w') \ge f(w) + \langle \nabla f(w), w'-w \rangle\)

(c) \(f(tx+(1-t)y) \le tf(x) + (1-t)f(y)\)

GD convergence guarantee

Using these two assumptions (L-smoothness and convexity) together with the update rule, we can derive a convergence guarantee for GD.

Lemma: Combining L-smoothness with the update rule, we get

\[f(w_{i+1}) -f(w_{i}) \le -\frac{\eta}{2}\Vert \nabla f(w_{i})\Vert^2 \]

Proof: Use L-smoothness condition (c) and the update rule:

\[\begin{align*} f(w_{i+1}) &\le f(w_i) + \langle \nabla f(w_i), -\eta \nabla f(w_i) \rangle + \frac{L\eta^2}{2}\Vert \nabla f(w_i) \Vert^2 \\ &= f(w_i) - \eta(1-\frac{L\eta}{2})\Vert \nabla f(w_i) \Vert^2 \end{align*} \]

Set the learning rate \(\eta = \frac{1}{L}\); then we get:

\[\begin{equation*} \frac{1}{2L}\Vert \nabla f(w_i) \Vert^2 = \frac{\eta}{2}\Vert \nabla f(w_i) \Vert^2 \le f(w_i) - f(w_{i+1}) \le f(w_i) -f(w^*) \end{equation*} \]

For the convergence guarantee (convex case), begin with the lemma:

\[\begin{align*} & f(w_{i+1}) \le f(w_i) - \frac{\eta}{2}\Vert \nabla f(w_i)\Vert^2 ~~(\nabla f(w_i)\text{ alone does not telescope})\\ = ~&f(w_i) - \frac{1}{2\eta}\Vert w_i - w_{i+1}\Vert^2~~(\text{update rule})\\ \le ~&f(w^*) - \langle \nabla f(w_i), w^*-w_i \rangle- \frac{1}{2\eta}\Vert w_i - w_{i+1}\Vert^2 ~~(\text{convex})\\ = ~&f(w^*) - \frac{1}{\eta}\langle w_i - w_{i+1}, w^*-w_i \rangle- \frac{1}{2\eta}\Vert w_i - w_{i+1}\Vert^2~~(\text{update rule})\\ = ~& f(w^*) -\frac{1}{2\eta}(\Vert w_i - w_{i+1}\Vert^2 - 2 \langle w_i - w_{i+1}, w_i-w^*\rangle) ~~(\text{now the structure is clear})\\ = ~& f(w^*) -\frac{1}{2\eta}(\Vert w_i - w_{i+1}\Vert^2 - 2 \langle w_i - w_{i+1}, w_i-w^*\rangle + \Vert w_i -w^*\Vert^2) + \frac{1}{2\eta}\Vert w_i -w^*\Vert^2\\ = ~& f(w^*) + \frac{1}{2\eta}\Vert w_i -w^*\Vert^2 - \frac{1}{2\eta}\Vert w_{i+1} -w^*\Vert^2 \end{align*} \]

Now telescope. The first inequality below holds because \(f(w_i)\) is non-increasing.

\[T(f(w_T)-f(w^*)) \le \sum_{i=0}^{T-1} f(w_{i+1}) - Tf(w^*) \le \frac{\Vert w_0 -w^*\Vert^2 - \Vert w_T -w^*\Vert^2 }{2\eta} \le \frac{\Vert w_0 -w^*\Vert^2 }{2\eta} \]

Hence \(f(w_T) - f(w^*) \le \frac{\Vert w_0 -w^*\Vert^2}{2\eta T}\), i.e., an \(O(\frac{1}{T})\) rate.

\(\mu\)-strongly convex assumption

(a) \(\nabla^2 f(x) \ge \mu I\) (i.e., \(\lambda_{\min}(\nabla^2 f(x)) \ge \mu\))

(b) \(\langle \nabla f(x) - \nabla f(y),x-y \rangle \ge \mu\Vert x - y \Vert^2 \xrightarrow{} \Vert \nabla f(x) - \nabla f(y) \Vert \ge \mu\Vert x - y \Vert\)

(c) \(f(y)-f(x)-\langle \nabla f(x), y-x\rangle \ge \frac{\mu}{2}\Vert y -x \Vert^2\)

As with L-smoothness above, there is one condition each for the Hessian, the gradient, and the function itself. The proofs of equivalence are similar and omitted here.

GD can get a linear convergence rate on an L-smooth \(\mu\)-strongly convex function.

Now the proof. Set the learning rate \(\eta = \frac{1}{L}\).

\[\begin{align*} \Vert w_{T+1} - w^*\Vert^2 &= \Vert w_T - \eta \nabla f(w_T) - w^*\Vert^2 \\ &= \Vert w_T - w^*\Vert^2 - 2\eta \langle \nabla f(w_T), w_T - w^* \rangle + \eta^2 \Vert \nabla f(w_T)\Vert^2\\ &\le \Vert w_T - w^*\Vert^2 + 2\eta \langle \nabla f(w_T), w^* - w_T \rangle + 2\eta(f(w_T) - f(w^*)) ~~(\text{lemma})\\ &\le \Vert w_T - w^*\Vert^2 + 2\eta( f(w^*)-f(w_T)-\frac{\mu}{2}\Vert w_T - w^*\Vert^2)+ 2\eta(f(w_T) - f(w^*)) ~~(\text{strongly convex})\\ &= (1 - \frac{\mu}{L})\Vert w_T - w^*\Vert^2 \end{align*} \]

\[\begin{align*} \Vert w_T - w^*\Vert^2 &\le (1-\frac{\mu}{L})^T \Vert w_0 - w^*\Vert^2~~\text{(telescoping)}\\ f(w_T) - f(w^*) &\le \langle \nabla f(w^*), w_T - w^* \rangle + \frac{L}{2}\Vert w_T - w^*\Vert^2\\ &= \frac{L}{2}\Vert w_T - w^*\Vert^2~~(\text{since } \nabla f(w^*) = 0)\\ &\le \frac{L}{2} (1-\frac{\mu}{L})^T \Vert w_0 - w^*\Vert^2\\ \end{align*} \]

SGD: Stochastic gradient descent

But gradient descent has two limitations:

  • Computing the full gradient is slow for big data
  • It can get stuck at bad stationary points (saddle points and poor local minima)

So we add some randomness to the GD algorithm

\[\begin{align*} w_{t+1} &= w_{t} -\eta G_t \\ \frac{1}{\eta}E[w_t - w_{t+1}] = E[G_t] &= \nabla L(w_t) = \frac{1}{N}\sum_i \nabla L(w_t, x_i, y_i)\\ V(G_t) &= E[\Vert G_t\Vert^2] - \Vert \nabla L(w_t) \Vert^2 ~~(\text{definition of variance})\\ G_t &= \frac{1}{|S|}\sum_{i\in S} \nabla L(w_t, x_i, y_i) ~~\text{for a randomly selected subset } S \end{align*} \]

\(|S|\) is called the mini-batch size, or simply batch size.

  • If \(|S|\) is too small, say, \(|S|\)=1, \(G_t\) still has the right expectation, but with very large variance
  • If \(|S|\) is too big, say, \(|S|\)=n, \(G_t\) is slow to compute
  • Small but not too small \(|S|\) has at least two benefits
    • Has some noise to help escape saddle points or bad local minima
    • Helps to get the right mini-batch statistics for batch normalization (I do not fully understand this point)

If we write the update rule of SGD as \(w_{t+1} = w_{t} -\eta (\nabla L(w_t) + \xi_t)\), then full-batch gradient plus white noise is also a form of SGD.
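As a concrete sketch of mini-batch SGD under these definitions (the least-squares per-example loss, batch size, and step size below are illustrative assumptions, not from the notes):

```python
import numpy as np

def sgd(grad_i, n, w0, eta=0.05, batch_size=16, num_steps=2000, seed=0):
    """Mini-batch SGD: G_t averages per-example gradients over a random subset S."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        S = rng.choice(n, size=batch_size, replace=False)  # random mini-batch of indices
        G = np.mean([grad_i(w, i) for i in S], axis=0)     # unbiased estimate of the full gradient
        w = w - eta * G
    return w

# Illustrative example: l_i(w) = 0.5 * (x_i^T w - y_i)^2, so grad l_i(w) = (x_i^T w - y_i) x_i.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]

print(sgd(grad_i, n=len(X), w0=np.zeros(5)))  # should be close to [1, -2, 0.5, 0, 3]
```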

We run SGD assuming bounded variance: \(V(G_t) \le \sigma^2\).

convergence analysis

Because the update rule is different, we cannot start from the previous lemma; we need a new one.

Lemma: combining L-smoothness with the update rule (and taking \(\eta \le \frac{1}{L}\)), we get

\[E[f(w_{i+1})] \le f(w_i) - \frac{\eta}{2}\Vert \nabla f(w_i)\Vert^2 + \frac{\eta\sigma^2}{2} \]

Proof: Use L-smoothness condition (c) and the update rule:

\[\begin{align*} & E[f(w_{i+1})] \le E[f(w_i) + \langle \nabla f(w_i), w_{i+1} - w_i \rangle + \frac{L}{2}\Vert w_{i+1}-w_i\Vert^2]\\ = ~& f(w_i) + \langle \nabla f(w_i), E[(w_{i+1} - w_i)] \rangle + \frac{L}{2}E[\Vert w_{i+1}-w_i\Vert^2]~~(\text{linearity of expectation})\\ \le ~&f(w_i) - \eta\Vert \nabla f(w_i)\Vert^2 + \frac{\eta}{2}E[\Vert G_i\Vert^2]~~(\text{update rule and } L\eta \le 1)\\ = ~&f(w_i) - \eta\Vert \nabla f(w_i)\Vert^2 + \frac{\eta}{2}(V(G_i)+\Vert \nabla f(w_i)\Vert^2)~~(\text{definition of variance})\\ \le ~& f(w_i) - \frac{\eta}{2}\Vert \nabla f(w_i)\Vert^2 + \frac{\eta\sigma^2}{2} ~~(\text{bounded-variance assumption}) \end{align*} \]

SGD convergence guarantee

Begin from the intermediate bound \(E[f(w_{i+1})] \le f(w_i) - \eta\Vert \nabla f(w_i)\Vert^2 + \frac{\eta}{2}E[\Vert G_i\Vert^2]\) obtained in the proof of the lemma.

\[\begin{align*} E[f(w_{i+1})] \le~& f(w_i) - \eta\Vert \nabla f(w_i)\Vert^2 + \frac{\eta}{2}E[\Vert G_i\Vert^2] \\ \le ~& f(w_i) - \frac{\eta}{2}E[\Vert G_i\Vert^2] + \eta\sigma^2 ~~(\text{definition of variance and } V(G_i)\le\sigma^2)\\ \le ~& f(w^*) - \langle \nabla f(w_i), w^*-w_i \rangle - \frac{\eta}{2}E[\Vert G_i\Vert^2] + \eta\sigma^2 ~~(\text{convex})\\ = ~& f(w^*) - \langle \frac{1}{\eta} E[(w_i - w_{i+1})], w^*-w_i \rangle - \frac{1}{2\eta}E[\Vert w_i - w_{i+1} \Vert^2] + \eta\sigma^2~~(\text{update rule})\\ = ~& f(w^*) -\frac{1}{2\eta}E(\Vert w_i - w_{i+1} \Vert^2 -2\langle w_i - w_{i+1}, w_i - w^*\rangle ) + \eta\sigma^2 ~~(\text{now the structure is clear})\\ = ~& f(w^*) -\frac{1}{2\eta}E(\Vert w_i - w_{i+1} \Vert^2 -2\langle w_i - w_{i+1}, w_i - w^*\rangle + \Vert w_i-w^*\Vert^2 - \Vert w_i-w^*\Vert^2)+ \eta\sigma^2\\ = ~& f(w^*) + \frac{1}{2\eta}E(\Vert w_i-w^*\Vert^2 -\Vert w_{i+1}-w^*\Vert^2) + \eta\sigma^2 \end{align*} \]

Now telescope:

\[\begin{align*} \sum_{i=0}^{T-1} E[f(w_{i+1})] - Tf(w^*) &\le \frac{1}{2\eta} E(\Vert w_0-w^*\Vert^2 -\Vert w_{T}-w^*\Vert^2) + T\eta\sigma^2 \le \frac{1}{2\eta} \Vert w_0-w^*\Vert^2 + T\eta\sigma^2 \\ E[f(\bar{w})] - f(w^*)\le E~[\frac{1}{T}\sum_{i=0}^{T-1}f(w_{i+1})] - f(w^*) &\le \frac{\Vert w_0-w^*\Vert^2}{2\eta T} + \eta\sigma^2~~(\text{Jensen's inequality; } \bar{w} \text{ is the average of the iterates}) \end{align*} \]

Convergence rate: \(\frac{1}{\sqrt{T}}\).

Proof: set \(\eta = \frac{\epsilon}{2\sigma^2}\).

\[E[f(\bar{w})] - f(w^*)\le \frac{\Vert w_0-w^*\Vert^2 \sigma^2}{\epsilon T} +\frac{\epsilon}{2} ~~(\text{set the right-hand side} = \epsilon) \]

Solving gives \(T = \frac{2\Vert w_0-w^*\Vert^2 \sigma^2}{\epsilon^2}\), i.e., \(\epsilon = O(\frac{1}{\sqrt{T}})\), so the convergence rate is \(\frac{1}{\sqrt{T}}\).

Thinking: It seems SGD is much slower than GD, why do we still use it?

  • The noise may not be that large in practice
  • Sometimes noise helps, especially in non-convex settings (we will see this later)
  • \(\epsilon\) does not need to be that small: we do not care about the training loss but about the population loss, and driving the training loss much lower does not guarantee a better population loss than stopping at, say, \(\epsilon = 0.1\)
  • The variance of the noise can be reduced gradually along the way (making SGD potentially as fast as GD)

If the variance \(\sigma^2\) is the problem, can we make it smaller to get a faster convergence rate?

The answer is yes: we can reduce the variance and get a \(\frac{1}{T}\) rate for convex functions, and linear convergence for strongly convex functions.

Unfortunately, it seems to only work for convex functions and is not useful for deep learning (yet).

There are many such algorithms, e.g., SAG, SVRG, SAGA, etc.

SVRG

Assumptions: L-smooth and \(\mu\)-strongly convex.

Update rule

For \(s = 1, 2, ..., T, ...\)
\(\bar{w} = \bar{w}_{s-1}\) (initialize the starting point of this epoch)

\(\nabla f(\bar{w}) = \frac{1}{N}\sum _{i=1}^N\nabla l_i(\bar{w})\) (compute the full gradient at the starting point)

\(w_0 = \bar{w}\) (initialize the starting point of the inner loop below)

For \(t = 1, 2, ..., m\)

Randomly pick \(i \in [N]\) (use a single example's loss \(l_i\) to compute a gradient)

\(w_t = w_{t-1} -\eta(\nabla l_i(w_{t-1}) - \nabla l_i(\bar{w}) + \nabla f(\bar{w}))\) (update using this single variance-reduced gradient)

\(\bar{w}_s =w_m\) (after the inner loop ends, its final point becomes the next epoch's starting point)

Remark: in the proof below we take \(\bar{w}_s = w_{\text{rand}(t)}\), a uniformly random inner iterate; in practice one uses the choice above, \(\bar{w}_s = w_m\).
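Below is a minimal Python sketch of this SVRG loop, using the practical choice \(\bar{w}_s = w_m\); the finite-sum least-squares problem and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def svrg(grad_i, n, w_bar, eta=0.02, m=200, num_epochs=30, seed=0):
    """SVRG: one full gradient per epoch, then m cheap variance-reduced single-gradient steps."""
    rng = np.random.default_rng(seed)
    for _ in range(num_epochs):
        full_grad = np.mean([grad_i(w_bar, i) for i in range(n)], axis=0)  # grad f(w_bar)
        w = w_bar.copy()
        for _ in range(m):
            i = rng.integers(n)                              # pick a single random example
            u = grad_i(w, i) - grad_i(w_bar, i) + full_grad  # variance-reduced gradient estimate
            w = w - eta * u
        w_bar = w                                            # practical choice: w_bar_s = w_m
    return w_bar

# Illustrative example: l_i(w) = 0.5 * (x_i^T w - y_i)^2 with exact labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]

print(svrg(grad_i, n=len(X), w_bar=np.zeros(5)))  # should approach [1, -2, 0.5, 0, 3]
```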

thinking

We compute the full gradient only once per epoch (once every \(m\) inner steps), so although a full gradient is expensive, its amortized cost is small.

The underlying intuition is that \(\nabla f(w_t) -\nabla f(\bar{w})\approx \nabla l_i(w_t) - \nabla l_i(\bar{w})\), so the estimator stays unbiased while its variance shrinks as long as \(w_t\) stays close to \(\bar{w}\).

Convergence analysis

First, let us study the effect of a single inner-loop step.

\[\begin{align*} E\Vert w_t - w^*\Vert^2 &= E\Vert w_{t-1} - \eta u_t - w^*\Vert^2~~(\text{update rule})\\ &= \Vert w_{t-1}- w^*\Vert^2 - 2\langle w_{t-1}- w^*, \eta E[u_t] \rangle + \eta^2E[\Vert u_t\Vert^2] \end{align*} \]

The index \(i\) in the inner loop is random; to remove this randomness we take expectations, which gives us much more room to maneuver.

\(E[u_t] = E[(\nabla l_i(w_{t-1}) - \nabla l_i(\bar{w}) + \nabla f(\bar{w}))]= E[\nabla l_i(w_{t-1})] - E[\nabla l_i(\bar{w})] + \nabla f(\bar{w})= \nabla f(w_{t-1})\)

As for \(E[\Vert u_t\Vert^2]\), we bound it separately; the goal is an upper bound that no longer involves an expectation.

\[\begin{align*} E[\Vert u_t\Vert^2] &= E[\Vert\nabla l_i(w_{t-1}) - \nabla l_i(\bar{w}) + \nabla f(\bar{w})\Vert^2] \\ &\le 2E[\Vert\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\Vert^2] + 2E[\Vert \nabla l_i(\bar{w})-\nabla l_i(w^*)- \nabla f(\bar{w})\Vert^2]~~(\text{introduce } w^*;~ \Vert a+b\Vert^2 \le 2\Vert a\Vert^2 + 2\Vert b\Vert^2)\\ &= 2E[\Vert\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\Vert^2] + 2E[\Vert \nabla l_i(\bar{w})-\nabla l_i(w^*)- E[\nabla l_i(\bar{w})]\Vert^2]\\ &~~~~~~(\text{everything else is a single gradient, so turn this full gradient into a single gradient or remove it})\\ &= 2E[\Vert\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\Vert^2] + 2E[\Vert \nabla l_i(\bar{w})-\nabla l_i(w^*)- E[\nabla l_i(\bar{w}) -\nabla l_i(w^*) ]\Vert^2]\\ &~~~~~~(E[\nabla l_i(w^*)] = \nabla f(w^*) = 0\text{, so this term can be added freely})\\ &\le 2E[\Vert\nabla l_i(w_{t-1}) - \nabla l_i(w^*)\Vert^2] + 2E[\Vert \nabla l_i(\bar{w})-\nabla l_i(w^*)\Vert^2]\\ &~~~~~~(V(X) = E[\Vert X-EX\Vert^2] = E[\Vert X\Vert^2] - \Vert EX\Vert^2 \le E[\Vert X\Vert^2]\text{; at last the full-gradient term is gone})\\ &\le 4L (f(w_{t-1}) - f(w^*) + f(\bar{w}) - f(w^*)) \end{align*} \]

(The last step uses the fact that \(E[\Vert \nabla l_i(w) - \nabla l_i(w^*) \Vert^2] \le 2L (f(w) - f(w^*))\); see the blog post [机器学习] 2. 随机方差缩减梯度下降 SVRG - shiys22 - 博客园 (cnblogs.com). Proving this fact requires L-smoothness of each \(l_i\).)

Apart from that final step, the bounds above use neither smoothness nor convexity.

Now substitute back into the earlier expression.

\[\begin{align*} E\Vert w_t - w^*\Vert^2 &= \Vert w_{t-1}- w^*\Vert^2 - 2\langle w_{t-1}- w^*, \eta E[u_t] \rangle + \eta^2E[\Vert u_t\Vert^2] \\ &\le \Vert w_{t-1}- w^*\Vert^2 - 2\eta \langle w_{t-1}- w^*, \nabla f(w_{t-1}) \rangle + 4\eta^2 L (f(w_{t-1}) - f(w^*) + f(\bar{w}) - f(w^*))\\ &\le \Vert w_{t-1}- w^*\Vert^2 - 2\eta (f(w_{t-1})-f(w^*))+ 4\eta^2 L (f(w_{t-1}) - f(w^*) + f(\bar{w}) - f(w^*))\\ &~~(\text{inner products that cannot be completed into a square must be eliminated; convexity})\\ &= \Vert w_{t-1}- w^*\Vert^2 +2\eta(2\eta L-1) (f(w_{t-1})-f(w^*))+ 4\eta^2 L (f(\bar{w}) - f(w^*))\\ 2\eta(1- 2\eta L) (f(w_{t-1})-f(w^*)) &\le 4\eta^2 L (f(\bar{w}) - f(w^*)) + \Vert w_{t-1}- w^*\Vert^2 - E\Vert w_t - w^*\Vert^2 \end{align*} \]

That is, \(E\Vert w_t - w^*\Vert^2 - E[\Vert w_{t-1}- w^*\Vert^2] \le \dots\), which clearly telescopes; of course we telescope over the whole inner loop.

There is some bookkeeping about where the expectation \(E\) goes; let me explain briefly.

The derivation above considered a single inner-loop step \(w_{t-1} \xrightarrow{} w_t\) (conditioning on \(w_{t-1}\)), so \(\Vert w_{t-1}- w^*\Vert^2\) carried no expectation. Now that we telescope over the whole inner loop, every term gets an \(E\).

For the whole inner loop, however, \(f(\bar{w})\) is fixed once the epoch is initialized.

Summing the left-hand side:

\[\sum_{t=1}^m E[f(w_{t-1})-f(w^*)] = m(E[f(\bar{w}_s)]-f(w^*)) = mE[f(\bar{w}_s)-f(w^*)] \]

Summing the right-hand side:

\[\begin{align*} &\sum_{t=1}^m 4\eta^2 L (f(\bar{w}) - f(w^*)) + E\Vert w_{t-1}- w^*\Vert^2 - E\Vert w_t - w^*\Vert^2\\ =~ &4\eta^2 Lm(f(\bar{w}) - f(w^*)) + E\Vert w_0- w^*\Vert^2 - E\Vert w_m - w^*\Vert^2\\ \le~ & 4\eta^2 Lm(f(\bar{w}) - f(w^*)) + \Vert w_0- w^*\Vert^2\\ =~ &4\eta^2 Lm(f(\bar{w}) - f(w^*)) + \Vert \bar{w}- w^*\Vert^2~~(w_0 = \bar{w})\\ \le~ &4\eta^2 Lm(f(\bar{w}) - f(w^*)) +\frac{2}{\mu}(f(\bar{w})-f(w^*))~~(\text{strongly convex})\\ =~ &(4\eta^2 Lm+\frac{2}{\mu}) (f(\bar{w})-f(w^*)) \end{align*} \]

Next we telescope over the outer loop, from \(\bar{w}\) to \(\bar{w}_s\) in each epoch, so \(f(\bar{w})\) also gets an \(E\):

\[\begin{align*} 2\eta(1- 2\eta L)mE[f(\bar{w}_s)-f(w^*)] \le (4\eta^2 Lm+\frac{2}{\mu}) E[f(\bar{w})-f(w^*)] \end{align*} \]

This is linear convergence (for suitable \(\eta\) and \(m\) the ratio between the two sides is a constant less than 1), and it is faster than GD when the condition number \(\frac{L}{\mu}\) is large.

Linear coupling of GD & MD

What do we find in GD? \(f(w_t) - f(w_{t+1})\ge \frac{1}{2L}\Vert \nabla f(w_t)\Vert ^2\). Therefore, if the gradient is big, GD makes progress. What if the gradient is small?

We want to have another optimization method to make use of a small gradient and still converge to the optimal point.

\[\forall x, f(x) \ge f(x_0) + \langle \nabla f(x_0), x -x_0\rangle ~~(\text{convex}) \]

If we have a sequence of queried points, then:

\[\begin{align*} \forall x, f(x) &\ge \frac{1}{t}\sum_{i=0}^{t-1} (f(x_i) + \langle \nabla f(x_i), x -x_i\rangle) \\ &\ge f(\bar{x}) + \frac{1}{t}\sum_{i=0}^{t-1} \langle \nabla f(x_i), x -x_i\rangle~~(\text{convex})\\ f(\bar{x}) - f(x^*) &\le \frac{1}{t}\sum_{i=0}^{t-1} \langle \nabla f(x_i), x_i - x^*\rangle \end{align*} \]

The quantity on the right-hand side is called the regret. Observe that the smaller the gradients, the smaller the regret and the tighter this bound, which is exactly our goal: make use of small gradients and still converge to the optimal point.

The optimization method is then clear: at each step, pick the next point so as to keep the regret small.

MD: mirror descent

As a first-order method we can still only use gradient information to pick the next point, but the linear lower-bound hyperplane given by the gradient extends to negative infinity, which is not what we want; we only want to take a small step. How does GD handle this?

\[\begin{align*} x_{t+1} &= \mathop{\arg\min}_{x} (f(x_t) + \langle \nabla f(x_t), x- x_t\rangle + \frac{1}{2\eta} \Vert x-x_{t}\Vert^2)~~(\text{linearization + regularizer})\\ &= \mathop{\arg\min}_{x} (\eta \langle \nabla f(x_t), x- x_t\rangle + \frac{1}{2} \Vert x-x_{t}\Vert^2)\\ \end{align*} \]

With this regularizer \(\frac{1}{2} \Vert x-x_{t}\Vert^2\), the minimizer is exactly \(x_{t+1} = x_t - \eta \nabla f(x_t)\).

Now generalize this regularizer; this leads to the Bregman divergence:

\[V_x(y) = w(y) - \langle \nabla w(x), y-x \rangle - w(x) \]

\(w(x)\) is a 1-strongly convex function, so \(V_x(y) \ge \frac{1}{2}\Vert y-x\Vert^2\). (If \(w(x)\) is the simplest 1-strongly convex function \(\frac{1}{2}\Vert x\Vert^2\), the regularizer is exactly the GD regularizer \(\frac{1}{2} \Vert x-x_{t}\Vert^2\).)

The update rule of MD is:

\[x_{k+1} = \mathop{\arg\min}_{x}(\alpha \langle\nabla f(x_k), x-x_k \rangle + V_{x_k}(x)) \]
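For a concrete instance of this update rule: with the negative-entropy mirror map \(w(x) = \sum_i x_i \log x_i\) (1-strongly convex on the probability simplex with respect to the \(\ell_1\) norm) and the minimization restricted to the simplex, the argmin has a closed multiplicative form (exponentiated gradient). A minimal sketch; the quadratic objective and step size are illustrative assumptions.

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, alpha=0.5, num_steps=1000):
    """Entropic mirror descent on the simplex: with w(x) = sum_i x_i log x_i the
    update becomes the multiplicative step x_{k+1} proportional to x_k * exp(-alpha * grad_f(x_k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x * np.exp(-alpha * grad_f(x))
        x = x / x.sum()                      # re-normalize back onto the simplex
    return x

# Illustrative example: minimize f(x) = 0.5 * ||x - c||^2 over the probability simplex.
c = np.array([0.7, 0.2, 0.05, 0.05])         # c already lies on the simplex, so the minimizer is c
grad = lambda x: x - c
print(mirror_descent_simplex(grad, x0=np.ones(4) / 4))  # should be close to c
```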

convergence

Lemma: triangle equality

\[\begin{align*} \langle -\nabla V_{x_1}(x_2), x_2 - x_3\rangle &= \langle \nabla w(x_1) - \nabla w(x_2), x_2 - x_3\rangle \\ &= - \langle \nabla w(x_1), x_3 - x_1\rangle + \langle \nabla w(x_1), x_2 - x_1\rangle + \langle \nabla w(x_2), x_3 - x_2 \rangle\\ & (\text{this step is natural: to form terms like } V_x(x') \text{ we first rewrite each inner product in the form } \langle \nabla w(x), x' - x\rangle)\\ &= (w(x_3) - \langle \nabla w(x_1), x_3 - x_1\rangle -w(x_1)) \\ &~~~~~- (w(x_2) - \langle \nabla w(x_1), x_2 - x_1\rangle -w(x_1))\\ &~~~~~-( w(x_3) - \langle \nabla w(x_2), x_3 - x_2 \rangle -w(x_2))~~(\text{the added } w(\cdot) \text{ terms cancel exactly})\\ &= V_{x_1}(x_3) - V_{x_1}(x_2) - V_{x_2}(x_3) \end{align*} \]

\[\begin{align*} \alpha(f(x_k) - f(x^*)) &\le \alpha \langle \nabla f(x_k), x_k - x^* \rangle ~~ (\text{convex})\\ &= \langle \alpha\nabla f(x_k), x_k - x_{k+1}\rangle + \langle \alpha\nabla f(x_k), x_{k+1}-x^*\rangle~~(\text{insert } x_{k+1}\text{: these are the only three special points})\\ &= \langle \alpha\nabla f(x_k), x_k - x_{k+1}\rangle + \langle -\nabla V_{x_k}(x_{k+1}), x_{k+1}-x^* \rangle\\ &(\text{since } x_{k+1} \text{ minimizes } \alpha \langle\nabla f(x_k), x-x_k \rangle + V_{x_k}(x)\text{, the gradient there is } 0)\\ &= \langle \alpha\nabla f(x_k), x_k - x_{k+1}\rangle + V_{x_{k}}(x^*) - V_{x_{k+1}}(x^*) -V_{x_{k}}(x_{k+1}) ~~(\text{Lemma: triangle equality})\\ &= \langle \alpha\nabla f(x_k), x_k - x_{k+1}\rangle -V_{x_{k}}(x_{k+1}) + V_{x_{k}}(x^*) - V_{x_{k+1}}(x^*) ~~(\text{group terms in the same variables})\\ &\le \langle \alpha\nabla f(x_k), x_k - x_{k+1}\rangle -\frac{1}{2}\Vert x_k - x_{k+1}\Vert^2 + V_{x_{k}}(x^*) - V_{x_{k+1}}(x^*) ~~ (\text{1-strongly convex})\\ &\le \frac{\alpha^2}{2}\Vert \nabla f(x_k)\Vert^2 + V_{x_{k}}(x^*) - V_{x_{k+1}}(x^*)~~\text{(complete the square; for inner products, try completing the square first, convexity second)}\\ \end{align*} \]

The last two terms telescope:

\[\begin{align*} \alpha T(f(\bar{x}) - f(x^*)) \le \alpha(\sum_{k=0}^{T-1}f(x_k) - Tf(x^*)) &\le \frac{\alpha^2}{2}\sum_{k=0}^{T-1}\Vert \nabla f(x_k)\Vert^2 + V_{x_{0}}(x^*) - V_{x_{T}}(x^*) \le \frac{\alpha^2}{2}\sum_{k=0}^{T-1}\Vert \nabla f(x_k)\Vert^2 + V_{x_{0}}(x^*)\\ f(\bar{x}) - f(x^*) &\le \frac{\alpha\rho^2}{2} + \frac{V_{x_{0}}(x^*)}{\alpha T} ~~(\text{assuming } \Vert \nabla f(x_k)\Vert \le \rho)\\ f(\bar{x}) - f(x^*) &\le \frac{\alpha\rho^2}{2} + \frac{\theta}{\alpha T} ~~(\text{assuming } V_{x_0}(x^*) \le \theta)\\ \end{align*} \]

Let \(\alpha = \frac{\sqrt{2\theta}}{\rho\sqrt{T}}\)

\[\begin{align*} f(\bar{x}) - f(x^*) &\le \frac{\rho\sqrt{2\theta}}{\sqrt{T}} = \epsilon\\ T &= \Omega(\frac{\rho^2}{\epsilon^2}) \end{align*} \]

So MD has a \(\frac{1}{\sqrt{T}}\) convergence rate.

Linear coupling

Can we combine GD and MD?

As wishful thinking, assume \(\Vert \nabla f(x)\Vert\) is either always larger than some threshold \(\rho\) or always smaller than it.

Consider the task of going from \(f(x_0) - f(x^*) \le 2\epsilon\) to \(f(x_T) - f(x^*) \le \epsilon\).

  • Run MD: it needs \(\Omega(\rho^2/\epsilon^2)\) steps
  • Run GD: it decreases the objective by \(\Vert\nabla f\Vert^2/2L \ge \rho^2/2L\) in each step, so it needs \(2L\epsilon/\rho^2\) steps

Combining the two cases, the number of steps needed is \(\max\{\rho^2/\epsilon^2, 2L\epsilon/\rho^2\}\). Tuning \(\rho^2 = \epsilon\sqrt{L\epsilon}\) to balance the two terms, we get \(T = O(\sqrt{L/\epsilon})\), which corresponds to a \(\frac{1}{T^2}\) convergence rate.

But this is an idealized situation; we rarely face such a function. For an arbitrary function, if GD and MD would move to two different points, how should we choose?

This is why we use linear coupling: in every iteration, we run a GD step and an MD step simultaneously to get \(y_k\) and \(z_k\) respectively, and then couple them.

\[\begin{align*} x_0 &= y_0 = z_0\\ y_k &= GD(x_k)\\ z_k &= Mirr_{z_{k-1}}(\alpha \nabla f(x_k)) = \mathop{\arg\min}_{y} \{V_{z_{k-1}}(y) + \langle\alpha \nabla f(x_k) , y - z_{k-1}\rangle \}\\ x_{k+1} &= \tau z_{k} + (1-\tau) y_k \end{align*} \]

\(\alpha\) and \(\tau\) are hyperparameters to be decided later.
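A minimal sketch of one possible implementation, specialized to the Euclidean mirror map \(w(x)=\frac{1}{2}\Vert x\Vert^2\) so that the mirror step is simply \(z_k = z_{k-1} - \alpha\nabla f(x_k)\); the quadratic objective and the fixed \(\alpha\) are illustrative assumptions.

```python
import numpy as np

def linear_coupling(grad_f, L, x0, alpha=0.3, num_steps=300):
    """Linear coupling with the Euclidean mirror map: y_k is a GD step at x_k,
    z_k is a mirror step at z_{k-1}, and x_{k+1} = tau*z_k + (1 - tau)*y_k."""
    tau = 1.0 / (1.0 + alpha * L)      # chosen so that (1 - tau) / tau = alpha * L
    x = z = y = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad_f(x)
        y = x - g / L                   # GD step
        z = z - alpha * g               # MD step (Euclidean Bregman divergence)
        x = tau * z + (1.0 - tau) * y   # couple the two iterates
    return y                            # GD-side iterate (the analysis bounds the average of the x_k)

# Illustrative example: quadratic f(x) = 0.5 * x^T A x - b^T x with L = lambda_max(A).
A = np.array([[10.0, 2.0], [2.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()
print(linear_coupling(grad, L, x0=np.zeros(2)), np.linalg.solve(A, b))  # iterate vs. exact minimizer
```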

\[\begin{align*} \alpha (f(x_{k+1}) - f(x^*)) \le &~~ \alpha \langle \nabla f(x_{k+1}), x_{k+1} - x^* \rangle ~~ (\text{convex})\\ = &~~ \alpha \langle \nabla f(x_{k+1}), x_{k+1} - z_k \rangle + \alpha \langle \nabla f(x_{k+1}), z_{k} - x^* \rangle\\&~~\text{(split the term; then use the optimality of } z_{k+1}\text{, the triangle equality, 1-strong convexity of } w\text{, and complete the square, as in the MD analysis)}\\ \le &~~ \text{...} + (\frac{\alpha^2}{2}\Vert \nabla f(x_{k+1})\Vert^2 + V_{z_{k}}(x^*) - V_{z_{k+1}}(x^*)) ~~(\text{MD})\\ \le &~~ \text{...} + \Big(\alpha^2L \big(f(x_{k+1}) - f(y_{k+1}) \big)+ V_{z_{k}}(x^*) - V_{z_{k+1}}(x^*)\Big)~~(\text{GD})\\ = &~~\frac{(1-\tau)\alpha}{\tau} \langle \nabla f(x_{k+1}), y_k - x_{k+1} \rangle + (\text{...})~~(\text{linear coupling})\\ \le&~~\frac{(1-\tau)\alpha}{\tau} (f(y_k) - f(x_{k+1})) + (\text{...})~~(\text{convex}) \end{align*} \]

We can choose \(\tau\) such that \(\frac{1-\tau}{\tau} = \alpha L\); then (with \(\tau\) fixed, the only remaining degree of freedom is \(\alpha\)):

\[\begin{align*} \alpha (f(x_{k+1}) - f(x^*)) \le &~~ \alpha^2L (f(y_k) - f(x_{k+1})) + (\alpha^2L (f(x_{k+1}) - f(y_{k+1}))+ V_{z_{k}}(x^*) - V_{z_{k+1}}(x^*))\\ \le &~~ \alpha^2L (f(y_{k}) - f(y_{k+1}))+ V_{z_{k}}(x^*) - V_{z_{k+1}}(x^*) \end{align*} \]

Now this telescopes!

\[\begin{align*} \alpha T (f(\bar{x}) - f(x^*)) \le& ~~ \alpha \sum_{k=1}^{T}(f(x_{k}) - f(x^*))~~ \text{(Jensen's inequality)}\\ \le &~~ \alpha^2L (f(y_{0}) - f(y_{T}))+ V_{z_{0}}(x^*) - V_{z_{T}}(x^*)\\ \le &~~ \alpha^2 Ld + \theta~~(f(y_0) - f(y_T) \le d,~ V_{z_0}(x^*) \le \theta)\\ f(\bar{x}) - f(x^*) \le &~~ \frac{\alpha^2 Ld + \theta}{\alpha T} = \frac{1}{T} (Ld\alpha+\frac{\theta}{\alpha})\\ \end{align*} \]

We can choose \(\alpha = \sqrt{\theta/Ld}\), then:

\[f(\bar{x}) - f(x^*) \le \frac{2\sqrt{L\theta d}}{T} \]

It seems the convergence rate is \(\frac{1}{T}\), but this rate depends on \(d\). By restarting and halving the gap, \(d \xrightarrow{} \frac{d}{2} \xrightarrow{} ... \xrightarrow{} 2\epsilon \xrightarrow{} \epsilon\), the total cost becomes:

\[O(\sum_{i=0}^{log(d/\epsilon)}4\sqrt{\frac{L\theta}{2^i\epsilon}}) = O(4\sqrt{\frac{L\theta}{\epsilon}}\sum_{i=0}^{log(d/\epsilon)}(\frac{1}{\sqrt{2}})^i) = O(\sqrt{\frac{L\theta}{\epsilon}}) \]

Matrix completion

For a matrix \(A\), some entries are known while others are unknown, and we want to predict the unknown ones. Namely, we want to find a low-rank \(A'\) that matches \(A\) best on the known entries.

Assumption

  • low rank

  • The known entries are uniformly distributed in their positions.

  • Incoherence (like smooth assumption in unsupervised learning)

    • We hope all coordinates are comparable, rather than only a few of them being too large or too small.
    • Maybe the "gradient" between the entries can not be too sharp.
    • \(\frac{1}{\sqrt{n}}(1, 1, ..., 1)\) is better than \((1, 0, ..., 0)\)
    • Mathematically:
    • \[\begin{align*} A =U\Sigma V^T &, U \in \mathbb{R}^{n \times r}, V \in \mathbb{R}^{m \times r}\\ \|e_i^TU\| &\leq \sqrt{\frac{ur}{n}}, \forall i \in [n]\\ \|e_j^TV\| &\leq \sqrt{\frac{ur}{m}}, \forall j \in [m]\\ 1 \leq u &\leq \min\{n, m\}/r \end{align*} \]

    • Here \(U \in \mathbb{R}^{n \times r}\), \(V \in \mathbb{R}^{m \times r}\). (Note that we do not need to extend \(U\) and \(V\) to square matrices.)

algorithm

Convex relaxation:

  • How would we compute rank(A) naively? Write \(A = U\Sigma V^T\); then \(rank(A) = \Vert \Sigma\Vert_0\).
  • But minimizing rank(A) this way is hard, because the zero norm is not a good function (it is not convex).
  • So relax the zero norm to the L1 norm, as in sparse recovery; applied to the singular values this is the nuclear norm \(\Vert \Sigma \Vert_1\), which we minimize instead.

This is equivalent to a convex SDP problem.

However, SDP solvers cannot handle large matrices, say larger than \(100 \times 100\).

Alternating least squares minimization

Use \(\Omega\) to denote the mask of known entries; \(P_\Omega(A)\) keeps the known entries and fills the unknown positions with 0.
Assume the solution has the form \(A = UV^T\).

\[U, V = \mathop{\arg\min}_{U\in \mathbb{R}^{n\times r},~ V\in \mathbb{R}^{m\times r}}\Vert P_\Omega(UV^T) - P_\Omega(A)\Vert^2 \]

The loss is nonconvex; for example, \(U' = \frac{U}{2}, V' = 2V\) attains the same loss, so the minimizer is far from unique.

\[\begin{align*} V_{t+1} &= \mathop{\arg\min}_{V} \Vert P_\Omega(U_tV^T) - P_\Omega(A)\Vert^2\\ U_{t+1} &= \mathop{\arg\min}_{U} \Vert P_\Omega(UV_{t+1}^T) - P_\Omega(A)\Vert^2 \end{align*} \]

Every subproblem is convex and can be solved efficiently.
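A minimal ALS sketch along these lines; the ridge term `reg`, the random initialization, and the synthetic rank-2 example are implementation choices not specified in the notes.

```python
import numpy as np

def als_matrix_completion(A, mask, r, num_iters=50, reg=1e-3, seed=0):
    """Alternating least squares: with U fixed, each row of V solves a small ridge
    least-squares problem over that column's observed entries, and vice versa."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    U = rng.normal(scale=0.1, size=(n, r))
    V = rng.normal(scale=0.1, size=(m, r))
    for _ in range(num_iters):
        for j in range(m):                                  # update V with U fixed
            rows = mask[:, j]
            Uj = U[rows]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg * np.eye(r), Uj.T @ A[rows, j])
        for i in range(n):                                  # update U with V fixed
            cols = mask[i, :]
            Vi = V[cols]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg * np.eye(r), Vi.T @ A[i, cols])
    return U, V

# Illustrative example: recover a random rank-2 matrix from ~60% of its entries.
rng = np.random.default_rng(1)
A_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(A_true.shape) < 0.6
U, V = als_matrix_completion(A_true, mask, r=2)
print(np.linalg.norm(U @ V.T - A_true) / np.linalg.norm(A_true))  # relative recovery error (small)
```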

Non-convex Optimization

Is convex landscape the only good case for GD? No.

Another important case: no-spurious-local-min.

  • no-spurious-local-min: If all the local min are equally good, it suffices to find one of them! (Although nonconvex)

Escape saddle point

For stationary point: \(\nabla L(w) = 0\):

  • If \(\nabla^2 L(w) \ge 0\), it's a local minimum or a flat saddle point. The latter means there is no good direction out of the saddle, so all first-order algorithms will fail there.
  • If \(\nabla^2 L(w)\) has both \(+/-\) eigenvalues, it's a strict saddle point, which means at least one direction to escape.
  • We call a function "strict saddle" if it does not contain any flat saddle points.

An informal Theorem

SGD will escape all saddle points and local maxima and converge to a local minimum after a polynomial number of steps, under the following assumptions.

  • \(L(w)\) is a smooth, bounded and strict saddle.
  • \(\nabla^2 L(w)\) is smooth
  • SGD noise has non-negligible variance in every direction with constant probability

Significance: by this theorem, if all local minima are equally good and the function satisfies the conditions above, then SGD converges to a global minimum.

proof

For the current point \(w_0\), we discuss what happens with SGD.

Case 1: \(\Vert \nabla f(w_0)\Vert\) is big

Since the gradient is large, the noise does not affect the decreasing trend of the function value.

Case 2: \(\Vert \nabla f(w_0)\Vert\) is small

Subcase 1: we are close to a local minimum

A local minimum is a trap: although the gradient is noisy, the probability of getting out is very small. This guarantees that we do not leave a minimum after converging to it.

Subcase 2: we are close to a saddle point

There must be a negative-curvature direction by the strict-saddle property. A random perturbation will then have a positive projection onto this direction, and following it lets us escape the saddle.

Escape local minimum

If the local minimum trap is too wide, then noise makes no difference.

But if the local-minimum trap is sharp and narrow, the noise can kick us out.

Informal Proof

Update rule:

\[\begin{align*} w_{t+1} &= w_t - \eta(\nabla L(w_t) + \omega_t)\\ y_{t} &= w_t - \eta\nabla L(w_t)~~(y_t \text{ is the corresponding noiseless GD point}) \end{align*} \]

So we can get

\[\begin{align*} y_{t+1} &= w_{t+1} - \eta\nabla L(w_{t+1}) = y_t -\eta\omega_t - \eta\nabla L(y_t -\eta\omega_t)\\ E[y_{t+1}] &= E[y_t] - \eta E[\nabla L(y_t - \eta\omega_t)]~~(\text{using } E[\omega_t] = 0) \end{align*} \]

In expectation, this is like running GD with the gradient averaged over a neighborhood of \(y_t\), i.e., like convolving the gradient with the noise distribution.
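A tiny numerical illustration of this smoothing view, estimating \(E[\nabla L(y_t - \eta\omega_t)]\) by Monte Carlo; the 1-D loss (a sharp narrow dip plus a wide bowl), the starting point, and the noise scale are all illustrative assumptions.

```python
import numpy as np

# Illustrative 1-D loss: a sharp narrow dip near w = 0 plus a wide minimum near w = 3.
L      = lambda w: 0.1 * (w - 3.0) ** 2 - 2.0 * np.exp(-50.0 * w ** 2)
grad_L = lambda w: 0.2 * (w - 3.0) + 200.0 * w * np.exp(-50.0 * w ** 2)

rng = np.random.default_rng(0)
eta, w0 = 0.5, 0.1                           # w0 sits on the wall of the narrow trap

exact = grad_L(w0)                           # positive: plain GD is pulled back into the narrow trap
smoothed = np.mean([grad_L(w0 - eta * rng.normal()) for _ in range(100_000)])
print(exact, smoothed)  # the averaged ("convolved") gradient is negative: it points toward the wide minimum
```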