ML Theory is absolutely unhinged!!!!!
Continuing from the previous post: GD converges at rate \(\frac 1T\), while SGD only manages \(\frac 1{\sqrt T}\). There are many ways to speed up SGD. One family of algorithms makes the variance decay over the iterations, eventually converging at the same order as GD (rate \(\frac 1T\) for convex and \(L\)-smooth objectives; a linear rate for strongly convex and \(L\)-smooth ones). This post covers one of them: SVRG (stochastic variance reduced gradient).
The idea behind SVRG: suppose we want to estimate \(\mathbb EX\) for a random variable \(X\) by Monte Carlo, and we have another random variable \(Y\) that is strongly correlated with \(X\) and whose mean \(\mathbb EY\) is easy to compute. Then we can use
\[\theta_\alpha = \alpha(X - Y) + \mathbb EY
\]
to estimate \(\mathbb EX\), where
\[\mathbb E\theta_\alpha = \alpha \mathbb EX + (1 - \alpha) \mathbb EY
\]
\[\mathrm{Var}(\theta_\alpha) = \mathrm{Var}(\alpha(X - Y)) = \alpha^2\mathrm{Var}(X - Y)
\]
When \(\alpha = 1\), \(\mathbb E\theta_\alpha = \mathbb EX\) and \(\mathrm{Var}(\theta_\alpha) = \mathrm{Var}(X - Y)\): the variance no longer depends on \(\mathrm{Var}(X)\) itself, only on how strongly \(X\) and \(Y\) are correlated.
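To see the variance reduction numerically, here is a quick sanity check in Python (a toy setup of my own; only numpy is assumed): \(Y\) has a known mean, \(X\) is strongly correlated with \(Y\), and we compare the variance of the raw samples of \(X\) with the control-variate samples \(\theta_1 = (X - Y) + \mathbb EY\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy setup: Y ~ N(0, 1) with known mean E[Y] = 0, and X = Y + small noise,
# so X and Y are strongly correlated. We want to estimate E[X] (= 0 here).
y = rng.normal(0.0, 1.0, n)
x = y + rng.normal(0.0, 0.1, n)
ey = 0.0  # E[Y], assumed cheap to compute exactly

naive = x             # plain Monte Carlo samples of X
theta = (x - y) + ey  # control-variate samples, alpha = 1

print("naive variance:", naive.var())  # ~ Var(X)     ~ 1.01
print("theta variance:", theta.var())  # ~ Var(X - Y) ~ 0.01
```

Both estimators are unbiased for \(\mathbb EX\); the variance drops by roughly the factor \(\mathrm{Var}(X - Y)/\mathrm{Var}(X)\).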
We can turn this into an algorithm: let \(X\) be the stochastic gradient at the current iterate \(\bm w_j\), and let \(Y\) be the stochastic gradient at a snapshot of the iterate, \(\bm w_{\lfloor \frac jm \rfloor m}\). We pick \(m\) large enough that the cost of computing the full gradient at each \(\bm w_{im}\) is acceptable; these full gradients serve as \(\mathbb EY\).
Intuitively, we use the difference between the stochastic gradient and the full gradient at the snapshot \(X_0\) to estimate the same difference at the current point \(X_t\). In the figure, red and orange are the two stochastic gradients, black is the known full gradient, and green is the one to be estimated; we take black + orange − red. The question is then no longer how far orange deviates from green in angle and length, but how much the deviation at \(X_0\) differs from the deviation at \(X_t\).
The complete algorithm is as follows:
SVRG algorithm
- For \(s = 1, 2, \ldots\)
  - \(\widetilde {\bm w} = \widetilde {\bm w}_{s-1}\)
  - \(\widetilde {\bm u} = \frac 1N\sum_{i=1}^N \nabla l_i(\widetilde {\bm w}) = \nabla f(\widetilde {\bm w})\)
  - \({\bm w}_0 = \widetilde {\bm w}\)
  - For \(t = 1, 2, \ldots, m\)
    - Sample \(i \in [N]\) uniformly at random
    - \({\bm w}_t = {\bm w}_{t-1} - \eta(\nabla l_i({\bm w}_{t-1}) - \nabla l_i(\widetilde {\bm w}) + \widetilde {\bm u})\)
  - Set \(\widetilde {\bm w}_s = {\bm w}_t\) for a uniformly random \(t \in \{0, 1, \ldots, m - 1\}\).
The random selection in the last step is there purely so the analysis can argue in terms of \(\mathbb E\). If you take \(\widetilde {\bm w}_m\) instead, nobody has managed to prove anything..
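A minimal runnable sketch of the loop above in Python (my own code, with an illustrative least-squares objective; `svrg`, `grad`, and the step-size/epoch choices are assumptions, not from the original post):

```python
import numpy as np

def svrg(grad_i, w0, N, eta, m, epochs, rng):
    """SVRG as described above; grad_i(w, i) is the gradient of l_i at w."""
    w_tilde = w0.copy()
    for s in range(epochs):
        # Full gradient at the snapshot: u_tilde = grad f(w_tilde)
        u_tilde = sum(grad_i(w_tilde, i) for i in range(N)) / N
        w = w_tilde.copy()
        iterates = [w.copy()]
        for t in range(m):
            i = rng.integers(N)
            # Variance-reduced gradient estimate v_t
            v = grad_i(w, i) - grad_i(w_tilde, i) + u_tilde
            w = w - eta * v
            iterates.append(w.copy())
        # Next snapshot: a uniformly random iterate from {w_0, ..., w_{m-1}}
        w_tilde = iterates[rng.integers(m)]
    return w_tilde

# Illustrative least-squares problem: l_i(w) = (a_i . w - b_i)^2 / 2
rng = np.random.default_rng(0)
N, d = 200, 5
A = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
b = A @ w_true
grad = lambda w, i: (A[i] @ w - b[i]) * A[i]

w_hat = svrg(grad, np.zeros(d), N=N, eta=0.01, m=400, epochs=30, rng=rng)
print(np.linalg.norm(w_hat - w_true))  # should be close to 0
```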
Here is the convergence analysis for the \(\mu\)-strongly convex case.
Let \(\bm v_t = \nabla l_i(\bm w_{t-1}) - \nabla l_i(\widetilde {\bm w}) + \widetilde {\bm u}\).
Define
\[g_i(\bm w) = l_i(\bm w) - l_i(\bm w^*) - \langle \nabla l_i(\bm w^*), \bm w - \bm w^* \rangle
\]
Since \(l_i(\bm w)\) is convex, \(g_i(\bm w)\) is convex too, and a direct computation shows \(\nabla g_i(\bm w^*) = 0\), so \(g_i(\bm w^*) = \min_{\bm w} g_i(\bm w)\) (and in fact \(g_i(\bm w^*) = 0\)).
\[\begin{aligned}
0 = g_i(\bm w^*) &\leq \min_{\eta} \{g_i(\bm w - \eta \nabla g_i(\bm w))\} \\
&\leq \min_\eta\{g_i(\bm w) - \eta \|\nabla g_i(\bm w)\|_2^2 + \frac L2 \eta^2 \|\nabla g_i(\bm w)\|_2^2\} && \text{Smoothness} \\
&= g_i(\bm w) - \frac 1{2L} \|\nabla g_i(\bm w)\|_2^2
\end{aligned}\]
This step is just the smoothness upper bound minimized at the vertex of the quadratic (\(\eta = \frac 1L\)). Rearranging gives
\[\|\nabla l_i(\bm w) - \nabla l_i(\bm w^*)\|_2^2 \leq 2L(l_i(\bm w) - l_i(\bm w^*) - \langle\nabla l_i(\bm w^*), \bm w - \bm w^*\rangle)
\]
This inequality looks a bit like a Lipschitz condition, except that the left-hand side is a squared norm.
Averaging over all the \(l_i\) (equivalently, taking the expectation over a uniformly random index \(i\)) gives
\[\mathbb E \|\nabla l_i(\bm w) - \nabla l_i(\bm w^*)\|_2^2 \leq 2L(f(\bm w) - f(\bm w^*))
\]
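Spelling out the averaging step: the inner-product terms average to zero at the optimum,
\[\frac 1N\sum_{i=1}^N \langle \nabla l_i(\bm w^*), \bm w - \bm w^* \rangle = \langle \nabla f(\bm w^*), \bm w - \bm w^* \rangle = 0
\]
which is why only \(2L(f(\bm w) - f(\bm w^*))\) survives on the right-hand side.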
\[\begin{aligned}
\mathbb E\|\bm v_t\|^2_2
&\leq 2 \mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2 \mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*) - \nabla f(\widetilde {\bm w})\|^2_2 && \|a + b\|^2_2 \leq 2\|a\|^2_2 + 2\|b\|^2_2 \\
&\leq 2 \mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2 \mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*) - \mathbb E(\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*))\|^2_2 && \nabla f(\bm w^*) = 0 \\
&\leq 2\mathbb E\|\nabla l_i(\bm w_{t-1}) - \nabla l_i(\bm w^*)\|^2_2 + 2\mathbb E\|\nabla l_i(\widetilde {\bm w}) - \nabla l_i(\bm w^*)\|^2_2 && \mathbb E\|X - \mathbb EX\|^2_2 = \mathbb E \|X\|^2_2 - \|\mathbb EX\|^2_2 \leq \mathbb E\|X\|_2^2 \\
&\leq 4L(f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*))
\end{aligned}\]
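The next chain substitutes \(\mathbb E \bm v_t = \nabla f(\bm w_{t-1})\), i.e., \(\bm v_t\) is an unbiased gradient estimate (expectation over \(i\), conditioned on \(\bm w_{t-1}\)); a one-line check:
\[\mathbb E \bm v_t = \mathbb E\nabla l_i(\bm w_{t-1}) - \mathbb E\nabla l_i(\widetilde {\bm w}) + \widetilde {\bm u} = \nabla f(\bm w_{t-1}) - \nabla f(\widetilde {\bm w}) + \nabla f(\widetilde {\bm w}) = \nabla f(\bm w_{t-1})
\]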
\[\begin{aligned}
\mathbb E\|\bm w_t - \bm w^*\|_2^2 &= \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta\langle \bm w_{t-1} - \bm w^*, \mathbb E \bm v_t \rangle + \eta^2 \mathbb E\|\bm v_t\|_2^2 \\
&\leq \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta\langle \bm w_{t-1} - \bm w^*, \nabla f(\bm w_{t-1})\rangle + 4L\eta^2 (f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*)) \\
&\leq \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta(f(\bm w_{t-1}) - f(\bm w^*)) + 4L\eta^2 (f(\bm w_{t-1}) - f(\bm w^*) + f(\widetilde{\bm w}) - f(\bm w^*)) && \text{Convexity} \\
&= \|\bm w_{t-1} - \bm w^*\|_2^2 - 2 \eta(1 - 2L \eta)(f(\bm w_{t-1}) - f(\bm w^*)) + 4L \eta^2 (f(\widetilde{\bm w}) - f(\bm w^*))
\end{aligned}\]
The term \(4L \eta^2 (f(\widetilde{\bm w}) - f(\bm w^*))\) is the error term; it shrinks as \(\widetilde{\bm w} \to \bm w^*\). Summing the inequality over \(t = 1, \ldots, m\) and telescoping:
\[\begin{aligned}
\mathbb E\|\bm w_m - \bm w^*\|_2^2 &\leq \mathbb E \|\bm w_0 - \bm w^*\|_2^2 - 2\eta(1 - 2 L \eta)\sum_{t=1}^m \mathbb E(f(\bm w_{t-1}) - f(\bm w^*)) + 4mL\eta^2(f(\widetilde{\bm w}) - f(\bm w^*)) \\
&= \mathbb E \|\bm w_0 - \bm w^*\|_2^2 - 2m\eta(1 - 2 L \eta)\mathbb E(f(\widetilde {\bm w}_s) - f(\bm w^*)) + 4mL\eta^2(f(\widetilde{\bm w}) - f(\bm w^*)) && \widetilde {\bm w}_s \text{ uniform over } \{\bm w_0, \ldots, \bm w_{m-1}\}
\end{aligned}\]
We now have the terms we want: a linear combination of \(\mathbb E(f(\widetilde {\bm w}_s) - f(\bm w^*))\) and \(f(\widetilde{\bm w}) - f(\bm w^*)\). The left-hand side can simply be bounded below by \(0\), so it only remains to absorb \(\mathbb E \|\bm w_0 - \bm w^*\|_2^2\) and we are done.
\[\begin{aligned}\mathbb E \|\bm w_0 - \bm w^*\|_2^2 = \mathbb E \|\widetilde{\bm w} - \bm w^*\|_2^2 \leq \frac 2\mu \mathbb E(f(\widetilde{\bm w}) - f(\bm w^*)) && \text{Strong convexity} + \nabla f(\bm w^*) = 0\end{aligned}
\]
Plugging this in, bounding the left-hand side below by \(0\), and rearranging (recall \(\widetilde{\bm w} = \widetilde{\bm w}_{s-1}\)):
\[\begin{aligned}
\mathbb E(f(\widetilde{\bm w}_s) - f(\bm w^*)) \leq \left(\frac 1{m\mu \eta (1 - 2L \eta)} + \frac {2L\eta}{1 - 2L\eta}\right) \mathbb E(f(\widetilde{\bm w}_{s-1}) - f(\bm w^*))
\end{aligned}\]
Hence, when \(\eta < \frac 1{4L}\) and \(m\) is large enough, the contraction factor is \(< 1\), and SVRG converges linearly.
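For a concrete feel (numbers of my own choosing): with \(\eta = \frac{0.1}{L}\) and \(m = \frac{50L}{\mu}\), we get \(m\mu\eta = 5\) and \(1 - 2L\eta = 0.8\), so
\[\frac 1{m\mu \eta (1 - 2L \eta)} + \frac {2L\eta}{1 - 2L\eta} = \frac 1{5 \times 0.8} + \frac {0.2}{0.8} = 0.25 + 0.25 = 0.5
\]
i.e., each outer iteration halves the expected suboptimality.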