
Optimization in Machine Learning: Chapter 2 Gradient Descent (1)

1. Step of Gradient Descent

\[\begin{equation} x_{t+1} = x_t-\gamma \nabla f(x_t) \end{equation} \]
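
As a concrete illustration of this update rule, here is a minimal sketch in Python. The `gradient_descent` helper and the quadratic test function are our own illustrative choices, not part of the notes.

```python
# A minimal sketch of the update x_{t+1} = x_t - gamma * grad f(x_t).
import numpy as np

def gradient_descent(grad_f, x0, gamma, T):
    """Run T steps of gradient descent with constant stepsize gamma."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(T):
        x = x - gamma * grad_f(x)   # the update rule above
        iterates.append(x.copy())
    return iterates

# Toy example: f(x) = 0.5 * ||x||^2, so grad f(x) = x and x* = 0.
xs = gradient_descent(lambda x: x, x0=np.array([3.0, -4.0]), gamma=0.1, T=50)
print(xs[-1])  # close to the minimizer [0, 0]
```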

2. Vanilla Analysis

Let \(g_t = \nabla f(x_t)\). Rearranging the update rule gives:

\[\begin{equation} g_t = (x_t-x_{t+1})/\gamma \end{equation} \]

Taking the inner product with \(x_t - x^*\) on both sides:

\[\begin{equation} g_t^T(x_t-x^*) = \frac{1}{\gamma}(x_t-x_{t+1})^T(x_t-x^*) \end{equation} \]

Recall the basic vector identity \(2v^Tw = ||v||^2+||w||^2-||v-w||^2\). Applying it with \(v = x_t-x_{t+1}\) and \(w = x_t-x^*\), we obtain:

\[\begin{align} g_t^T(x_t-x^*)&= \frac{1}{\gamma}(x_t-x_{t+1})^T(x_t-x^*)\\ &=\frac{1}{2\gamma}[||x_t-x_{t+1}||^2+||x_t-x^*||^2-||x_{t+1}-x^*||^2]\\ &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x_{t+1}-x^*||^2]\\ &=\frac{\gamma}{2}||g_t||^2+\frac{1}{2\gamma}[||x_t-x^*||^2-||x_{t+1}-x^*||^2] \end{align} \]

Summing over \(t = 0,\dots,T-1\), the last two terms telescope:

\[\begin{align} \sum_{t=0}^{T-1}g_t^T(x_t-x^*)&= \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}[||x_0-x^*||^2-||x_{T}-x^*||^2]\\ &\leq \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}||x_0-x^*||^2 \end{align} \]

Now use convexity (the first-order condition): \(f(y)\geq f(x)+\nabla f(x)^T(y-x)\). With \(x = x_t\) and \(y = x^*\) this gives:

\[\begin{align} f(x_t)-f(x^*)\leq g_t^T(x_t-x^*) \end{align} \]

Combining this with the summed bound above:

\[\begin{align} \sum_{t=0}^{T-1}(f(x_t)-f(x^*)) &\leq \frac{\gamma}{2}\sum_{t=0}^{T-1}||g_t||^2+\frac{1}{2\gamma}||x_0-x^*||^2 \end{align} \]

Dividing both sides by \(T\) turns this into an upper bound on the average error \(\frac{1}{T}\sum_{t=0}^{T-1}(f(x_t)-f(x^*))\).
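
The cumulative bound above holds for any stepsize \(\gamma>0\) whenever \(f\) is convex, so it can be checked numerically. The sketch below does this for the toy function \(f(x)=\frac{1}{2}||x||^2\) (again our own choice, not from the notes).

```python
# Numerically checking: sum_t (f(x_t) - f(x*)) <= gamma/2 * sum_t ||g_t||^2 + ||x_0 - x*||^2 / (2*gamma)
import numpy as np

f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x

gamma, T = 0.1, 100
x = np.array([3.0, -4.0])
x_star = np.zeros(2)           # minimizer of this toy f

lhs = 0.0                      # running sum of f(x_t) - f(x*)
grad_sq = 0.0                  # running sum of ||g_t||^2
r0_sq = np.dot(x - x_star, x - x_star)

for _ in range(T):
    g = grad_f(x)
    lhs += f(x) - f(x_star)
    grad_sq += np.dot(g, g)
    x = x - gamma * g

rhs = gamma / 2 * grad_sq + r0_sq / (2 * gamma)
print(lhs <= rhs, lhs, rhs)    # the inequality should hold
```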

3. Lipschitz Convex Functions: \(O(1/\epsilon^2)\) steps

\(\bf \large \text{Theorem }2.1\):
Let \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) be convex and differentiable with a global minimum \(x^*\). Suppose that \(||x_0-x^*||\leq R\) and that \(||\nabla f(x)||\leq B\) for all \(x\). Choosing the stepsize \(\gamma = \frac{R}{B\sqrt{T}}\), gradient descent yields:

\[\begin{align} \frac{1}{T}\sum_{t=0}^{T-1}(f(x_t)-f(x^*))\leq \frac{RB}{\sqrt{T}} \end{align} \]

\(\large\bf Proof:\)
Plug the assumptions \(||g_t||\leq B\) and \(||x_0-x^*||\leq R\) into the bound from the vanilla analysis. With \(\gamma = \frac{R}{B\sqrt{T}}\) the two terms on the right-hand side are equal and sum to \(RB\sqrt{T}\); dividing by \(T\) gives the claim.
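
Theorem 2.1 can also be sanity-checked numerically. The sketch below uses \(f(x)=\sqrt{1+||x||^2}\), a convex differentiable function of our own choosing with \(||\nabla f(x)||\leq 1\) everywhere (so \(B=1\)) and minimizer \(x^*=0\); the average suboptimality should stay below \(RB/\sqrt{T}\).

```python
# Sanity check of Theorem 2.1 on a Lipschitz convex toy function.
import numpy as np

f = lambda x: np.sqrt(1.0 + np.dot(x, x))
grad_f = lambda x: x / np.sqrt(1.0 + np.dot(x, x))   # gradient norm is always < 1

x0 = np.array([3.0, -4.0])
x_star = np.zeros(2)
B, T = 1.0, 1000
R = np.linalg.norm(x0 - x_star)
gamma = R / (B * np.sqrt(T))        # stepsize from the theorem

x, gap_sum = x0.copy(), 0.0
for _ in range(T):
    gap_sum += f(x) - f(x_star)
    x = x - gamma * grad_f(x)

print(gap_sum / T, R * B / np.sqrt(T))   # average gap should not exceed R*B/sqrt(T)
```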

4. Smooth Convex functions: \(O(1/\epsilon)\) steps

\(\bf \large \text{Definition }2.2\): \(f\) is smooth with parameter \(L\) if, for all \(x, y\):

\[\begin{align} f(y) \leq f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}||x-y||^2 \end{align} \]

For example, all quadratic functions of the form \(f(x) = x^TQx+b^Tx+c\) are smooth.
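
The following sketch checks the smoothness inequality for a random quadratic. The particular value \(L = ||Q+Q^T||_2\) is our own computation (the notes only state that quadratics are smooth); it works because \(f(y)-f(x)-\nabla f(x)^T(y-x) = (y-x)^TQ(y-x)\).

```python
# Checking the smoothness inequality for a random quadratic f(x) = x^T Q x + b^T x + c.
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, b, c = rng.standard_normal((d, d)), rng.standard_normal(d), 1.0

f = lambda x: x @ Q @ x + b @ x + c
grad_f = lambda x: (Q + Q.T) @ x + b
L = np.linalg.norm(Q + Q.T, 2)          # spectral norm, assumed smoothness parameter

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    lhs = f(y)
    rhs = f(x) + grad_f(x) @ (y - x) + L / 2 * np.dot(y - x, y - x)
    assert lhs <= rhs + 1e-9
print("smoothness inequality holds on all sampled pairs")
```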

\(\bf\large \text{Lemma } 2.4:\)
Let \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) be convex and differentiable. The following two statements are equivalent:

\[\begin{align} &(i)\ f\text{ is smooth with parameter }L;\\ &(ii)\ ||\nabla f(y)-\nabla f(x)||\leq L||x-y|| \text{ for all }x,y\in\mathbb{R}^d. \end{align} \]
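
Condition (ii) can be illustrated on a quadratic \(f(x)=x^TQx+b^Tx+c\): its gradient \(x\mapsto(Q+Q^T)x+b\) is Lipschitz with the same constant \(L=||Q+Q^T||_2\) used in the previous sketch. The check below is ours, not from the notes.

```python
# Quick numerical check of the gradient-Lipschitz condition (ii) for a quadratic.
import numpy as np

rng = np.random.default_rng(2)
d = 5
Q, b = rng.standard_normal((d, d)), rng.standard_normal(d)

grad_f = lambda x: (Q + Q.T) @ x + b
L = np.linalg.norm(Q + Q.T, 2)

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    assert np.linalg.norm(grad_f(y) - grad_f(x)) <= L * np.linalg.norm(y - x) + 1e-9
print("gradient is L-Lipschitz on all sampled pairs")
```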

\(\bf \large \text{Lemma }2.6:\)
Let \(f:\mathbb{R}^d\rightarrow\mathbb{R}\) be differentiable and smooth with parameter \(L\). Choosing \(\gamma= \frac{1}{L}\), gradient descent yields:

\[\begin{align} f(x_{t+1})\leq f(x_t)-\frac{1}{2L}||\nabla f(x_t)||^2 \end{align} \]

\(\bf \large Proof:\)
With \(\gamma = \frac{1}{L}\), the update step is

\[\begin{align} x_{t+1} = x_t-\frac{1}{L}\nabla f(x_t) \end{align} \]

By the smoothness definition, and using \(\nabla f(x_t) = L(x_t-x_{t+1})\):

\[\begin{align} f(x_{t+1}) &\leq f(x_t)+\nabla f(x_t)^T(x_{t+1}-x_t)+\frac{L}{2}||x_t-x_{t+1}||^2\\ &= f(x_t)+L(x_t-x_{t+1})^T(x_{t+1}-x_t)+\frac{L}{2}||x_t-x_{t+1}||^2\\ &= f(x_t)-\frac{L}{2}||x_t-x_{t+1}||^2\\ &= f(x_t)-\frac{L}{2}\cdot\frac{1}{L^2}||\nabla f(x_t)||^2\\ &= f(x_t)-\frac{1}{2L}||\nabla f(x_t)||^2 \end{align} \]
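
The sufficient-decrease guarantee of Lemma 2.6 can be verified step by step. The convex quadratic \(f(x)=\frac{1}{2}x^TAx\) with \(A\) positive semidefinite is our own test case; it is smooth with \(L=||A||_2\).

```python
# Checking Lemma 2.6: with gamma = 1/L, f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L).
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
A = M.T @ M                      # symmetric PSD, so f is convex

f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
L = np.linalg.norm(A, 2)         # smoothness parameter of this quadratic

x = rng.standard_normal(d)
for _ in range(50):
    g = grad_f(x)
    x_next = x - g / L           # gamma = 1/L
    assert f(x_next) <= f(x) - np.dot(g, g) / (2 * L) + 1e-12
    x = x_next
print("sufficient decrease holds at every step")
```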

posted on 2022-04-25 06:12 Blackzxy