Optimization in Machine Learning: Chapter 2, Gradient Descent (1)
1. Step of Gradient Descent

Starting from an arbitrary \(x_0 \in \mathbb{R}^d\), gradient descent repeatedly applies the update

\[x_{t+1} = x_t - \gamma \nabla f(x_t), \qquad t = 0, 1, 2, \dots\]

where \(\gamma > 0\) is a fixed stepsize.
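As a concrete illustration (a minimal sketch added here, not part of the original notes; the quadratic test function, starting point, and stepsize are arbitrary choices), the update rule takes only a few lines of Python:

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, T):
    """Run T steps of x_{t+1} = x_t - gamma * grad_f(x_t) and return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs

# Example: f(x) = 0.5 * ||x||^2, whose gradient is x and whose minimizer is the origin.
iterates = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]), gamma=0.1, T=50)
print(iterates[-1])  # close to [0, 0]
```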
2. Vanilla Analysis
Let \(g_t := \nabla f(x_t)\). The update rule can be rewritten as

\[g_t = \frac{1}{\gamma}\left(x_t - x_{t+1}\right),\]

hence we get

\[g_t^\top (x_t - x^*) = \frac{1}{\gamma}\left(x_t - x_{t+1}\right)^\top (x_t - x^*).\]

Basic vector equation: \(2v^\top w = \|v\|^2 + \|w\|^2 - \|v - w\|^2\). Applying it with \(v = x_t - x_{t+1}\) and \(w = x_t - x^*\), we obtain

\[g_t^\top (x_t - x^*) = \frac{1}{2\gamma}\left(\|x_t - x_{t+1}\|^2 + \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) = \frac{\gamma}{2}\|g_t\|^2 + \frac{1}{2\gamma}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right),\]

where we used \(\|x_t - x_{t+1}\| = \gamma\|g_t\|\). Summing over \(t = 0, \dots, T-1\), the last term telescopes:

\[\sum_{t=0}^{T-1} g_t^\top (x_t - x^*) = \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\left(\|x_0 - x^*\|^2 - \|x_T - x^*\|^2\right) \le \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\|x_0 - x^*\|^2.\]

Then we take convexity into consideration: \(f(y) \ge f(x) + \nabla f(x)^\top (y - x)\) for all \(x, y\). With \(x = x_t\) and \(y = x^*\) this gives

\[f(x_t) - f(x^*) \le g_t^\top (x_t - x^*).\]

Combining this with the summed inequality above:

\[\sum_{t=0}^{T-1}\left(f(x_t) - f(x^*)\right) \le \frac{\gamma}{2}\sum_{t=0}^{T-1}\|g_t\|^2 + \frac{1}{2\gamma}\|x_0 - x^*\|^2.\]

Dividing by \(T\), this gives us an upper bound for the average error \(\frac{1}{T}\sum_{t=0}^{T-1}\left(f(x_t) - f(x^*)\right)\).
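The inequality can be checked numerically (a small sanity check added here, not from the notes; the quadratic test function and the values of \(\gamma\), \(T\), and \(x_0\) are arbitrary choices):

```python
import numpy as np

# Check the summed bound on f(x) = 0.5 * ||x||^2, which is convex with minimizer x* = 0.
f = lambda x: 0.5 * np.dot(x, x)
grad = lambda x: x

gamma, T = 0.1, 50
x0, x_star = np.array([3.0, -2.0]), np.zeros(2)
x = x0.copy()
lhs, grad_sq_sum = 0.0, 0.0
for _ in range(T):
    g = grad(x)
    lhs += f(x) - f(x_star)        # sum_t (f(x_t) - f(x*))
    grad_sq_sum += np.dot(g, g)    # sum_t ||g_t||^2
    x = x - gamma * g
rhs = 0.5 * gamma * grad_sq_sum + np.dot(x0 - x_star, x0 - x_star) / (2.0 * gamma)
print(lhs <= rhs)  # True: the bound from the vanilla analysis holds
```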
3. Lipschitz Convex Functions: \(O(1/\epsilon^2)\) Steps
\(\bf \large \text{Theorem }2.1\):
Let \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) be convex and differentiable with a global minimum \(x^*\). Suppose that \(\|x_0 - x^*\|\leq R\) and that \(\|\nabla f(x)\|\leq B\) for all \(x\). Choosing the stepsize \(\gamma = \frac{R}{B\sqrt{T}}\), gradient descent yields

\[\frac{1}{T}\sum_{t=0}^{T-1}\left(f(x_t) - f(x^*)\right) \le \frac{RB}{\sqrt{T}}.\]
\(\bf \large \text{Proof:}\)
Plugging \(\|g_t\| \le B\) and \(\|x_0 - x^*\| \le R\) into the average-error bound from the vanilla analysis gives

\[\frac{1}{T}\sum_{t=0}^{T-1}\left(f(x_t) - f(x^*)\right) \le \frac{\gamma B^2}{2} + \frac{R^2}{2\gamma T},\]

and the stepsize \(\gamma = \frac{R}{B\sqrt{T}}\) makes both terms equal to \(\frac{RB}{2\sqrt{T}}\), which gives the claimed bound. In particular, an average error of at most \(\epsilon\) is guaranteed after \(T \ge \left(\frac{RB}{\epsilon}\right)^2\) steps, which is the \(O(1/\epsilon^2)\) rate in the section title.
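The theorem is easy to observe numerically (a sketch added here; the 1-Lipschitz convex test function \(f(x)=\sqrt{x^2+1}-1\) and the values of \(R\) and \(T\) are my own choices, not from the notes):

```python
import numpy as np

# f(x) = sqrt(x^2 + 1) - 1 is convex, differentiable, 1-Lipschitz (B = 1), minimized at x* = 0.
f = lambda x: np.sqrt(x**2 + 1.0) - 1.0
grad = lambda x: x / np.sqrt(x**2 + 1.0)

R, B, T = 4.0, 1.0, 400
gamma = R / (B * np.sqrt(T))           # stepsize from Theorem 2.1
x, errors = R, []                      # start at distance R from x* = 0
for _ in range(T):
    errors.append(f(x) - f(0.0))
    x -= gamma * grad(x)
print(sum(errors) / T, R * B / np.sqrt(T))  # the average error stays below R*B/sqrt(T)
```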
4. Smooth Convex Functions: \(O(1/\epsilon)\) Steps
\(\bf \large \text{Definition }2.2\) \(\text{(Smooth with parameter }L\text{)}\): a differentiable function \(f:\mathbb{R}^d\rightarrow\mathbb{R}\) is smooth with parameter \(L\) if

\[f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\,\|y - x\|^2 \qquad \text{for all } x, y \in \mathbb{R}^d.\]
More generally, all quadratic functions of the form \(f(x) = x^\top Qx + b^\top x + c\) are \({\bf smooth}\).
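For concreteness (a worked computation added here, not spelled out in the notes; \(\|\cdot\|\) applied to a matrix denotes the spectral norm), the smoothness parameter of a quadratic can be read off from an exact expansion: writing \(h = y - x\) and using \(\nabla f(x) = (Q + Q^\top)x + b\),

\[f(y) = f(x) + \nabla f(x)^\top h + \frac{1}{2}\,h^\top (Q + Q^\top)\, h \le f(x) + \nabla f(x)^\top h + \frac{\|Q + Q^\top\|}{2}\,\|h\|^2,\]

so \(f\) is smooth with parameter \(L = \|Q + Q^\top\|\), which equals \(2\|Q\|\) when \(Q\) is symmetric.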
\(\bf\large \text{Lemma } 2.4:\)
Let \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) be convex and differentiable. The following two statements are equivalent:

(i) \(f\) is smooth with parameter \(L\);

(ii) \(\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|\) for all \(x, y \in \mathbb{R}^d\).
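The two characterizations can be probed numerically (a rough sanity check added here, not a proof; the test function \(f(x)=\sqrt{x^2+1}\) with \(L = 1\) is my own choice):

```python
import numpy as np

# f(x) = sqrt(x^2 + 1) is convex and smooth with L = 1 (its second derivative is at most 1).
f = lambda x: np.sqrt(x**2 + 1.0)
grad = lambda x: x / np.sqrt(x**2 + 1.0)
L = 1.0

rng = np.random.default_rng(0)
for _ in range(10_000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    assert f(y) <= f(x) + grad(x) * (y - x) + 0.5 * L * (y - x)**2 + 1e-12   # smoothness
    assert abs(grad(x) - grad(y)) <= L * abs(x - y) + 1e-12                  # Lipschitz gradient
print("both characterizations hold on all sampled pairs")
```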
\(\bf \large \text{Lemma }2.6:\)
Let \(f:\mathbb{R}^d\rightarrow\mathbb{R}\) be differentiable and smooth with parameter \(L\). Choosing \(\gamma = \frac{1}{L}\), gradient descent yields

\[f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\,\|\nabla f(x_t)\|^2, \qquad t \ge 0.\]
\(\bf \large \text{Proof:}\)
With \(\gamma = \frac{1}{L}\), the update rule gives \(x_{t+1} - x_t = -\frac{1}{L}\nabla f(x_t)\). By the definition of smoothness (with \(x = x_t\) and \(y = x_{t+1}\)):

\[f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2}\,\|x_{t+1} - x_t\|^2 = f(x_t) - \frac{1}{L}\,\|\nabla f(x_t)\|^2 + \frac{1}{2L}\,\|\nabla f(x_t)\|^2 = f(x_t) - \frac{1}{2L}\,\|\nabla f(x_t)\|^2.\]
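The sufficient decrease property is also easy to observe numerically (a sketch on a quadratic of my own choosing, with \(L\) taken as the largest Hessian eigenvalue):

```python
import numpy as np

# f(x) = 0.5 * x^T A x with symmetric positive definite A; the Hessian is A, so L = lambda_max(A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()

x = np.array([4.0, -1.0])
for _ in range(20):
    g = grad(x)
    x_next = x - g / L                                   # gradient descent with gamma = 1/L
    # sufficient decrease: f(x_{t+1}) <= f(x_t) - ||grad f(x_t)||^2 / (2L)
    assert f(x_next) <= f(x) - np.dot(g, g) / (2.0 * L) + 1e-12
    x = x_next
print("sufficient decrease held at every step")
```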