Optimization in Machine Learning: Chapter 3 Projected Gradient Descent (2)

1. Smooth and strongly convex functions: \(O(\log(1/\epsilon))\) steps

\(\large\textbf{Theorem 3.5}\):
Let \(f:\mathbf{dom}(f) \rightarrow \mathbb{R}\) be convex and differentiable, smooth with parameter \(L\) and strongly convex with parameter \(\mu>0\). Choosing the stepsize

\[\gamma = \frac{1}{L} \]

projected gradient descent with arbitrary \(x_0\) satisfies the following two properties:

  • \(\text{(i) Squared distances to }x^*\text{ are geometrically decreasing:}\)

\[\begin{align} ||x_{t+1}-x^*||^2\leq (1-\frac{\mu}{L})||x_t-x^*||^2 \end{align} \]

\(\textbf{Proof:}\)
By projected gradient descent, with \(g_t := \nabla f(x_t)\):

\[\begin{align} y_{t+1} &= x_t-\gamma g_t\\ x_{t+1} &= \prod_X(y_{t+1}) = \arg\min_{x\in X}||x-y_{t+1}||^2 \end{align} \]

We also have the following projection inequality, valid for all \(x\in X\):

\[||x-\prod_X(y)||^2+||y-\prod_X(y)||^2\leq ||x-y||^2 \]
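One way to see this (a standard argument, stated here as a reminder rather than taken from the notes): the first-order optimality condition of the projection gives \((x-\prod_X(y))^T(y-\prod_X(y))\leq 0\) for every \(x\in X\), hence

\[||x-y||^2 = ||x-\prod_X(y)||^2+||\prod_X(y)-y||^2+2(x-\prod_X(y))^T(\prod_X(y)-y)\geq ||x-\prod_X(y)||^2+||y-\prod_X(y)||^2 \]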

Therefore, using \(2a^Tb = ||a||^2+||b||^2-||a-b||^2\) with \(a = x_t-y_{t+1}\) and \(b = x_t-x^*\):

\[\begin{align} g_t^T(x_t-x^*) &= \frac{1}{\gamma}(x_t-y_{t+1})^T(x_t-x^*)\\ &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2] \end{align} \]

By the projection inequality above, applied with \(x = x^*\) and \(y = y_{t+1}\):

\[||x^*-x_{t+1}||^2+||y_{t+1}-x_{t+1}||^2\leq ||x^*-y_{t+1}||^2 \]

Hence, continuing the computation of \(g_t^T(x_t-x^*)\):

\[\begin{align} g_t^T(x_t-x^*) &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]\\ &\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2] \end{align} \]

Then by strong convexity (which we will apply with \(x = x_t\), \(y = x^*\), so that \(\nabla f(x) = g_t\)):

\[f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}||x-y||^2 \]

Hence:

\[\begin{align} f(x_t)-f(x^*)&\leq g_t^T(x_t-x^*)-\frac{\mu}{2}||x_t-x^*||^2\\ &\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]-\frac{\mu}{2}||x_t-x^*||^2 \end{align} \]

This bound involves \(||x_{t+1}-x^*||^2\), which we now isolate by multiplying by \(2\gamma\) and rearranging:

\[||x_{t+1}-x^*||^2\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2 \]

\(\textbf{Recall Lemma 3.3 from the previous chapter:}\)

\[\begin{align} f(x_{t+1})\leq f(x_t)-\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2 \end{align} \]

Thus, since \(f(x^*)\leq f(x_{t+1})\):

\[\begin{align} f(x^*)-f(x_t)\leq f(x_{t+1})-f(x_t)\leq -\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2 \end{align} \]

Then, combining the two bounds and finally plugging in \(\gamma = 1/L\), which makes the \(||g_t||^2\) and \(||y_{t+1}-x_{t+1}||^2\) terms cancel:

\[\begin{align} ||x_{t+1}-x^*||^2&\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2\\ &\leq \gamma^2||g_t||^2+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2+\gamma[-\frac{1}{L}||g_t||^2+L||y_{t+1}-x_{t+1}||^2]\\ &= (1-\gamma\mu) ||x_t-x^*||^2 = \left(1-\frac{\mu}{L}\right)||x_t-x^*||^2 \end{align} \]

  • \(\text{(ii) Bound on the optimality gap after }T\text{ steps:}\)

\[\begin{align} f(x_T)-f(x^*)\leq ||\nabla f(x^*)||(1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2 \end{align} \]

\(\textbf{Proof:}\)
By smoothness, then the Cauchy–Schwarz inequality, and finally part (i):

\[\begin{align} f(x_T)-f(x^*)&\leq \nabla f(x^*)^T(x_T-x^*)+\frac{L}{2}||x^*-x_T||^2\\ &\leq ||\nabla f(x^*)||\cdot ||x_T-x^*||+\frac{L}{2}||x^*-x_T||^2\\ &\leq ||\nabla f(x^*)|| (1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2 \end{align} \]
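As a quick numerical illustration (not part of the original notes), the following sketch runs projected gradient descent with \(\gamma = 1/L\) on a strongly convex quadratic constrained to a Euclidean ball and prints the per-step contraction of \(||x_t-x^*||^2\), which should stay below \(1-\mu/L\) as property (i) predicts. The quadratic, the radius, and the helper names (`grad`, `project_ball`) are illustrative choices.

```python
import numpy as np

# Minimal numerical check of Theorem 3.5 (i): projected gradient descent
# with stepsize 1/L on a strongly convex quadratic, constrained to a
# Euclidean ball.  The quadratic, the radius R and all names below are
# illustrative choices, not taken from the notes.
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)              # symmetric positive definite Hessian
b = rng.standard_normal(d)

eig = np.linalg.eigvalsh(A)
mu, L = eig[0], eig[-1]              # strong convexity / smoothness parameters
gamma = 1.0 / L
R = 1.0

def grad(x):
    """Gradient of f(x) = 0.5 x^T A x - b^T x."""
    return A @ x - b

def project_ball(y, radius=R):
    """Euclidean projection onto the ball ||x|| <= radius."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

# Approximate the constrained minimizer x* by running many iterations.
x = np.zeros(d)
for _ in range(10_000):
    x = project_ball(x - gamma * grad(x))
x_star = x

# Property (i): ||x_{t+1}-x*||^2 <= (1 - mu/L) ||x_t-x*||^2 at every step.
x = rng.standard_normal(d)
for t in range(10):
    nxt = project_ball(x - gamma * grad(x))
    ratio = np.linalg.norm(nxt - x_star)**2 / np.linalg.norm(x - x_star)**2
    print(f"step {t}: ratio = {ratio:.4f}  (bound 1 - mu/L = {1 - mu/L:.4f})")
    x = nxt
```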

2. Projecting onto \(l_1\)-balls

\[X = B_1(R) = \{ x\in\mathbb{R}^d:||x||_1 = \sum_{i=1}^d|x_i|\leq R \} \]

\(\textbf{Fact 3.6}\) Let \(v\in\mathbb{R}^d\) be the point to be projected. We may assume without loss of generality that (i) \(R=1\), (ii) \(v_i\geq 0\) for all \(i\), and (iii) \(\sum_{i=1}^d v_i>1\).

\(\textbf{Fact 3.7}\) Under the assumptions of Fact \(3.6\), \(x = \prod_X(v)\) satisfies \(x_i\geq 0\) for all \(i\) and \(\sum_{i=1}^d x_i=1\).

\(\textbf{Corollary 3.8}\) Under the assumptions of Fact \(3.6\),

\[\prod_X(v) = \arg\min_{x\in \Delta_d}||x-v||^2 \]

where

\[\Delta_d = \{x\in\mathbb{R}^d:\sum_{i=1}^dx_i=1,x_i\geq 0 \} \]

is the standard simplex.

\(\textbf{Fact 3.9}\) We may assume that \(v_1\geq v_2\geq \dots\geq v_d\).

\(\large\textbf{Lemma 3.10}\) Let $x^* = \arg\min_{x\in\Delta_d}||x-v||^2 $. Under the assumption of Fact \(3.9\), there exists a unique \(p\in \{1,...,d \}\), such that:

\[\begin{align} x_i^*&>0, &i\leq p,\\ x_i^*&=0, &i>p. \end{align} \]

\(\textbf{Proof:}\)
Recall the first-order optimality condition from a previous lemma: for every \(x\) in the feasible set,

\[\nabla f(x^*)^T(x-x^*)\geq 0 \]

Therefore, applying this with \(f = d_v\):

\[\nabla d_v(x^*)^T(x-x^*) = 2(x^*-v)^T(x-x^*)\geq 0,\quad x\in \Delta_d, \]

where

\[d_v(z) = ||z-v||^2 \]

Because \(\sum_i x_i^*=1\), there is at least one positive entry in \(x^*\). We still need to show that we cannot have \(x_i^*=0\) and \(x_{i+1}^*>0\). Indeed, if this were the case, we could decrease \(x_{i+1}^*\) by some positive \(\epsilon\leq x_{i+1}^*\) and simultaneously increase \(x_i^*\) from \(0\) to \(\epsilon\), obtaining a vector \(x\in\Delta_d\) such that:

\[(x^*-v)^T(x-x^*) = (0-v_i)\epsilon -(x_{i+1}^*-v_{i+1})\epsilon = \epsilon(v_{i+1}-v_i-x_{i+1}^*)<0 \]

contradicting the optimality.

\(\large\textbf{Lemma 3.11}\) Under the assumption of Fact \(3.9\), and with \(p\) as in Lemma \(3.10\):

\[x_i^* = v_i-\Theta_p,\quad i\leq p, \]

where

\[\Theta_p = \frac{1}{p}\left(\sum_{i=1}^p v_i-1\right) \]
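The value of \(\Theta_p\) is forced by the normalization: summing \(x_i^* = v_i-\Theta_p\) over \(i\leq p\) and using \(\sum_{i=1}^d x_i^*=1\) from Fact \(3.7\) (together with \(x_i^*=0\) for \(i>p\) from Lemma \(3.10\)) gives

\[1 = \sum_{i=1}^p(v_i-\Theta_p) = \sum_{i=1}^p v_i - p\,\Theta_p\quad\Longrightarrow\quad \Theta_p = \frac{1}{p}\left(\sum_{i=1}^p v_i-1\right). \]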


\(\large\textbf{Lemma 3.12}\) Under the assumption of Fact \(3.9\), with \(x^*(p)\) defined as

\[x^*(p) = (v_1-\Theta_p,\dots,v_p-\Theta_p,0,\dots,0),\quad p\in\{1,\dots,d \}, \]

and with

\[p^* = \max\{p\in\{1,...,d\}:v_p-\frac{1}{p}(\sum_{i=1}^pv_i-1)>0 \} \]

it holds that:

\[\arg\min_{x\in\Delta_d}||x-v||^2 = x^*(p^*) \]
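Lemmas 3.10–3.12 translate directly into a sorting-based projection algorithm. Below is a minimal sketch of that procedure; the function names `project_simplex` and `project_l1_ball` are mine, and the second function implements the reduction of Fact 3.6 (rescale by \(R\), work with \(|v_i|\), and restore the signs at the end).

```python
import numpy as np

def project_simplex(v):
    """Projection of v onto the standard simplex, following Lemma 3.12:
    sort v in decreasing order, find the largest p with
    v_p - (1/p)*(sum_{i<=p} v_i - 1) > 0, and subtract Theta_{p*}."""
    u = np.sort(v)[::-1]                  # Fact 3.9: v_1 >= ... >= v_d
    cssv = np.cumsum(u)                   # prefix sums sum_{i<=p} v_i
    ps = np.arange(1, len(v) + 1)
    cond = u - (cssv - 1.0) / ps > 0
    p_star = ps[cond][-1]                 # largest p satisfying the condition
    theta = (cssv[p_star - 1] - 1.0) / p_star
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    """Projection onto {x : ||x||_1 <= R}, via the reduction of Fact 3.6:
    if v is already inside the ball it is its own projection; otherwise
    project |v|/R onto the simplex and restore scale and signs."""
    if np.abs(v).sum() <= R:
        return v.copy()
    x = project_simplex(np.abs(v) / R)
    return np.sign(v) * x * R

# Tiny usage example (illustrative numbers):
v = np.array([0.9, -2.0, 0.1, 1.5])
x = project_l1_ball(v, R=1.0)
print(x, np.abs(x).sum())                 # the l1-norm of the result is 1
```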

3. Proximal Gradient Descent

An important class of objective functions has the composite form

\[f(x) = g(x)+h(x) \]

where \(g\) is a "nice" (differentiable) function and \(h\) is possibly non-differentiable but simple. The classical gradient step for unconstrained minimization of \(g\) can be equivalently written as

\[\begin{align} x_{t+1}&=\arg\min_{y}g(x_t)+\nabla g(x_t)^T(y-x_t)+\frac{1}{2\gamma}||y-x_t||^2\\ &=\arg\min_{y} \frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2 \end{align} \]
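To see the equivalence, expand the square in the second expression:

\[\frac{1}{2\gamma}||y-(x_t-\gamma\nabla g(x_t))||^2 = \frac{1}{2\gamma}||y-x_t||^2+\nabla g(x_t)^T(y-x_t)+\frac{\gamma}{2}||\nabla g(x_t)||^2 \]

The last term, like \(g(x_t)\), does not depend on \(y\), so both expressions have the same minimizer.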

Proximal Gradient Algorithm

\(\text{Proximal Mapping:}\)

\[\text{prox}_{h,\gamma}(z) = \arg\min_y\{\frac{1}{2\gamma}||y-z||^2+h(y) \} \]
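As an example (standard, though not derived in these notes): for \(h(y)=\lambda||y||_1\) the minimization separates over coordinates, and the proximal mapping is the soft-thresholding operator

\[\text{prox}_{h,\gamma}(z)_i = \text{sign}(z_i)\,\max(|z_i|-\gamma\lambda,\,0). \]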

An iteration of \(\text{Proximal gradient descent}\) is defined as:

\[x_{t+1}=\text{prox}_{h,\gamma}(x_t-\gamma\nabla g(x_t)) \]
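To make the iteration concrete, here is a minimal sketch (not from the notes) of proximal gradient descent for \(g(x)=\frac{1}{2}||Ax-b||^2\) and \(h(x)=\lambda||x||_1\), using the soft-thresholding mapping from the example above; the data \(A,b\), the value of \(\lambda\), and the function names are illustrative assumptions.

```python
import numpy as np

def prox_l1(z, t):
    """Proximal mapping of h(y) = lam*||y||_1 with t = gamma*lam,
    i.e. the soft-thresholding operator from the example above."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(A, b, lam, gamma, steps=500):
    """Minimize g(x) + h(x) = 0.5*||Ax - b||^2 + lam*||x||_1 via
    x_{t+1} = prox_{h,gamma}(x_t - gamma * grad g(x_t))."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad_g = A.T @ (A @ x - b)          # gradient of the smooth part g
        x = prox_l1(x - gamma * grad_g, gamma * lam)
    return x

# Illustrative data; gamma = 1/L with L the smoothness constant of g,
# here the largest eigenvalue of A^T A.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A.T @ A)[-1]
x_hat = proximal_gradient(A, b, lam=0.5, gamma=1.0 / L)
print(x_hat)
```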

posted on 2022-05-19 19:53 Blackzxy