1. Smooth and strongly convex functions: \(O(\log(1/\epsilon))\) steps
\(\large\textbf{Theorem 3.5}\):
Let \(f:dom(f) \rightarrow \mathbb{R}\) be convex and differentiable, \(X\subseteq dom(f)\) a closed convex set, and \(x^* = \arg\min_{x\in X}f(x)\). Suppose \(f\) is smooth with parameter \(L\) and strongly convex with parameter \(\mu\) over \(X\). Choosing stepsize
\[\gamma = \frac{1}{L}
\]
projected gradient descent with arbitrary \(x_0\in X\) satisfies the following \(2\) properties:
- \(\text{(i) Squared distances to }x^*\text{ are geometrically decreasing:}\)
\[\begin{align}
||x_{t+1}-x^*||^2\leq (1-\frac{\mu}{L})||x_t-x^*||^2
\end{align}
\]
\(\textbf{Proof:}\)
By the definition of projected gradient descent, with \(g_t := \nabla f(x_t)\):
\[\begin{align}
y_{t+1} &= x_t-\gamma g_t\\
x_{t+1} &= \prod_X(y_{t+1}) = \arg\min_{x\in X}||x-y_{t+1}||^2
\end{align}
\]
We also use the following property of the projection: for all \(x\in X\) and any \(y\),
\[||x-\prod_X(y)||^2+||y-\prod_X(y)||^2\leq ||x-y||^2
\]
Therefore, using \(2v^Tw = ||v||^2+||w||^2-||v-w||^2\) with \(v = x_t-y_{t+1} = \gamma g_t\) and \(w = x_t-x^*\):
\[\begin{align}
g_t^T(x_t-x^*) &= \frac{1}{\gamma}(x_t-y_{t+1})^T(x_t-x^*)\\
&=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]
\end{align}
\]
Applying the projection inequality with \(x = x^*\) and \(y = y_{t+1}\):
\[||x^*-x_{t+1}||^2+||y_{t+1}-x_{t+1}||^2\leq ||x^*-y_{t+1}||^2
\]
Hence, plugging this into the expression for \(g_t^T(x_t-x^*)\) above:
\[\begin{align}
g_t^T(x_t-x^*) &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]\\
&\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]
\end{align}
\]
Then by strong convexity:
\[f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}||x-y||^2
\]
Applying this with \(x = x_t\), \(y = x^*\) and rearranging, then using the bound above:
\[\begin{align}
f(x_t)-f(x^*)&\leq g_t^T(x_t-x^*)-\frac{\mu}{2}||x_t-x^*||^2\\
&\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]-\frac{\mu}{2}||x_t-x^*||^2
\end{align}
\]
This bound contains \(||x_{t+1}-x^*||^2\), the quantity we want to control; rearranging for it gives:
\[||x_{t+1}-x^*||^2\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2
\]
\(\textbf{Recall Lemma 3.3 from the previous chapter:}\)
\[\begin{align}
f(x_{t+1})\leq f(x_t)-\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2
\end{align}
\]
Thus, since \(f(x^*)\leq f(x_{t+1})\):
\[\begin{align}
f(x^*)-f(x_t)\leq f(x_{t+1})-f(x_t)\leq -\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2
\end{align}
\]
Then:
\[\begin{align}
||x_{t+1}-x^*||^2&\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2\\
&\leq \gamma^2||g_t||^2+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2+\gamma[-\frac{1}{L}||g_t||^2+L||y_{t+1}-x_{t+1}||^2]\\
&= (1-\gamma\mu) ||x_t-x^*||^2
\end{align}
\]
where the last equality uses \(\gamma = 1/L\), so the \(||g_t||^2\) terms and the \(||y_{t+1}-x_{t+1}||^2\) terms cancel. Since \(1-\gamma\mu = 1-\frac{\mu}{L}\), this proves (i).
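As a quick sanity check of (i), here is a minimal numerical sketch (not part of the original argument): projected gradient descent with \(\gamma = 1/L\) on a separable strongly convex quadratic over a box, where the projection is a coordinate-wise clip. The specific quadratic, the box, and the helper names are illustrative assumptions.
```python
import numpy as np

# Illustrative setup: f(x) = 0.5 * x^T A x - b^T x with A = diag(mu, L),
# minimized over the box X = [0, 1]^2; projecting onto a box is a clip.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([2.0, 2.0])
gamma = 1.0 / L

def grad(x):
    return A @ x - b

def project(x):
    return np.clip(x, 0.0, 1.0)          # projection onto [0, 1]^2

# Since A is diagonal the problem separates per coordinate, so the
# constrained minimizer is the clip of the unconstrained one.
x_star = project(np.linalg.solve(A, b))   # = (1, 0.2)

x = np.array([0.0, 0.0])                  # arbitrary starting point in X
for t in range(20):
    x_next = project(x - gamma * grad(x))
    # squared distance to x* shrinks at least by the factor (1 - mu/L)
    assert np.sum((x_next - x_star)**2) <= (1 - mu / L) * np.sum((x - x_star)**2) + 1e-12
    x = x_next
```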
- \(\text{(ii) The absolute error after }T\text{ iterations is exponentially small:}\)
\[\begin{align}
f(x_T)-f(x^*)\leq ||\nabla f(x^*)||(1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2
\end{align}
\]
\(\textbf{Proof:}\)
By smoothness, then Cauchy-Schwarz, and finally property (i) applied \(T\) times:
\[\begin{align}
f(x_T)-f(x^*)&\leq \nabla f(x^*)^T(x_T-x^*)+\frac{L}{2}||x^*-x_T||^2\\
&\leq ||\nabla f(x^*)||\cdot ||x_T-x^*||+\frac{L}{2}||x^*-x_T||^2\\
&\leq ||\nabla f(x^*)|| (1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2
\end{align}
\]
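To connect (ii) with the \(O(\log(1/\epsilon))\) claim in the section title, here is a short added calculation, using the standard bounds \((1-\frac{\mu}{L})^T\leq(1-\frac{\mu}{L})^{T/2}\) and \(1-\frac{\mu}{L}\leq e^{-\mu/L}\); the constant \(C\) depends on \(||\nabla f(x^*)||\), \(||x_0-x^*||\) and \(L\):
\[\begin{align}
f(x_T)-f(x^*)&\leq \underbrace{\Big(||\nabla f(x^*)||\,||x_0-x^*||+\frac{L}{2}||x_0-x^*||^2\Big)}_{=:C}(1-\frac{\mu}{L})^{T/2}\leq C\, e^{-\frac{\mu T}{2L}}
\end{align}
\]
so \(f(x_T)-f(x^*)\leq \epsilon\) once \(T\geq \frac{2L}{\mu}\ln\frac{C}{\epsilon}\), i.e. after \(O(\frac{L}{\mu}\log\frac{1}{\epsilon})\) steps.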
2. Projecting onto \(l_1\)-balls
\[X = B_1(R) = \{ x\in\mathbb{R}^d:||x||_1 = \sum_{i=1}^d|x_i|\leq R \}
\]
\(\textbf{Fact 3.6}\) For computing \(\prod_X(v)\) of a given point \(v\in\mathbb{R}^d\), we may assume without loss of generality that \((i)\) \(R=1\), \((ii)\) \(v_i\geq 0\) for all \(i\), and \((iii)\) \(\sum_{i=1}^dv_i>1\).
\(\textbf{Fact 3.7}\) Under assumptions of Fact \(3.6\), \(x = \prod_X(v)\) satisfies \(x_i\geq 0\) for all \(i\) and $\sum_{i=1}^dx_i=1 $
\(\textbf{Corollary 3.8}\) Under the assumptions of Fact \(3.6\),
\[\prod_X(v) = \arg\min_{x\in \Delta_d}||x-v||^2
\]
where
\[\Delta_d = \{x\in\mathbb{R}^d:\sum_{i=1}^dx_i=1,x_i\geq 0 \}
\]
is the standard simplex.
\(\textbf{Fact 3.9}\) We may assume that \(v_1\geq v_2\geq ...\geq v_d\)
\(\large\textbf{Lemma 3.10}\) Let $x^* = \arg\min_{x\in\Delta_d}||x-v||^2 $. Under the assumption of Fact \(3.9\), there exists a unique \(p\in \{1,...,d \}\), such that:
\[\begin{align}
x_i^*&>0,i\leq p,\\
x_i^*&=0, i>p
\end{align}
\]
\(\textbf{Proof:}\)
Recall the optimality condition from the previous lemma: if \(x^*\) minimizes a differentiable convex function over a convex set, then for all feasible \(x\),
\[\nabla f(x^*)^T(x-x^*)\geq 0
\]
Applied to
\[d_v(z) = ||z-v||^2
\]
over \(\Delta_d\), this gives
\[\nabla d_v(x^*)^T(x-x^*) = 2(x^*-v)^T(x-x^*)\geq 0,\quad x\in \Delta_d
\]
Because \(\sum_i x_i^*=1\), there is at least one positive entry in \(x^*\). It remains to show that we cannot have \(x_i^*=0\) and \(x_{i+1}^*>0\) for some \(i\). Indeed, in that case we could decrease \(x_{i+1}^*\) by some positive \(\epsilon\leq x_{i+1}^*\) and simultaneously increase \(x_i^*\) from \(0\) to \(\epsilon\), obtaining a vector \(x\in\Delta_d\) such that:
\[(x^*-v)^T(x-x^*) = (0-v_i)\epsilon -(x_{i+1}^*-v_{i+1})\epsilon = \epsilon(v_{i+1}-v_i-x_{i+1}^*)<0
\]
since \(v_i\geq v_{i+1}\) (Fact \(3.9\)) and \(x_{i+1}^*>0\), contradicting the optimality condition above.
\(\large\textbf{Lemma 3.11}\) Under the assumption of Fact \(3.9\), and with \(p\) as in Lemma \(3.10\):
\[x_i^* = v_i-\Theta_p,i\leq p
\]
where
\[\Theta_p = \frac{1}{p}\Big(\sum_{i=1}^pv_i-1\Big)
\]
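The formula for \(\Theta_p\) can be checked directly: it is exactly the shift that makes the positive entries sum to \(1\), as required by Fact \(3.7\):
\[\sum_{i=1}^p x_i^* = \sum_{i=1}^p (v_i-\Theta_p) = \sum_{i=1}^p v_i - p\cdot\frac{1}{p}\Big(\sum_{i=1}^p v_i-1\Big) = 1
\]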
\(\large\textbf{Lemma 3.12}\) Under the assumption of Fact \(3.9\), with \(x^*(p)\) defined as
\[x^*(p) = (v_1-\Theta_p,...,v_p-\Theta_p,0,...,0),\quad p\in\{1,...,d \}
\]
\]
and with
\[p^* = \max\{p\in\{1,...,d\}:v_p-\frac{1}{p}(\sum_{i=1}^pv_i-1)>0 \}
\]
it holds that:
\[\arg\min_{x\in\Delta_d}||x-v||^2 = x^*(p^*)
\]
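Putting Facts \(3.6\)-\(3.9\) and Lemmas \(3.10\)-\(3.12\) together yields an \(O(d\log d)\) projection routine: sort, find \(p^*\), shift by \(\Theta_{p^*}\). Below is a minimal Python sketch of this procedure; the function names and the handling of the general \(l_1\)-ball case (signs and radius \(R\), in the spirit of Fact \(3.6\)) are my own illustrative choices.
```python
import numpy as np

def project_simplex(v, z=1.0):
    """Project an entrywise non-negative v onto {x : x >= 0, sum(x) = z}."""
    u = np.sort(v)[::-1]                 # Fact 3.9: sort in decreasing order
    css = np.cumsum(u) - z               # prefix sums minus the radius
    p = np.arange(1, len(v) + 1)
    # p* = largest p with v_p - (1/p)(sum_{i<=p} v_i - z) > 0   (Lemma 3.12)
    p_star = p[u - css / p > 0][-1]
    theta = css[p_star - 1] / p_star     # Theta_{p*} as in Lemma 3.11
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    """Project v onto B_1(R), reducing to the simplex case as in Fact 3.6."""
    if np.sum(np.abs(v)) <= R:           # already inside the ball
        return v.copy()
    # project |v| onto the scaled simplex, then restore the signs of v
    return np.sign(v) * project_simplex(np.abs(v), z=R)

# small usage example: the result lies on the boundary, ||x||_1 == 1
v = np.array([0.9, -0.6, 0.3])
x = project_l1_ball(v, R=1.0)            # approx. [0.633, -0.333, 0.033]
```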
3. Proximal Gradient Descent
An important class of objective functions has the composite form
\[f(x) = g(x)+h(x)
\]
where \(g\) is a "nice" (e.g. smooth) function and \(h\) is a simple but possibly non-differentiable term. The classical gradient step for unconstrained minimization of \(g\) can be equivalently written as
\[\begin{align}
x_{t+1}&=\arg\min_{y}g(x_t)+\nabla g(x_t)^T(y-x_t)+\frac{1}{2\gamma}||y-x_t||^2\\
&=\arg\min_{y} \frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2
\end{align}
\]
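The second equality can be checked by expanding the square; terms that do not depend on \(y\) do not affect the minimizer:
\[\frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2 = \frac{1}{2\gamma}||y-x_t||^2+\nabla g(x_t)^T(y-x_t)+\frac{\gamma}{2}||\nabla g(x_t)||^2
\]
and the last term, like \(g(x_t)\), is constant in \(y\).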
Proximal Gradient Algorithm
\(\text{Proximal Mapping:}\)
\[\text{prox}_{h,\gamma}(z) = \arg\min_y\{\frac{1}{2\gamma}||y-z||^2+h(y) \}
\]
An iteration of \(\text{proximal gradient descent}\) is defined as:
\[x_{t+1}=\text{prox}_{h,\gamma}(x_t-\gamma\nabla g(x_t))
\]
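As an illustration (not from the notes above), here is a minimal Python sketch of proximal gradient descent for the common special case \(h(x)=\lambda||x||_1\), whose proximal mapping is entrywise soft-thresholding; the least-squares choice of \(g\), the step size \(\gamma = 1/L\), and all names are assumptions made for the example.
```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau * ||.||_1: entrywise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient(grad_g, prox_h, x0, gamma, T=500):
    """Iterate x_{t+1} = prox_{h,gamma}(x_t - gamma * grad g(x_t))."""
    x = x0.copy()
    for _ in range(T):
        x = prox_h(x - gamma * grad_g(x), gamma)
    return x

# Example: g(x) = 0.5 * ||A x - b||^2 (smooth), h(x) = lam * ||x||_1 (non-smooth).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ (np.eye(100)[0] - np.eye(100)[5])   # sparse ground truth e_0 - e_5
lam = 0.1

grad_g = lambda x: A.T @ (A @ x - b)
prox_h = lambda z, gamma: soft_threshold(z, gamma * lam)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L, with L = ||A||_2^2 the smoothness of g

x_hat = proximal_gradient(grad_g, prox_h, np.zeros(100), gamma)  # approximately sparse
```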