1. Smooth and strongly convex functions: \(O(\log(1/\epsilon))\) steps
\(\large\textbf{Theorem 3.5}\):
Let \(f:dom(f) \rightarrow \mathbb{R}\) be convex and differentiable, \(X\subseteq dom(f)\) a closed convex set, and \(x^* = \arg\min_{x\in X}f(x)\). Suppose \(f\) is smooth with parameter \(L\) and strongly convex with parameter \(\mu\) over \(X\). Choosing stepsize
\[\gamma = \frac{1}{L}
\]
projected gradient descent with arbitrary \(x_0\in X\) satisfies the following \(2\) properties:
- \(\text{(i) Squared distances to }x^*\text{ are geometrically decreasing:}\)
\[\begin{align}
||x_{t+1}-x^*||^2\leq (1-\frac{\mu}{L})||x_t-x^*||^2
\end{align}
\]
\(\textbf{Proof:}\)
By the definition of projected gradient descent, with \(g_t := \nabla f(x_t)\):
\[\begin{align}
y_{t+1} &= x_t-\gamma g_t\\
x_{t+1} &= \prod_X(y_{t+1}) = \arg\min_{x\in X}||x-y_{t+1}||^2
\end{align}
\]
We also use the following property of the projection: for all \(x\in X\) and any \(y\),
\[||x-\prod_X(y)||^2+||y-\prod_X(y)||^2\leq ||x-y||^2
\]
Therefore, using \(2v^Tw = ||v||^2+||w||^2-||v-w||^2\) with \(v = x_t-y_{t+1} = \gamma g_t\) and \(w = x_t-x^*\):
\[\begin{align}
g_t^T(x_t-x^*) &= \frac{1}{\gamma}(x_t-y_{t+1})^T(x_t-x^*)\\
&=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]
\end{align}
\]
Applying the projection inequality with \(x = x^*\) and \(y = y_{t+1}\):
\[||x^*-x_{t+1}||^2+||y_{t+1}-x_{t+1}||^2\leq ||x^*-y_{t+1}||^2
\]
Hence, plugging this into the expression for \(g_t^T(x_t-x^*)\) above:
\[\begin{align}
g_t^T(x_t-x^*) &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]\\
&\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]
\end{align}
\]
Then by strong convexity:
\[f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}||x-y||^2
\]
Applying this with \(x = x_t\), \(y = x^*\) and rearranging, then using the bound above:
\[\begin{align}
f(x_t)-f(x^*)&\leq g_t^T(x_t-x^*)-\frac{\mu}{2}||x_t-x^*||^2\\
&\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]-\frac{\mu}{2}||x_t-x^*||^2
\end{align}
\]
This bound contains \(||x_{t+1}-x^*||^2\), the quantity we want to control; rearranging for it gives:
\[||x_{t+1}-x^*||^2\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2
\]
\(\textbf{Recall Lemma 3.3 from the previous chapter:}\)
\[\begin{align}
f(x_{t+1})\leq f(x_t)-\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2
\end{align}
\]
Thus, since \(f(x^*)\leq f(x_{t+1})\):
\[\begin{align}
f(x^*)-f(x_t)\leq f(x_{t+1})-f(x_t)\leq -\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2
\end{align}
\]
Then:
\[\begin{align}
||x_{t+1}-x^*||^2&\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2\\
&\leq \gamma^2||g_t||^2+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2+\gamma[-\frac{1}{L}||g_t||^2+L||y_{t+1}-x_{t+1}||^2]\\
&= (1-\gamma\mu) ||x_t-x^*||^2
\end{align}
\]
where the last equality uses \(\gamma = 1/L\), so the \(||g_t||^2\) terms and the \(||y_{t+1}-x_{t+1}||^2\) terms cancel. Since \(1-\gamma\mu = 1-\frac{\mu}{L}\), this proves (i).
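As a quick sanity check of (i), here is a minimal numerical sketch (not part of the original argument): projected gradient descent with \(\gamma = 1/L\) on a separable strongly convex quadratic over a box, where the projection is a coordinate-wise clip. The specific quadratic, the box, and the helper names are illustrative assumptions.
```python
import numpy as np

# Illustrative setup: f(x) = 0.5 * x^T A x - b^T x with A = diag(mu, L),
# minimized over the box X = [0, 1]^2; projecting onto a box is a clip.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
b = np.array([2.0, 2.0])
gamma = 1.0 / L

def grad(x):
    return A @ x - b

def project(x):
    return np.clip(x, 0.0, 1.0)          # projection onto [0, 1]^2

# Since A is diagonal the problem separates per coordinate, so the
# constrained minimizer is the clip of the unconstrained one.
x_star = project(np.linalg.solve(A, b))   # = (1, 0.2)

x = np.array([0.0, 0.0])                  # arbitrary starting point in X
for t in range(20):
    x_next = project(x - gamma * grad(x))
    # squared distance to x* shrinks at least by the factor (1 - mu/L)
    assert np.sum((x_next - x_star)**2) <= (1 - mu / L) * np.sum((x - x_star)**2) + 1e-12
    x = x_next
```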
- \(\text{(ii) The absolute error after }T\text{ iterations is exponentially small:}\)
\[\begin{align}
f(x_T)-f(x^*)\leq ||\nabla f(x^*)||(1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2
\end{align}
\]
\(\textbf{Proof:}\)
By smoothness, then Cauchy-Schwarz, and finally property (i) applied \(T\) times:
\[\begin{align}
f(x_T)-f(x^*)&\leq \nabla f(x^*)^T(x_T-x^*)+\frac{L}{2}||x^*-x_T||^2\\
&\leq ||\nabla f(x^*)||\cdot ||x_T-x^*||+\frac{L}{2}||x^*-x_T||^2\\
&\leq ||\nabla f(x^*)|| (1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2
\end{align}
\]
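To connect (ii) with the \(O(\log(1/\epsilon))\) claim in the section title, here is a short added calculation, using the standard bounds \((1-\frac{\mu}{L})^T\leq(1-\frac{\mu}{L})^{T/2}\) and \(1-\frac{\mu}{L}\leq e^{-\mu/L}\); the constant \(C\) depends on \(||\nabla f(x^*)||\), \(||x_0-x^*||\) and \(L\):
\[\begin{align}
f(x_T)-f(x^*)&\leq \underbrace{\Big(||\nabla f(x^*)||\,||x_0-x^*||+\frac{L}{2}||x_0-x^*||^2\Big)}_{=:C}(1-\frac{\mu}{L})^{T/2}\leq C\, e^{-\frac{\mu T}{2L}}
\end{align}
\]
so \(f(x_T)-f(x^*)\leq \epsilon\) once \(T\geq \frac{2L}{\mu}\ln\frac{C}{\epsilon}\), i.e. after \(O(\frac{L}{\mu}\log\frac{1}{\epsilon})\) steps.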
2. Projecting onto \(l_1\)-balls
\[X = B_1(R) = \{ x\in\mathbb{R}^d:||x||_1 = \sum_{i=1}^d|x_i|\leq R \}
\]
\(\textbf{Fact 3.6}\) For computing \(\prod_X(v)\) of a given point \(v\in\mathbb{R}^d\), we may assume without loss of generality that \((i)\) \(R=1\), \((ii)\) \(v_i\geq 0\) for all \(i\), and \((iii)\) \(\sum_{i=1}^dv_i>1\).
\(\textbf{Fact 3.7}\) Under assumptions of Fact \(3.6\), \(x = \prod_X(v)\) satisfies \(x_i\geq 0\) for all \(i\) and $\sum_{i=1}^dx_i=1 $
\(\textbf{Corollary 3.8}\) Under the assumptions of Fact \(3.6\),
\[\prod_X(v) = \arg\min_{x\in \Delta_d}||x-v||^2
\]
where
\[\Delta_d = \{x\in\mathbb{R}^d:\sum_{i=1}^dx_i=1,x_i\geq 0 \}
\]
is the standard simplex.
\(\textbf{Fact 3.9}\) We may assume that \(v_1\geq v_2\geq ...\geq v_d\)
\(\large\textbf{Lemma 3.10}\) Let $x^* = \arg\min_{x\in\Delta_d}||x-v||^2 $. Under the assumption of Fact \(3.9\), there exists a unique \(p\in \{1,...,d \}\), such that:
\[\begin{align}
x_i^*&>0,i\leq p,\\
x_i^*&=0, i>p
\end{align}
\]
\(\textbf{Proof:}\)
Recall the optimality condition from the previous lemma: if \(x^*\) minimizes a differentiable convex function over a convex set, then for all feasible \(x\),
\[\nabla f(x^*)^T(x-x^*)\geq 0
\]
Applied to
\[d_v(z) = ||z-v||^2
\]
over \(\Delta_d\), this gives
\[\nabla d_v(x^*)^T(x-x^*) = 2(x^*-v)^T(x-x^*)\geq 0,\quad x\in \Delta_d
\]
Because \(\sum_i x_i^*=1\), there is at least one positive entry in \(x^*\). It remains to show that we cannot have \(x_i^*=0\) and \(x_{i+1}^*>0\) for some \(i\). Indeed, in that case we could decrease \(x_{i+1}^*\) by some positive \(\epsilon\leq x_{i+1}^*\) and simultaneously increase \(x_i^*\) from \(0\) to \(\epsilon\), obtaining a vector \(x\in\Delta_d\) such that:
\[(x^*-v)^T(x-x^*) = (0-v_i)\epsilon -(x_{i+1}^*-v_{i+1})\epsilon = \epsilon(v_{i+1}-v_i-x_{i+1}^*)<0
\]
since \(v_i\geq v_{i+1}\) (Fact \(3.9\)) and \(x_{i+1}^*>0\), contradicting the optimality condition above.
\(\large\textbf{Lemma 3.11}\) Under the assumption of Fact \(3.9\), and with \(p\) as in Lemma \(3.10\):
\[x_i^* = v_i-\Theta_p,i\leq p
\]
where
\[\Theta_p = \frac{1}{p}\Big(\sum_{i=1}^pv_i-1\Big)
\]
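The formula for \(\Theta_p\) can be checked directly: it is exactly the shift that makes the positive entries sum to \(1\), as required by Fact \(3.7\):
\[\sum_{i=1}^p x_i^* = \sum_{i=1}^p (v_i-\Theta_p) = \sum_{i=1}^p v_i - p\cdot\frac{1}{p}\Big(\sum_{i=1}^p v_i-1\Big) = 1
\]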
\(\large\textbf{Lemma 3.12}\) Under the assumption of Fact \(3.9\), with \(x^*(p)\) defined as
\[x^*(p) = (v_1-\Theta_p,...,v_p-\Theta_p,0,...,0),\quad p\in\{1,...,d \}
\]
\]
and with
\[p^* = \max\{p\in\{1,...,d\}:v_p-\frac{1}{p}(\sum_{i=1}^pv_i-1)>0 \}
\]
it holds that:
\[\arg\min_{x\in\Delta_d}||x-v||^2 = x^*(p^*)
\]
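Putting Facts \(3.6\)-\(3.9\) and Lemmas \(3.10\)-\(3.12\) together yields an \(O(d\log d)\) projection routine: sort, find \(p^*\), shift by \(\Theta_{p^*}\). Below is a minimal Python sketch of this procedure; the function names and the handling of the general \(l_1\)-ball case (signs and radius \(R\), in the spirit of Fact \(3.6\)) are my own illustrative choices.
```python
import numpy as np

def project_simplex(v, z=1.0):
    """Project an entrywise non-negative v onto {x : x >= 0, sum(x) = z}."""
    u = np.sort(v)[::-1]                 # Fact 3.9: sort in decreasing order
    css = np.cumsum(u) - z               # prefix sums minus the radius
    p = np.arange(1, len(v) + 1)
    # p* = largest p with v_p - (1/p)(sum_{i<=p} v_i - z) > 0   (Lemma 3.12)
    p_star = p[u - css / p > 0][-1]
    theta = css[p_star - 1] / p_star     # Theta_{p*} as in Lemma 3.11
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    """Project v onto B_1(R), reducing to the simplex case as in Fact 3.6."""
    if np.sum(np.abs(v)) <= R:           # already inside the ball
        return v.copy()
    # project |v| onto the scaled simplex, then restore the signs of v
    return np.sign(v) * project_simplex(np.abs(v), z=R)

# small usage example: the result lies on the boundary, ||x||_1 == 1
v = np.array([0.9, -0.6, 0.3])
x = project_l1_ball(v, R=1.0)            # approx. [0.633, -0.333, 0.033]
```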
3. Proximal Gradient Descent
An important class of objective functions has the composite form
\[f(x) = g(x)+h(x)
\]
where \(g\) is a "nice" (e.g. smooth) function and \(h\) is a simple but possibly non-differentiable term. The classical gradient step for unconstrained minimization of \(g\) can be equivalently written as
\[\begin{align}
x_{t+1}&=\arg\min_{y}g(x_t)+\nabla g(x_t)^T(y-x_t)+\frac{1}{2\gamma}||y-x_t||^2\\
&=\arg\min_{y} \frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2
\end{align}
\]
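The second equality can be checked by expanding the square; terms that do not depend on \(y\) do not affect the minimizer:
\[\frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2 = \frac{1}{2\gamma}||y-x_t||^2+\nabla g(x_t)^T(y-x_t)+\frac{\gamma}{2}||\nabla g(x_t)||^2
\]
and the last term, like \(g(x_t)\), is constant in \(y\).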
Proximal Gradient Algorithm
\(\text{Proximal Mapping:}\)
\[\text{prox}_{h,\gamma}(z) = \arg\min_y\{\frac{1}{2\gamma}||y-z||^2+h(y) \}
\]
An iteration of \(\text{proximal gradient descent}\) is defined as:
\[x_{t+1}=\text{prox}_{h,\gamma}(x_t-\gamma\nabla g(x_t))
\]
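As an illustration (not from the notes above), here is a minimal Python sketch of proximal gradient descent for the common special case \(h(x)=\lambda||x||_1\), whose proximal mapping is entrywise soft-thresholding; the least-squares choice of \(g\), the step size \(\gamma = 1/L\), and all names are assumptions made for the example.
```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau * ||.||_1: entrywise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient(grad_g, prox_h, x0, gamma, T=500):
    """Iterate x_{t+1} = prox_{h,gamma}(x_t - gamma * grad g(x_t))."""
    x = x0.copy()
    for _ in range(T):
        x = prox_h(x - gamma * grad_g(x), gamma)
    return x

# Example: g(x) = 0.5 * ||A x - b||^2 (smooth), h(x) = lam * ||x||_1 (non-smooth).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
b = A @ (np.eye(100)[0] - np.eye(100)[5])   # sparse ground truth e_0 - e_5
lam = 0.1

grad_g = lambda x: A.T @ (A @ x - b)
prox_h = lambda z, gamma: soft_threshold(z, gamma * lam)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L, with L = ||A||_2^2 the smoothness of g

x_hat = proximal_gradient(grad_g, prox_h, np.zeros(100), gamma)  # approximately sparse
```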