Exploring Recursion in Convex Optimization
Recursion in optimization
In this blog post, I aim to provide a overview of the various recursive methods I have seen in convex optimization. Optimization methods often yield a sequence denoted as \(\{x_t\}_{t\geq 0}\), prompting a detailed analysis of the convergence properties inherent in these sequences.
Convergent sequence
The concept of convergent and divergent sequences is fundamental in advanced mathematical studies, such as the well-known Cauchy sequence. In the following discussion, I will narrow our focus to specific recursion techniques, shedding light on their convergence properties. To delve deeper into this subject, let's explore a particular recursion method:
The formulation mentioned above holds universal significance in the field of optimization (see Polyak ch 2.2.3).
Now, let's delve into two fundamental lemmas that play a crucial role in determining the convergence of a sequence governed by Eq.\eqref{eq:base}.
Lemma 1 Let \(x_t \geq 0\) and
where \(0\leq \gamma_t < 1\), \(\varepsilon \geq 0\), \(\sum_{t=0}^{\infty} (1-\gamma_t) =\infty\), and \(\varepsilon_t/ (1-\gamma_t) \rightarrow 0\), then
The condition \(\frac{\varepsilon}{1-\gamma_t} \to 0\) implies that \(\varepsilon = o(1-\gamma_t)\).
Lemma 2 Let \(x_t \geq 0\) and
where \(\sum_{t=0}^{\infty} \alpha_t < \infty\) and \(\sum_{t=0}^\infty \varepsilon_t < \infty\), then
Drawing from these two lemmas, it becomes evident that various convergent recursions can be derived, as illustrated in Franci, B..
Convergence rate
When dealing with a convergent sequence, a critical question arises: How many iterations are needed to obtain an \(\epsilon\)-approximate solution? In other words, what is the convergence rate for this sequence? Understanding the convergence rate allows us to select the most efficient method based on the least number of iterations required.
To begin our analysis, let's consider a scenario where both \(\gamma_t \equiv \gamma\) and \(\varepsilon_t \equiv \varepsilon\) are fixed. While this violates the assumption in Lemma 1, it provides a starting point for our exploration. Expanding the recursion \eqref{eq:base}, we obtain:
This indicates that \(x_{t+1}\) lies within the \(\frac{\varepsilon}{1-\gamma}\)-neighborhood of \(0\). Achieving a convergence guarantee necessitates driving \(\frac{\varepsilon}{1-\gamma}\) to approach \(0\), as outlined in Lemma 1.
As highlighted in Bottou thm4.6, employing a constant learning rate results in the linear convergence of expected objective values to a neighborhood of the optimal value. While a smaller stepsize might degrade the contraction constant in the convergence rate, it facilitates approaching closer to the optimal value.
To ensure convergence, we opt for a diminishing stepsize strategy, leading to a linear convergence rate of \(\mathcal{O}\left(\frac{1}{T}\right)\).
In general, we reformulate \eqref{eq:base} as follows:
Theorem 1 Let \(x_t \geq 0\) and
where \(\eta_t = \frac{b}{a + t}\), \(b, \sigma^2 > 0\), \(a\geq 0\), then
where \(v=\max\{(a + 1)x_0, \frac{b^2\sigma^2}{b-1}\}\).
We prove it by induction. When \(t=1\), it obviously holds. Assume that it holds for \(t\), then
which leads to \(x_{t+1} \leq \frac{v}{t+1+a}\).
In Polyak lemma 2.2.4, there is the convergence rate for the recursion
As shown in Bottou thm5.1, when \(\varepsilon_t\) converges geometrically, the recursion exhibits a linear convergence rate with a constant stepsize, a technique often referred to as dynamic sampling.
Reference
- Polyak, B. T. (1987). Introduction to optimization.
- Franci, B., & Grammatico, S. (2022). Convergence of sequences: A survey. Annual Reviews in Control, 53, 161-186.
- Bottou, L., Curtis, F. E., & Nocedal, J. (2018). Optimization methods for large-scale machine learning. SIAM review, 60(2), 223-311.
- Stich, S. U. (2019). Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232.