Chapter 2: Optimization
Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".
What is optimization? Using algorithms to minimize a function. In machine learning, this usually means minimizing a loss function.
GD: Gradient descent
To minimize a function $f$, GD iterates the update rule
$$x_{t+1} = x_t - \eta \nabla f(x_t),$$
where $\eta$ is the learning rate.
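A minimal sketch of this update rule in Python (the quadratic objective and step size are illustrative assumptions, not from the notes):

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Run x_{t+1} = x_t - eta * grad(x_t) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x (so L = 1).
x_final = gradient_descent(grad=lambda x: x, x0=np.ones(3), eta=0.1, steps=100)
print(x_final)  # close to the minimizer [0, 0, 0]
```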
Smoothness assumption

A function $f$ is $L$-smooth if (any of the following):

(a) $\nabla^2 f(x) \preceq L I$ for all $x$;

(b) $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$;

(c) $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$ for all $x, y$.

These three are equivalent. Proof sketch:

(a) $\Rightarrow$ (b): write $\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f\big(x + s(y - x)\big)(y - x)\, ds$ and bound the integrand using (a).

(b) $\Rightarrow$ (c): write $f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \big\langle \nabla f\big(x + s(y - x)\big) - \nabla f(x),\, y - x \big\rangle\, ds$ and apply (b) inside the integral.

(c) $\Rightarrow$ (a): Taylor expansion. Assume some eigenvalue of $\nabla^2 f(x)$ exceeds $L$. Let $v$ be the corresponding eigenvector; expanding $f(x + tv)$ around $x$ for small $t$ then violates (c).
Convexity assumption

A differentiable function $f$ is convex if (any of the following):

(a) $\nabla^2 f(x) \succeq 0$ for all $x$ (when $f$ is twice differentiable);

(b) $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$ for all $x, y$;

(c) $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$.
GD convergence guarantee

Using these two assumptions and the update rule, we can get a GD guarantee.

Lemma: Combining L-smoothness and the update rule, we get
$$f(x_{t+1}) \le f(x_t) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x_t)\|^2.$$

Proof: Use L-smoothness condition (c) and the update rule $x_{t+1} - x_t = -\eta\nabla f(x_t)$:
$$f(x_{t+1}) \le f(x_t) - \eta\|\nabla f(x_t)\|^2 + \frac{L\eta^2}{2}\|\nabla f(x_t)\|^2.$$

Set the learning rate $\eta = 1/L$, so the lemma becomes $f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2$.

Begin with the lemma. Combining it with convexity (b) and the expansion
$$\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta\langle\nabla f(x_t), x_t - x^*\rangle + \eta^2\|\nabla f(x_t)\|^2,$$
each step satisfies
$$f(x_{t+1}) - f(x^*) \le \frac{L}{2}\Big(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\Big).$$
Then start telescoping. The first inequality below holds because $f(x_t)$ is non-increasing:
$$f(x_T) - f(x^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\big(f(x_{t+1}) - f(x^*)\big) \le \frac{L\|x_0 - x^*\|^2}{2T}.$$
$\mu$-strongly convex assumption

(a) $\nabla^2 f(x) \succeq \mu I$ for all $x$;

(b) $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \mu\|x - y\|^2$ for all $x, y$;

(c) $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$.

As with L-smoothness above, there is one condition each on the second derivative, the gradient, and the function itself; they are equivalent, and the proof is similar, so we omit it here.
GD can get a linear convergence rate on an L-smooth and $\mu$-strongly convex function:
$$f(x_t) - f(x^*) \le \Big(1 - \frac{\mu}{L}\Big)^t\big(f(x_0) - f(x^*)\big).$$
Then let's move on to the proof. Set the learning rate $\eta = 1/L$ and begin with the lemma again.
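A standard route to finish the argument uses the gradient-domination inequality implied by strong convexity; minimizing strong convexity (c) over $y$ gives
$$f(x) - f(x^*) \le \frac{1}{2\mu}\|\nabla f(x)\|^2.$$
Then
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2 && \text{(lemma with } \eta = 1/L\text{)}\\
&\le f(x_t) - \frac{\mu}{L}\big(f(x_t) - f(x^*)\big) && \text{(gradient domination)}.
\end{aligned}$$
Subtracting $f(x^*)$ from both sides gives $f(x_{t+1}) - f(x^*) \le (1 - \mu/L)\big(f(x_t) - f(x^*)\big)$, and unrolling over $t$ gives the claimed rate.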
SGD: Stochastic gradient descent
But gradient descent has two limitations:

- Computing the full gradient is slow for big data.
- It could get stuck at stationary points (saddle points and bad local minima).

So we add some randomness to the GD algorithm: with $f = \frac{1}{n}\sum_{i=1}^n f_i$, sample a random mini-batch $B_t$ of size $b$ and update
$$x_{t+1} = x_t - \eta \cdot \frac{1}{b}\sum_{i \in B_t}\nabla f_i(x_t)$$
(a code sketch follows the list below).

- If $b$ is too small, say $b = 1$, the estimate still has the right expectation, but a very large variance.
- If $b$ is too big, say $b = n$, it is slow to compute.
- A small but not too small $b$ has at least two benefits:
  - It keeps some noise, which helps escape saddle points or bad local minima.
  - It helps to get the right mini-batch statistics for batch normalization (I don't fully understand this point).
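A minimal mini-batch SGD sketch in Python; the least-squares objective and all hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eta = 1000, 5, 32, 0.1            # samples, dims, batch size, step size
A, w_true = rng.normal(size=(n, d)), rng.normal(size=d)
y = A @ w_true                             # f_i(w) = (a_i . w - y_i)^2 / 2

w = np.zeros(d)
for t in range(500):
    batch = rng.choice(n, size=b, replace=False)        # random mini-batch B_t
    grad = A[batch].T @ (A[batch] @ w - y[batch]) / b   # unbiased gradient estimate
    w -= eta * grad
print(np.linalg.norm(w - w_true))          # small: close to the minimizer
```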
If we write the update rule of SGD as
$$x_{t+1} = x_t - \eta\, g_t, \qquad \mathbb{E}\big[g_t \mid x_t\big] = \nabla f(x_t),$$
we run SGD with bounded variance:
$$\mathbb{E}\big[\|g_t - \nabla f(x_t)\|^2 \mid x_t\big] \le \sigma^2.$$
Convergence analysis

Because the update rule is different, we cannot begin with the earlier lemma; we need a new one.

Lemma: Combining L-smoothness and the SGD update rule, we get
$$\mathbb{E}\big[f(x_{t+1}) \mid x_t\big] \le f(x_t) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x_t)\|^2 + \frac{L\eta^2\sigma^2}{2}.$$

Proof: Use L-smoothness condition (c) and the update rule, take the expectation over $g_t$, and split the second moment:
$$\mathbb{E}\|g_t\|^2 = \|\nabla f(x_t)\|^2 + \mathbb{E}\|g_t - \nabla f(x_t)\|^2 \le \|\nabla f(x_t)\|^2 + \sigma^2.$$

SGD convergence guarantee

Begin with the lemma; for $\eta \le 1/L$ we have $1 - L\eta/2 \ge 1/2$, so
$$\frac{\eta}{2}\,\mathbb{E}\|\nabla f(x_t)\|^2 \le \mathbb{E}\big[f(x_t)\big] - \mathbb{E}\big[f(x_{t+1})\big] + \frac{L\eta^2\sigma^2}{2}.$$
Then start telescoping over $t = 0, \dots, T-1$:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{2\big(f(x_0) - f(x^*)\big)}{\eta T} + L\eta\sigma^2.$$

Convergence rate: $T = O\big(L\sigma^2\big(f(x_0) - f(x^*)\big)/\epsilon^2\big)$ steps suffice to make $\min_t \mathbb{E}\|\nabla f(x_t)\|^2 \le \epsilon$.

Proof: set $\eta = \min\{1/L,\ \epsilon/(2L\sigma^2)\}$, so the second term is at most $\epsilon/2$; then $T \ge 4\big(f(x_0) - f(x^*)\big)/(\eta\epsilon)$ makes the first term at most $\epsilon/2$ as well.

So SGD needs $O(1/\epsilon^2)$ steps where GD needs only $O(1/\epsilon)$ for the same stationarity guarantee.
Thinking: it seems SGD is much slower than GD, so why do we still use it?

- The noise may not actually be that large in practice.
- Sometimes noise helps, especially in non-convex settings (we will see this later).
- $\epsilon$ does not need to be that small: we do not care about the training loss, but about the population loss. Even with $\epsilon = 0.01$, the population loss is not guaranteed to be better than with $\epsilon = 0.1$.
- The variance of the noise can be reduced gradually along the process (making SGD as fast as GD).
If the variance can be driven down during the run, can SGD be as fast as GD? The answer is yes: we can reduce the variance and recover linear convergence on smooth, strongly convex problems.

Unfortunately, this seems to work only for convex functions, and is not useful for deep learning (yet).

There are many such algorithms, e.g., SAG, SVRG, SAGA, etc.
SVRG

Assumption: each $f_i$ is L-smooth and $f = \frac{1}{n}\sum_{i=1}^n f_i$ is $\mu$-strongly convex.

Update rule:

For $s = 1, 2, \dots$ (outer loop):
- Set $\tilde{x} = \tilde{x}_{s-1}$, compute the full gradient $\tilde{g} = \nabla f(\tilde{x})$, and set $x_0 = \tilde{x}$.
- For $t = 1, 2, \dots, m$ (inner loop):
  - Randomly pick $i_t \in \{1, \dots, n\}$ and update
    $$x_t = x_{t-1} - \eta\big(\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x}) + \tilde{g}\big).$$

Note: in the proof below, $\tilde{x}_s$ is taken to be an inner iterate $x_t$ chosen uniformly at random, rather than $x_m$.

Thinking: we compute the full gradient only once per outer iteration, not at every step. So although a full gradient is expensive, the amortized cost is small.

We assume the inner loop length $m$ is large enough (on the order of $L/\mu$) for the contraction below to hold.
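A sketch of this update rule in Python, again on an illustrative least-squares problem (all problem data and hyperparameters are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A, w_true = rng.normal(size=(n, d)), rng.normal(size=d)
y = A @ w_true                          # f_i(w) = (a_i . w - y_i)^2 / 2

def grad_i(w, i):                       # gradient of a single component f_i
    return A[i] * (A[i] @ w - y[i])

def full_grad(w):                       # gradient of f = (1/n) sum_i f_i
    return A.T @ (A @ w - y) / n

w_tilde, eta, m = np.zeros(d), 0.01, 2 * n
for s in range(50):                     # outer loop
    g_tilde = full_grad(w_tilde)        # one full gradient per epoch
    x = w_tilde.copy()
    for t in range(m):                  # inner loop
        i = rng.integers(n)
        x -= eta * (grad_i(x, i) - grad_i(w_tilde, i) + g_tilde)
    w_tilde = x                         # (last iterate, for simplicity; the proof uses a random one)
print(np.linalg.norm(w_tilde - w_true))
```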
Convergence analysis

First, study the effect of a single inner-loop step. The stochastic gradient used in the inner loop is
$$v_t = \nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x}) + \nabla f(\tilde{x}),$$
which is unbiased: $\mathbb{E}[v_t] = \nabla f(x_{t-1})$.

As for the distance to the optimum, expanding the update gives
$$\mathbb{E}\|x_t - x^*\|^2 = \|x_{t-1} - x^*\|^2 - 2\eta\,\langle\nabla f(x_{t-1}),\, x_{t-1} - x^*\rangle + \eta^2\,\mathbb{E}\|v_t\|^2.$$
The second moment is bounded, by adding and subtracting $\nabla f_{i_t}(x^*)$ and using $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, as
$$\mathbb{E}\|v_t\|^2 \le 4L\big[f(x_{t-1}) - f(x^*) + f(\tilde{x}) - f(x^*)\big]$$
(the last step uses the result $\mathbb{E}_i\|\nabla f_i(x) - \nabla f_i(x^*)\|^2 \le 2L\big(f(x) - f(x^*)\big)$, which follows from the L-smoothness of each $f_i$). Apart from that quoted result, none of the bounds above used smoothness or convexity.

Now substitute back into the earlier expression and use convexity, $\langle\nabla f(x_{t-1}), x_{t-1} - x^*\rangle \ge f(x_{t-1}) - f(x^*)$. That is,
$$\mathbb{E}\|x_t - x^*\|^2 \le \|x_{t-1} - x^*\|^2 - 2\eta(1 - 2L\eta)\big[f(x_{t-1}) - f(x^*)\big] + 4L\eta^2\big[f(\tilde{x}) - f(x^*)\big].$$

The proof so far considered a single step of the inner loop. For a whole inner loop, sum over $t = 1, \dots, m$: the left side telescopes in $\|x_t - x^*\|^2$, and the right side accumulates the function-gap terms:
$$\mathbb{E}\|x_m - x^*\|^2 + 2\eta(1 - 2L\eta)\sum_{t=1}^{m}\mathbb{E}\big[f(x_{t-1}) - f(x^*)\big] \le \|x_0 - x^*\|^2 + 4Lm\eta^2\big[f(\tilde{x}) - f(x^*)\big].$$

Next, relate each outer-loop iteration from $\tilde{x}_{s-1}$ to $\tilde{x}_s$. Since $\tilde{x}_s$ is a uniformly random inner iterate, the sum on the left equals $m\,\mathbb{E}\big[f(\tilde{x}_s) - f(x^*)\big]$; since $x_0 = \tilde{x} = \tilde{x}_{s-1}$, strong convexity gives $\|x_0 - x^*\|^2 \le \frac{2}{\mu}\big[f(\tilde{x}_{s-1}) - f(x^*)\big]$. Putting these together,
$$\mathbb{E}\big[f(\tilde{x}_s) - f(x^*)\big] \le \Big(\frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}\Big)\,\mathbb{E}\big[f(\tilde{x}_{s-1}) - f(x^*)\big],$$
and choosing $\eta = \Theta(1/L)$ and $m = \Theta(L/\mu)$ makes the factor a constant smaller than 1.

This is linear convergence, and it is faster than GD when the condition number $L/\mu$ is large: each outer iteration costs $n + 2m$ component gradients, for a total of $O\big((n + L/\mu)\log(1/\epsilon)\big)$, versus $O\big(n\,(L/\mu)\log(1/\epsilon)\big)$ for GD.
Linear coupling of GD & MD
What do we find in GD? When the gradient is large, GD makes rapid progress (the lemma decreases $f$ by $\frac{1}{2L}\|\nabla f\|^2$ per step); when the gradient is small, GD stalls.

We want another optimization method that makes use of a small gradient and still converges to the optimal point.

If we have a sequence of queried points $x_1, \dots, x_T$ and let $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$, then for any $u$:
$$f(\bar{x}) - f(u) \le \frac{1}{T}\sum_{t=1}^{T}\langle\nabla f(x_t),\, x_t - u\rangle.$$
The quantity on the right-hand side is called the regret. Observe that the smaller the gradients, the smaller the regret and the tighter this bound, which achieves our goal: make use of small gradients and still converge to the optimal point.

The optimization method is then obvious: pick each point so as to keep the regret small.
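The bound above follows from convexity and Jensen's inequality; a short derivation:
$$\begin{aligned}
f(\bar{x}) - f(u)
&\le \frac{1}{T}\sum_{t=1}^{T}\big(f(x_t) - f(u)\big)
&& \text{(Jensen, } \bar{x} = \tfrac{1}{T}\textstyle\sum_t x_t\text{)}\\
&\le \frac{1}{T}\sum_{t=1}^{T}\langle\nabla f(x_t),\, x_t - u\rangle
&& \text{(convexity: } f(u) \ge f(x_t) + \langle\nabla f(x_t), u - x_t\rangle\text{)}.
\end{aligned}$$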
MD: mirror descent

As a first-order method, we can still only use the gradient direction to choose the next point. But the gradient's linear approximation (a hyperplane) extends to negative infinity, which is certainly not what we want; we only want to take a small step. How did GD solve this problem? By introducing a regularizer:
$$x_{t+1} = \arg\min_y\ \langle\nabla f(x_t), y\rangle + \frac{1}{2\eta}\|y - x_t\|^2,$$
which recovers exactly the GD update. Now consider generalizing this regularizer; this leads to the Bregman divergence:
$$D_w(y, x) = w(y) - w(x) - \langle\nabla w(x),\, y - x\rangle,$$
where $w$ is a strictly convex function (taking $w(x) = \frac{1}{2}\|x\|^2$ gives back $D_w(y, x) = \frac{1}{2}\|y - x\|^2$).

The update rule of MD is:
$$x_{t+1} = \arg\min_y\ \langle\nabla f(x_t), y\rangle + \frac{1}{\eta}\,D_w(y, x_t).$$
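A sketch of MD in Python with the (assumed, illustrative) entropy mirror map $w(x) = \sum_i x_i\log x_i$ on the probability simplex, for which the update has the closed form of multiplicative weights:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta, steps):
    """Entropy-mirror-map MD on the simplex: x_{t+1} is proportional to x_t * exp(-eta * grad)."""
    x = x0
    iterates = [x]
    for _ in range(steps):
        x = x * np.exp(-eta * grad(x))
        x = x / x.sum()                    # renormalize onto the simplex
        iterates.append(x)
    return np.mean(iterates, axis=0)       # the regret bound controls the average iterate

# Example: minimize f(x) = <c, x> over the simplex; the optimum puts all mass on argmin(c).
c = np.array([0.3, 0.1, 0.7])
x_bar = mirror_descent_simplex(grad=lambda x: c, x0=np.ones(3) / 3, eta=0.5, steps=500)
print(x_bar)  # concentrates on coordinate 1 (the smallest cost)
```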
Convergence

Lemma (three-point equality): for any $u$,
$$\langle\nabla w(x) - \nabla w(y),\, x - u\rangle = D_w(x, y) + D_w(u, x) - D_w(u, y).$$

The optimality condition of the MD update reads $\nabla w(x_{t+1}) = \nabla w(x_t) - \eta\nabla f(x_t)$; plugging this into the lemma (with $x = x_t$, $y = x_{t+1}$) and using the strong convexity of $w$, each step satisfies
$$\eta\,\langle\nabla f(x_t),\, x_t - u\rangle \le \frac{\eta^2}{2}\|\nabla f(x_t)\|_*^2 + D_w(u, x_t) - D_w(u, x_{t+1}).$$
The last two terms can be telescoped:
$$\frac{1}{T}\sum_{t=1}^{T}\langle\nabla f(x_t),\, x_t - u\rangle \le \frac{\eta\rho^2}{2} + \frac{D_w(u, x_1)}{\eta T},$$
where $\rho \ge \|\nabla f(x_t)\|_*$ bounds the gradients.

Let $\Theta = D_w(u, x_1)$ and choose $\eta = \sqrt{2\Theta/(\rho^2 T)}$; the regret is then at most $\rho\sqrt{2\Theta/T}$.

So MD is an $O(1/\sqrt{T})$ method, i.e., it needs $O(1/\epsilon^2)$ steps, but the bound gets better the smaller the gradients are.
Linear coupling
Can we combine GD and MD?
Wishful thinking: assume every queried gradient has the same norm, $\|\nabla f(x_t)\| = \sigma$. Consider that we need to bring the error down from $f(x_0) - f(x^*) = d$ to $d/2$:

- Run MD: by the regret bound, it needs about $T \approx \sigma^2\Theta/d^2$ steps.
- Run GD: it decreases $f$ by $\frac{\sigma^2}{2L}$ in each step, so it needs about $T \approx Ld/\sigma^2$ steps.

Combine these two: we know one of them is always fast. Whatever $\sigma$ is, $\max\{\sigma^2\Theta/d^2,\ Ld/\sigma^2\}$ is smallest when the two are equal, which gives $T \approx \sqrt{L\Theta/d}$ steps to halve the error.
But this is an ideal situation; we rarely handle this kind of function. For an arbitrary function, if GD and MD go to two different points, how shall we choose?
This is why we use linear coupling: in every iteration, we run GD and MD simultaneously to get a GD iterate $y_t$ and an MD iterate $z_t$.

We can choose the point at which we query the gradient to be a convex combination of the two:
$$x_{t+1} = \tau z_t + (1 - \tau)\,y_t.$$

Then this can do telescoping! We can choose $\tau$ (and the MD step size) so that the per-step GD decrease exactly cancels the extra term in the MD regret bound, and the two guarantees chain together across iterations.

It seems its convergence rate is $O(L\Theta/T^2)$, i.e., $O(\sqrt{L\Theta/\epsilon})$ steps, matching Nesterov's accelerated gradient method.
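A sketch of the coupled iteration in Python with the Euclidean mirror map (so the MD step is a plain gradient step from $z_t$); the parameter schedule follows the standard accelerated-method pattern and is an assumption here:

```python
import numpy as np

def linear_coupling(grad, L, x0, steps):
    """Couple a GD step (y) and an MD step (z) through the query point x."""
    y, z = x0.copy(), x0.copy()
    for k in range(1, steps + 1):
        alpha = (k + 1) / (2 * L)          # MD step size, grows over time
        tau = 2 / (k + 1)                  # coupling weight, tau = 1 / (alpha * L)
        x = tau * z + (1 - tau) * y        # query point: convex combination
        g = grad(x)
        y = x - g / L                      # GD step: exploits large gradients
        z = z - alpha * g                  # MD step (Euclidean mirror map)
    return y

# Example: a badly conditioned quadratic f(x) = 0.5 * x' diag(s) x.
s = np.array([100.0, 1.0, 0.01])
x = linear_coupling(grad=lambda x: s * x, L=100.0, x0=np.ones(3), steps=300)
print(x)  # closer to 0 than plain GD with step 1/L after the same number of steps
```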
Matrix completion
For a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ with only a subset $\Omega$ of its entries observed, we want to recover the missing entries.

Assumptions

- Low rank: $\operatorname{rank}(M) = r \ll \min(n_1, n_2)$.
- The known entries are uniformly distributed in their positions.
- Incoherence (like the smoothness assumption in unsupervised learning):
  - We hope all coordinates are comparable, rather than only a few of them being too large or too small.
  - Intuitively, the "gradient" between the entries cannot be too sharp: a matrix whose mass is spread out is better than one concentrated in a few entries.
  - Mathematically: write the SVD $M = U\Sigma V^{\top}$ and require
    $$\max_i\|U^{\top}e_i\|^2 \le \frac{\mu r}{n_1}, \qquad \max_j\|V^{\top}e_j\|^2 \le \frac{\mu r}{n_2}.$$
  - Here $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $\mu$ is the incoherence parameter. (Attention: we do not need $n_1$ and $n_2$ to be equal, i.e., $M$ need not be a square matrix.)
Algorithm

Convex relaxation:

- How shall we formulate rank(A) naively? $\operatorname{rank}(A) = \|\sigma(A)\|_0$, the number of nonzero singular values.
- But in this way, minimizing rank(A) is hard, because the zero norm is not a good function (not convex).
- Relax it to the L1 norm of the singular values, $\|\sigma(A)\|_1$, which is exactly the nuclear norm $\|A\|_* = \sum_i\sigma_i(A)$. The problem becomes: minimize $\|A\|_*$ subject to $A_{ij} = M_{ij}$ for all $(i, j) \in \Omega$.
- This is equivalent to a convex SDP problem.

However! SDP solvers cannot handle a large matrix, say larger than $100 \times 100$.
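A sketch of the relaxation using cvxpy (an assumed choice of tooling; the tiny random instance is illustrative, and the installed conic solver must support the SDP-representable `normNuc` atom):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 20, 15, 2
M = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))   # rank-2 ground truth
omega = [(i, j) for i in range(n1) for j in range(n2) if rng.random() < 0.5]

A = cp.Variable((n1, n2))
objective = cp.Minimize(cp.normNuc(A))                     # nuclear norm ||A||_*
constraints = [A[i, j] == M[i, j] for (i, j) in omega]     # match observed entries
cp.Problem(objective, constraints).solve()
print(np.linalg.norm(A.value - M) / np.linalg.norm(M))     # small recovery error
```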
Alternating least squares minimization

Use the factorization $A = XY^{\top}$ with $X \in \mathbb{R}^{n_1 \times r}$ and $Y \in \mathbb{R}^{n_2 \times r}$, and minimize
$$\min_{X, Y}\ \sum_{(i,j)\in\Omega}\big(M_{ij} - (XY^{\top})_{ij}\big)^2.$$
Assume the solution is $(X^*, Y^*)$. The loss is nonconvex; one can easily find multiple equivalent minima, like $(cX^*, Y^*/c)$ for any $c \ne 0$.

So we alternate: fix $Y$ and minimize over $X$, then fix $X$ and minimize over $Y$. Every subproblem is convex (an ordinary least-squares problem) and can be solved efficiently.
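A sketch of the alternating scheme in Python; the masked least-squares subproblems are solved row by row (problem sizes and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 30, 20, 2
M = rng.normal(size=(n1, r)) @ rng.normal(size=(r, n2))  # rank-2 ground truth
mask = rng.random(size=(n1, n2)) < 0.6                   # observed positions Omega

X = rng.normal(size=(n1, r))
Y = rng.normal(size=(n2, r))
for _ in range(50):
    # Fix Y, solve a least-squares problem for each row of X.
    for i in range(n1):
        Yi = Y[mask[i]]                                  # rows j of Y with M[i, j] observed
        X[i] = np.linalg.lstsq(Yi, M[i, mask[i]], rcond=None)[0]
    # Fix X, solve a least-squares problem for each row of Y.
    for j in range(n2):
        Xj = X[mask[:, j]]
        Y[j] = np.linalg.lstsq(Xj, M[mask[:, j], j], rcond=None)[0]
print(np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M))   # small recovery error
```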
Non-convex Optimization
Is a convex landscape the only good case for GD? No.

Another important case: no spurious local minima.

- No-spurious-local-min: if all the local minima are equally good, it suffices to find one of them (although the problem is nonconvex)!
Escape saddle point
For a stationary point $x$ (where $\nabla f(x) = 0$):

- If $\nabla^2 f(x) \succeq 0$, it is a local minimum or a flat saddle point. The latter means we have no good direction to get out of the saddle point; that is, all first-order algorithms will fail there.
- If $\nabla^2 f(x)$ has both positive and negative eigenvalues, it is a strict saddle point, which means there is at least one direction to escape along.
- We call a function "strict saddle" if it does not contain any flat saddle points.
An informal theorem

SGD escapes all saddle points and local maxima, and converges to a local minimum after a polynomial number of steps, under the following assumptions:

- $f$ is smooth, bounded, and strict saddle;
- the SGD noise has non-negligible variance in every direction with constant probability.

Significance: by this theorem, if all local minima are equally good and the function satisfies the conditions above, then SGD converges to a global minimum.
Proof sketch

For the current point $x_t$:

Case 1: $\|\nabla f(x_t)\|$ is large. Since the gradient is big, the noise will not affect the decreasing trend of the function values.

Case 2: $\|\nabla f(x_t)\|$ is small.

- Subcase 1: we are close to a local minimum. A local minimum is a trap: although the gradient is noisy, the probability of getting out is very small. This guarantees that we will not leave a minimum after converging to it.
- Subcase 2: we are close to a saddle point. There must be a negative-curvature direction, by the strict saddle property. A random perturbation then gives a positive projection onto this negative direction, and following it lets us escape the saddle.
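A small numeric illustration (a toy example of my own, not from the notes): on $f(x, y) = x^2 - y^2$, plain GD started exactly at the strict saddle $(0, 0)$ stays stuck, while noisy gradient steps pick up the escape direction $\pm y$:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = lambda p: np.array([2 * p[0], -2 * p[1]])  # gradient of f(x, y) = x^2 - y^2

def run(noise_std, steps=200, eta=0.05):
    p = np.zeros(2)                                # start at the strict saddle (0, 0)
    for _ in range(steps):
        p = p - eta * (grad(p) + noise_std * rng.normal(size=2))
    return p

print(run(noise_std=0.0))   # [0, 0]: plain GD never leaves the saddle
print(run(noise_std=0.1))   # |y| grows: the noise projects onto the escape direction
```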
Escape local minimum
If the local minimum's trap is too wide, the noise makes no difference. But if the trap is sharp and narrow, the noise matters.

Informal proof

Update rule: $x_{t+1} = x_t - \eta\big(\nabla f(x_t) + \xi_t\big)$, where $\xi_t$ is zero-mean noise.

So, looking one step ahead, we can get
$$\mathbb{E}\big[\nabla f(x_{t+1})\big] = \mathbb{E}_{\xi_t}\Big[\nabla f\big(x_t - \eta\nabla f(x_t) - \eta\xi_t\big)\Big],$$
an average of the gradient over a noise-sized neighborhood of the deterministic GD iterate.

In expectation, it is like doing GD using an averaged gradient of the neighborhood, i.e., like doing a convolution on the gradient. A sharp and narrow minimum is smoothed away by this averaging, while a wide one survives.
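Making the convolution picture precise (a sketch under the assumption that the noise has a density $p$ and that differentiation and integration can be exchanged):
$$\mathbb{E}_{\xi}\big[\nabla f(x - \eta\xi)\big] = \int \nabla f(x - \eta\xi)\,p(\xi)\,d\xi = \nabla\tilde{f}(x), \qquad \tilde{f}(x) := \mathbb{E}_{\xi}\big[f(x - \eta\xi)\big],$$
and for symmetric noise $\tilde{f} = f * p_{\eta}$ is exactly the convolution of $f$ with the density $p_{\eta}$ of $\eta\xi$. Sharp, narrow dips in $f$ are averaged out in $\tilde{f}$, while wide basins are nearly unchanged.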