1. Step of Gradient Descent
$$x_{t+1} = x_t - \gamma \nabla f(x_t) \tag{1}$$
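As a concrete illustration, here is a minimal Python sketch of this update rule (the quadratic objective in the usage example is my own choice, not from the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma, T):
    """Run T steps of x_{t+1} = x_t - gamma * grad_f(x_t); return all iterates."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(T):
        xs.append(xs[-1] - gamma * grad_f(xs[-1]))
    return xs

# Example: f(x) = 0.5 * ||x||^2 has gradient x and minimum x* = 0.
xs = gradient_descent(lambda x: x, x0=[3.0, -4.0], gamma=0.1, T=50)
print(xs[-1])  # close to [0, 0]
```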
2. Vanilla Analysis
Let $g_t = \nabla f(x_t)$. Rearranging the update rule (1), we get:

$$g_t = (x_t - x_{t+1})/\gamma \tag{2}$$

Hence:

$$g_t^T (x_t - x^*) = \frac{1}{\gamma} (x_t - x_{t+1})^T (x_t - x^*) \tag{3}$$

Recall the basic vector identity $2 v^T w = \|v\|^2 + \|w\|^2 - \|v - w\|^2$. Applying it with $v = x_t - x_{t+1}$ and $w = x_t - x^*$, and then using $x_t - x_{t+1} = \gamma g_t$, we obtain:

\begin{align}
g_t^T (x_t - x^*) &= \frac{1}{\gamma} (x_t - x_{t+1})^T (x_t - x^*) \tag{4} \\
&= \frac{1}{2\gamma} \left[ \|x_t - x_{t+1}\|^2 + \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right] \tag{5} \\
&= \frac{1}{2\gamma} \left[ \gamma^2 \|g_t\|^2 + \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right] \tag{6} \\
&= \frac{\gamma}{2} \|g_t\|^2 + \frac{1}{2\gamma} \left[ \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \right] \tag{7}
\end{align}

Summing over $t = 0, \dots, T-1$, the last bracket telescopes:

\begin{align}
\sum_{t=0}^{T-1} g_t^T (x_t - x^*) &= \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \left[ \|x_0 - x^*\|^2 - \|x_T - x^*\|^2 \right] \tag{8} \\
&\leq \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \|x_0 - x^*\|^2 \tag{9}
\end{align}
Now we take convexity into consideration: $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$. With $x = x_t$ and $y = x^*$, this gives:

$$f(x_t) - f(x^*) \leq g_t^T (x_t - x^*) \tag{10}$$

Combining this with inequality (9):

$$\sum_{t=0}^{T-1} \left( f(x_t) - f(x^*) \right) \leq \frac{\gamma}{2} \sum_{t=0}^{T-1} \|g_t\|^2 + \frac{1}{2\gamma} \|x_0 - x^*\|^2 \tag{11}$$
Dividing both sides by $T$, this gives us an upper bound on the average error.
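Inequality (11) can be sanity-checked by accumulating both sides along an actual run. A minimal sketch, assuming the example objective $f(x) = \frac{1}{2}\|x\|^2$ with $x^* = 0$ (my own choice):

```python
import numpy as np

f = lambda x: 0.5 * np.dot(x, x)     # example objective with minimum x* = 0
grad = lambda x: x

gamma, T = 0.1, 100
x = np.array([3.0, -4.0])
x0 = x.copy()
gap_sum, grad_sq_sum = 0.0, 0.0
for _ in range(T):
    g = grad(x)
    gap_sum += f(x) - 0.0            # f(x_t) - f(x*): the left-hand side of (11)
    grad_sq_sum += np.dot(g, g)
    x = x - gamma * g

rhs = gamma / 2 * grad_sq_sum + np.dot(x0, x0) / (2 * gamma)
print(gap_sum <= rhs + 1e-9)         # True: inequality (11) holds on this run
```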
3. Lipschitz Convex Functions: $O(1/\epsilon^2)$ Steps
Theorem 2.1:

Let $f: \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable with a global minimum $x^*$. Suppose that $\|x_0 - x^*\| \leq R$ and $\|\nabla f(x)\| \leq B$ for all $x$. Choosing the stepsize $\gamma = \frac{R}{B\sqrt{T}}$, gradient descent yields:

$$\frac{1}{T} \sum_{t=0}^{T-1} \left( f(x_t) - f(x^*) \right) \leq \frac{RB}{\sqrt{T}} \tag{12}$$
Proof:

Plug the assumptions into inequality (11): with $\|g_t\| \leq B$ and $\|x_0 - x^*\| \leq R$,

$$\sum_{t=0}^{T-1} \left( f(x_t) - f(x^*) \right) \leq \frac{\gamma}{2} T B^2 + \frac{R^2}{2\gamma}.$$

The right-hand side is minimized exactly at $\gamma = \frac{R}{B\sqrt{T}}$, where both terms equal $\frac{RB\sqrt{T}}{2}$; dividing by $T$ yields (12).
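A quick numerical illustration of the theorem, as a sketch: the objective $f(x) = \sqrt{1 + x^2}$ is my own example choice; it is convex and differentiable with $|f'(x)| \leq 1$, so $B = 1$ and $x^* = 0$:

```python
import numpy as np

f = lambda x: np.sqrt(1.0 + x * x)       # convex, |f'(x)| <= 1, minimum at x* = 0
grad = lambda x: x / np.sqrt(1.0 + x * x)

x0 = 10.0
R, B, T = abs(x0), 1.0, 10_000
gamma = R / (B * np.sqrt(T))             # the stepsize from Theorem 2.1

x, gap_sum = x0, 0.0
for _ in range(T):
    gap_sum += f(x) - f(0.0)
    x -= gamma * grad(x)

# Average error vs. the R*B/sqrt(T) bound from (12).
print(gap_sum / T, "<=", R * B / np.sqrt(T))
```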
4. Smooth Convex Functions: $O(1/\epsilon)$ Steps
Definition 2.2: $f$ is smooth with parameter $L$ if for all $x, y$:

$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|x - y\|^2 \tag{13}$$
In particular, every quadratic function of the form $f(x) = x^T Q x + b^T x + c$ is smooth, with parameter $L = \|Q + Q^T\|$ (the spectral norm of its constant Hessian).
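For instance, the smoothness parameter of such a quadratic can be computed and checked numerically. A sketch, with arbitrary example data for $Q$, $b$, $c$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Q = A @ A.T                                # example symmetric PSD matrix
b, c = rng.standard_normal(3), 1.0

f = lambda x: x @ Q @ x + b @ x + c
grad = lambda x: (Q + Q.T) @ x + b         # gradient of x^T Q x + b^T x + c

L = np.linalg.norm(Q + Q.T, 2)             # spectral norm of the constant Hessian

# Check the smoothness inequality (13) at random point pairs.
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert f(y) <= f(x) + grad(x) @ (y - x) + L / 2 * np.dot(x - y, x - y) + 1e-9
print("smooth with parameter L =", L)
```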
Lemma 2.4:

Let $f: \mathbb{R}^d \to \mathbb{R}$ be convex and differentiable. The following statements are equivalent:

$$(i)\ f \text{ is smooth with parameter } L \tag{14}$$
$$(ii)\ \|\nabla f(y) - \nabla f(x)\| \leq L \|x - y\| \text{ for all } x, y \in \mathbb{R}^d \tag{15}$$
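For a quadratic, property (ii) can be verified directly, since $\nabla f(y) - \nabla f(x) = (Q + Q^T)(y - x)$. A small sketch along the same lines as above (example data again my own choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
Q = A @ A.T                                # symmetric PSD, so x^T Q x is convex
H = Q + Q.T                                # constant Hessian of x^T Q x
L = np.linalg.norm(H, 2)

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # Property (ii): the gradient map is L-Lipschitz.
    assert np.linalg.norm(H @ y - H @ x) <= L * np.linalg.norm(x - y) + 1e-9
print("gradient map is Lipschitz with constant", L)
```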
Lemma 2.6:

Let $f: \mathbb{R}^d \to \mathbb{R}$ be differentiable and smooth with parameter $L$. Choosing $\gamma = \frac{1}{L}$, gradient descent yields:

$$f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2 \tag{16}$$
Proof:

With this stepsize, the update rule (1) becomes:

$$x_{t+1} = x_t - \frac{1}{L} \nabla f(x_t) \tag{17}$$

By the smoothness definition (13), and substituting $\nabla f(x_t) = L(x_t - x_{t+1})$ from (17):
\begin{align}
f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^T (x_{t+1} - x_t) + \frac{L}{2} \|x_t - x_{t+1}\|^2 \tag{18} \\
&= f(x_t) + L (x_t - x_{t+1})^T (x_{t+1} - x_t) + \frac{L}{2} \|x_t - x_{t+1}\|^2 \tag{19} \\
&= f(x_t) - \frac{L}{2} \cdot \frac{1}{L^2} \|\nabla f(x_t)\|^2 \tag{20} \\
&= f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2 \tag{21}
\end{align}
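The per-step decrease (16) is easy to observe numerically. A minimal sketch on an example quadratic of my own choosing, where $L$ is the spectral norm of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
Q = A @ A.T                                # symmetric PSD Hessian
f = lambda x: 0.5 * x @ Q @ x              # smooth with parameter L = ||Q||
grad = lambda x: Q @ x

L = np.linalg.norm(Q, 2)
x = rng.standard_normal(4)
for _ in range(20):
    g = grad(x)
    x_next = x - g / L                     # gamma = 1/L, as in Lemma 2.6
    # Sufficient decrease (16): f(x_{t+1}) <= f(x_t) - ||g||^2 / (2L)
    assert f(x_next) <= f(x) - np.dot(g, g) / (2 * L) + 1e-9
    x = x_next
print("final f(x) =", f(x))
```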