Differentiation
Def. Gradient
Let $f : X \subseteq \mathbb{R}^N \to \mathbb{R}$ be differentiable. Then the gradient of $f$ at $x \in X$, denoted by $\nabla f(x)$, is defined by
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_N}(x) \end{bmatrix}$$
Def. Hessian
Let $f : X \subseteq \mathbb{R}^N \to \mathbb{R}$ be twice differentiable. Then the Hessian of $f$ at $x \in X$, denoted by $\nabla^2 f(x)$, is defined by
$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_N \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_N^2}(x) \end{bmatrix}$$
Th. Fermat's theorem
Let $f : X \subseteq \mathbb{R}^N \to \mathbb{R}$ be differentiable. If $f$ admits a local extremum at $x^*$, then $\nabla f(x^*) = 0$.
Convexity
Def. Convex set
A set $X \subseteq \mathbb{R}^N$ is said to be convex if for any $x, y \in X$ the segment $[x, y]$ lies in $X$, that is, $\{\alpha x + (1 - \alpha) y : 0 \le \alpha \le 1\} \subseteq X$.
Th. Operations that preserve convexity
If $C_i$ is convex for all $i \in I$, then $\bigcap_{i \in I} C_i$ is also convex.
If $C_1, C_2$ are convex, then the sum $C_1 + C_2 = \{x_1 + x_2 : x_1 \in C_1, x_2 \in C_2\}$ is also convex.
If $C_1, C_2$ are convex, then the product $C_1 \times C_2 = \{(x_1, x_2) : x_1 \in C_1, x_2 \in C_2\}$ is also convex.
Any projection of a convex set is also convex
Def. Convex hull
The convex hull $\operatorname{conv}(X)$ of a set $X \subseteq \mathbb{R}^N$ is the minimal convex set containing $X$, that is,
$$\operatorname{conv}(X) = \left\{ \sum_{i=1}^{m} \alpha_i x_i : m \ge 1, \ x_1, \cdots, x_m \in X, \ \alpha_i \ge 0, \ \sum_{i=1}^{m} \alpha_i = 1 \right\}$$
Def. Epigraph
The epigraph of $f : X \to \mathbb{R}$, denoted by $\operatorname{Epi} f$, is defined by $\{(x, y) : x \in X, \ y \ge f(x)\}$.
Def. Convex function
Let $X$ be a convex set. A function $f : X \to \mathbb{R}$ is said to be convex iff $\operatorname{Epi} f$ is convex, or equivalently, for all $x, y \in X$ and $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$$
Moreover, $f$ is said to be strictly convex if the inequality is strict whenever $x \ne y$ and $\alpha \in (0, 1)$; $f$ is said to be (strictly) concave if $-f$ is (strictly) convex.
Th. Convex functions characterized by the first-order derivative
Suppose $f : X \subseteq \mathbb{R}^N \to \mathbb{R}$ is differentiable. Then $f$ is convex iff $\operatorname{dom}(f)$ is convex and
$$\forall x, y \in \operatorname{dom}(f), \quad f(y) - f(x) \ge \nabla f(x) \cdot (y - x)$$
Swapping $x$ and $y$ gives $f(x) - f(y) \ge \nabla f(y) \cdot (x - y)$; adding the two inequalities yields
$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$$
This says the gradient is monotone (its increments have a nonnegative inner product with the displacement), which is another equivalent condition for convexity.
Th. Convex functions characterized by the second-order derivative
Suppose $f : X \subseteq \mathbb{R}^N \to \mathbb{R}$ is twice differentiable. Then $f$ is convex iff $\operatorname{dom}(f)$ is convex and its Hessian is positive semidefinite:
$$\forall x \in \operatorname{dom}(f), \quad \nabla^2 f(x) \succeq 0$$
A symmetric matrix is positive semidefinite iff all of its eigenvalues are nonnegative; $A \succeq B$ means that $A - B$ is positive semidefinite.
If $f$ is a function of a scalar variable (e.g. $x \mapsto x^2$), then $f$ is convex iff $\forall x \in \operatorname{dom}(f), \ f''(x) \ge 0$.
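As a quick numerical illustration of the second-order condition (a minimal sketch with numpy; the log-sum-exp test function and the sampled points are my own choices, not from the text):

```python
import numpy as np

# f(x) = log(sum(exp(x))) -- the log-sum-exp function, a standard convex example.
def hessian_logsumexp(x):
    p = np.exp(x - x.max())              # softmax probabilities (shifted for numerical stability)
    p = p / p.sum()
    return np.diag(p) - np.outer(p, p)   # Hessian of log-sum-exp at x

rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=5)
    eigvals = np.linalg.eigvalsh(hessian_logsumexp(x))
    assert eigvals.min() >= -1e-12       # all eigenvalues nonnegative => PSD => convex
print("Hessian of log-sum-exp is PSD at all sampled points")
```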
For example
Linear functions are both convex and concave.
Any norm $\|\cdot\|$ over a convex set $X$ is a convex function:
$$\|\alpha x + (1 - \alpha) y\| \le \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\|$$
Using composition rules to prove convexity
Th. Composition of convex/concave functions
Assume $h : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R}^N \to \mathbb{R}$ are twice differentiable, and define $f(x) = h(g(x))$ for all $x \in \mathbb{R}^N$. Then:
$h$ is convex & non-decreasing, $g$ is convex $\implies$ $f$ is convex
$h$ is convex & non-increasing, $g$ is concave $\implies$ $f$ is convex
$h$ is concave & non-decreasing, $g$ is concave $\implies$ $f$ is concave
$h$ is concave & non-increasing, $g$ is convex $\implies$ $f$ is concave
Proof: verify the case $N = 1$ using $f''(x) = h''(g(x))\,g'(x)^2 + h'(g(x))\,g''(x)$; this suffices, since convexity (concavity) only needs to be checked along all lines that intersect the domain.
Example: $g$ could be any norm $\|\cdot\|$.
Th. Pointwise maximum of convex functions
If $f_i$ is a convex function defined over a convex set $C$ for all $i \in I$, then $f(x) = \sup_{i \in I} f_i(x)$, $x \in C$, is a convex function.
Proof: $\operatorname{Epi} f = \bigcap_{i \in I} \operatorname{Epi} f_i$ is convex.
$f(x) = \max_{i \in I} (w_i \cdot x + b_i)$ over a convex set is a convex function (a pointwise maximum of affine functions).
The maximum eigenvalue $\lambda_{\max}(M)$ over the set of symmetric matrices is a convex function, since $\lambda_{\max}(M) = \sup_{\|x\|_2 = 1} x^\top M x$ is a supremum of linear functions $M \mapsto x^\top M x$.
More generally, let $\lambda_1(M) \ge \cdots \ge \lambda_n(M)$ denote the eigenvalues in decreasing order; then $M \mapsto \sum_{i=1}^{k} \lambda_i(M)$ (the sum of the top $k$ eigenvalues) is convex, while $M \mapsto \sum_{i=n-k+1}^{n} \lambda_i(M) = -\sum_{i=1}^{k} \lambda_i(-M)$ (the sum of the bottom $k$ eigenvalues) is concave.
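A small numerical sanity check of the convexity of $\lambda_{\max}$ (a sketch; the random symmetric test matrices are my own):

```python
import numpy as np

def lam_max(M):
    return np.linalg.eigvalsh(M).max()       # largest eigenvalue of a symmetric matrix

rng = np.random.default_rng(1)
for _ in range(1000):
    A = rng.normal(size=(6, 6)); A = (A + A.T) / 2   # random symmetric matrices
    B = rng.normal(size=(6, 6)); B = (B + B.T) / 2
    a = rng.uniform()
    # convexity: lam_max(a*A + (1-a)*B) <= a*lam_max(A) + (1-a)*lam_max(B)
    assert lam_max(a * A + (1 - a) * B) <= a * lam_max(A) + (1 - a) * lam_max(B) + 1e-10
print("convexity inequality for lambda_max holds on all sampled pairs")
```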
Th. Partial infimum
Let $f$ be a convex function defined over a convex set $C \subseteq X \times Y$, and let $B \subseteq Y$ be a convex set. Then $A = \{x \in X : \exists y \in B, (x, y) \in C\}$ is a convex set if non-empty, and $g(x) = \inf_{y \in B} f(x, y)$ for $x \in A$ is a convex function.
For example, the distance to a convex set $B$, $d(x) = \inf_{y \in B} \|x - y\|$, is a convex function.
Th. Jensen's inequality
Let $X$ be a random variable taking values in a convex set $C \subseteq \mathbb{R}^N$, and let $f$ be a convex function defined over $C$. Then $\mathbb{E}[X] \in C$, $\mathbb{E}[f(X)]$ is finite, and
$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$
Sketch of proof: extend $f(\sum_i \alpha_i x_i) \le \sum_i \alpha_i f(x_i)$ with $\sum_i \alpha_i = 1$, where the $\alpha_i$ can be interpreted as probabilities, to arbitrary distributions.
Smoothness, strong convexity
Reference: https://zhuanlan.zhihu.com/p/619288199
The idea is Lipschitz-type control of the gradient, equivalently, bounds on the second derivative (Hessian).
Def. β-smooth
A function $f$ is said to be $\beta$-smooth if
$$\forall x, y \in \operatorname{dom}(f), \quad \|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$$
This is equivalent to each of the following statements:
$\frac{\beta}{2}\|x\|^2 - f(x)$ is a convex function
$\forall x, y \in \operatorname{dom}(f), \quad f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|y - x\|^2$
$\nabla^2 f(x) \preceq \beta I$
Proof / remarks
To show $g(x) = \frac{\beta}{2}\|x\|^2 - f(x)$ is convex, check the monotonicity condition $\langle \nabla g(x) - \nabla g(y), x - y \rangle \ge 0$, which follows from the Cauchy-Schwarz inequality.
Intuitively, the oscillations of $f$ cannot beat the convexity of $\frac{\beta}{2}\|x\|^2$, i.e. $f$ does not fluctuate too much, hence "smooth".
For the second and third statements, substitute $g$ into $g(y) - g(x) \ge \nabla g(x) \cdot (y - x)$ and $\nabla^2 g(x) \succeq 0$; their geometric meaning is discussed below.
Def. α-strongly convex
A function $f$ is said to be $\alpha$-strongly convex if
$$\forall x, y \in \operatorname{dom}(f), \quad \|\nabla f(x) - \nabla f(y)\| \ge \alpha \|x - y\|$$
Equivalently:
$f(x) - \frac{\alpha}{2}\|x\|^2$ is a convex function
$\forall x, y \in \operatorname{dom}(f), \quad f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}\|y - x\|^2$
$\nabla^2 f(x) \succeq \alpha I$
Def. γ-well-conditioned
A function $f$ is said to be $\gamma$-well-conditioned if it is both $\alpha$-strongly convex and $\beta$-smooth; the condition number of $f$ is defined as $\gamma = \alpha / \beta \le 1$.
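For a concrete instance (my own example, not from the text): a quadratic $f(x) = \frac{1}{2} x^\top A x$ with $A \succ 0$ has $\nabla^2 f(x) = A$, so $\alpha = \lambda_{\min}(A)$, $\beta = \lambda_{\max}(A)$, and $\gamma = \lambda_{\min}(A)/\lambda_{\max}(A)$:

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 4))
A = Q @ Q.T + 0.1 * np.eye(4)          # a random positive definite matrix

eig = np.linalg.eigvalsh(A)
alpha, beta = eig.min(), eig.max()     # strong convexity / smoothness constants of 1/2 x'Ax
gamma = alpha / beta                   # condition number gamma = alpha / beta <= 1
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.3f}")
```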
Th. Linear combination of two convex functions
For the sum of two such convex functions:
If $f$ is $\alpha_1$-strongly convex and $g$ is $\alpha_2$-strongly convex, then $f + g$ is $(\alpha_1 + \alpha_2)$-strongly convex.
If $f$ is $\beta_1$-smooth and $g$ is $\beta_2$-smooth, then $f + g$ is $(\beta_1 + \beta_2)$-smooth.
For multiplication by a scalar $k > 0$:
If $f$ is $\alpha$-strongly convex, then $kf$ is $(k\alpha)$-strongly convex.
If $f$ is $\beta$-smooth, then $kf$ is $(k\beta)$-smooth.
Proof: use the gradient monotonicity of convex functions, $\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge 0$, together with the convexity of $\frac{\beta}{2}\|x\|^2 - f(x)$ and of $f(x) - \frac{\alpha}{2}\|x\|^2$.
Projections onto convex sets
The algorithms below involve projecting onto a convex set. Define the projection of $y$ onto a convex set $K$ as
$$\Pi_K(y) \triangleq \operatorname*{arg\,min}_{x \in K} \|x - y\|$$
One can show that the projection is always unique. It also has the following important property:
Th. Pythagorean theorem
Let $K \subseteq \mathbb{R}^d$ be convex, $y \in \mathbb{R}^d$, and $x = \Pi_K(y)$. Then for any $z \in K$, $\|y - z\| \ge \|x - z\|$.
That is, for any point of the convex set, its distance to the projected point is no larger than its distance to the original point.
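Two standard examples with closed-form projections (my own illustration, not from the source), plus a spot check of the Pythagorean property:

```python
import numpy as np

def proj_ball(y, radius=1.0):
    """Project y onto the Euclidean ball {x : ||x|| <= radius}."""
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

def proj_box(y, lo=-1.0, hi=1.0):
    """Project y onto the box {x : lo <= x_i <= hi} (coordinate-wise clipping)."""
    return np.clip(y, lo, hi)

rng = np.random.default_rng(3)
y = rng.normal(size=5) * 3.0
x = proj_ball(y)                        # x = Pi_K(y) for K = unit ball
for _ in range(100):
    z = proj_ball(rng.normal(size=5))   # an arbitrary point z inside K
    assert np.linalg.norm(y - z) >= np.linalg.norm(x - z) - 1e-12   # ||y - z|| >= ||x - z||
print("Pythagorean property holds on all sampled z")
```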
Constrained optimization
Def. Constrained optimization problem
Let $X \subseteq \mathbb{R}^N$ and $f, g_i : X \to \mathbb{R}$ for $i \in [m]$. The constrained optimization problem (also called the primal problem) has the form
$$\min_{x \in X} f(x) \quad \text{subject to: } g_i(x) \le 0, \ \forall i \in [m]$$
Write $p^* = \inf_{x \in X,\, g(x) \le 0} f(x)$ for its optimal value. Note that so far we assume no convexity; an equality constraint $g = 0$ can be expressed by the pair $g \le 0$, $-g \le 0$.
Dual problem and saddle point
To attack such problems, first introduce the Lagrange function, which brings the constraints in as nonpositive terms, and then pass to the dual problem.
Def. Lagrange function
For the constrained optimization problem, define the Lagrange function as
$$\forall x \in X, \ \forall \alpha \ge 0, \quad L(x, \alpha) = f(x) + \sum_{i=1}^{m} \alpha_i g_i(x)$$
where $\alpha = (\alpha_1, \cdots, \alpha_m)^\top$ are called the dual variables.
For an equality constraint $g = 0$, its multiplier $\alpha = \alpha^+ - \alpha^-$ need not be nonnegative (but the theorems below require $g$ and $-g$ to both be convex, so $g$ must be affine, i.e. of the form $w \cdot x + b$).
Note that $p^* = \inf_x \sup_{\alpha \ge 0} L(x, \alpha)$: whenever $x$ violates a constraint, the supremum over $\alpha$ is $+\infty$, so the constraints are encoded.
Here comes the interesting part: we can construct a concave function, called the dual function.
Def. Dual function
For the constrained optimization problem, define the dual function as
$$\forall \alpha \ge 0, \quad F(\alpha) = \inf_{x \in X} L(x, \alpha) = \inf_{x \in X} \left( f(x) + \sum_{i=1}^{m} \alpha_i g_i(x) \right)$$
It is concave, because $L$ is an affine function of $\alpha$ and the pointwise infimum preserves concavity.
Also note that for every $\alpha \ge 0$, $F(\alpha) \le \inf_{x \in X,\, g(x) \le 0} f(x) = p^*$.
Now define the dual problem.
Def. Dual problem
For the constrained optimization problem, define the dual problem as
$$\max_{\alpha} F(\alpha) \quad \text{subject to: } \alpha \ge 0$$
The dual problem is a convex optimization problem (maximizing a concave function); denote its optimal value by $d^*$. From the above, $d^* \le p^*$, that is:
$$d^* = \sup_{\alpha \ge 0} \inf_x L(x, \alpha) \le \inf_x \sup_{\alpha \ge 0} L(x, \alpha) = p^*$$
This is called weak duality; when equality holds it is called strong duality.
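A tiny worked example (my own, not from the source): for $\min x^2$ subject to $1 - x \le 0$, the Lagrangian is $L(x, \alpha) = x^2 + \alpha(1 - x)$ and the dual function is $F(\alpha) = \alpha - \alpha^2/4$; a rough numerical check that $d^* = p^* = 1$:

```python
import numpy as np

f = lambda x: x ** 2                     # objective
g = lambda x: 1 - x                      # constraint g(x) <= 0, i.e. x >= 1
L = lambda x, a: f(x) + a * g(x)         # Lagrange function

xs = np.linspace(-3.0, 3.0, 2401)        # grid over x (the point x = 1 lies exactly on the grid)
alphas = np.linspace(0.0, 5.0, 501)

F = np.array([L(xs, a).min() for a in alphas])   # dual function F(a) = inf_x L(x, a)
d_star = F.max()                                 # dual optimal value
p_star = f(xs[g(xs) <= 0]).min()                 # primal optimal value over feasible x
print(p_star, d_star)                            # both equal 1 here: strong duality
```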
Next we will show:
When the convex optimization problem satisfies a constraint qualification (Slater's condition), a sufficient condition, we have $d^* = p^*$, and being a saddle point of the Lagrange function is a necessary and sufficient condition for a solution.
Def. Constraint qualification (Slater's condition)
Assume the interior of $X$ is non-empty, $\operatorname{int}(X) \ne \emptyset$:
The strong constraint qualification (Slater's condition) is
$$\exists \bar{x} \in \operatorname{int}(X) : \forall i \in [m], \ g_i(\bar{x}) < 0$$
The weak constraint qualification (weak Slater's condition) is
$$\exists \bar{x} \in \operatorname{int}(X) : \forall i \in [m], \ \big(g_i(\bar{x}) < 0\big) \vee \big(g_i(\bar{x}) = 0 \wedge g_i \text{ affine}\big)$$
(Is this condition simply asserting that a suitable feasible point exists?)
Based on Slater's condition, we now state that a saddle point of the Lagrange function is a necessary and sufficient condition for a solution of the constrained optimization problem.
Th. Saddle point - sufficient condition
For the constrained optimization problem, if its Lagrange function has a saddle point $(x^*, \alpha^*)$, i.e.
$$\forall x \in X, \ \forall \alpha \ge 0, \quad L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$$
then $x^*$ is a solution of the problem: $f(x^*) = p^* = \inf_{g(x) \le 0} f(x)$.
Th. Saddle point - necessary condition
Assume $f$ and $g_i$, $i \in [m]$, are convex functions:
If Slater's condition holds, then for any solution $x^*$ of the constrained problem there exists $\alpha^* \ge 0$ such that $(x^*, \alpha^*)$ is a saddle point of the Lagrange function.
If the weak Slater's condition holds and $f, g_i$ are differentiable, then for any solution $x^*$ there exists $\alpha^* \ge 0$ such that $(x^*, \alpha^*)$ is a saddle point of the Lagrange function.
The book does not give the proof of necessity, and the sufficiency proof is straightforward but not elegant, so I will not copy them; instead, here is my own line of reasoning (possibly flawed):
Return to the original inequality:
$$d^* = \sup_{\alpha \ge 0} \inf_x L(x, \alpha) \le \inf_x \sup_{\alpha \ge 0} L(x, \alpha) = p^*$$
Let $x^*$ attain $p^*$, i.e. $p^* = \sup_\alpha L(x^*, \alpha) \le \sup_\alpha L(x, \alpha)$ for all $x$ (possibly not unique).
Let $\alpha^*$ attain $d^*$, i.e. $d^* = \inf_x L(x, \alpha^*) \ge \inf_x L(x, \alpha)$ for all $\alpha$ (possibly not unique).
Uniqueness (of the saddle point) is related to strict convexity (without strictness there may be a "flat" region of optima); we leave this for later.
We aim to show that the following four statements are equivalent:
(A) $L$ has a saddle point; (B) some pair $(x^*, \alpha^*)$ as defined above is a saddle point; (C) $p^* = d^*$; (D) $p^* = d^* = L(x^*, \alpha^*)$.
Clearly B → A and D → C.
Proof of A → B and B → C, D:
If a saddle point exists, call it $(x', \alpha')$. Then $L(x', \alpha^*) \le L(x', \alpha') = \inf_x L(x, \alpha') \le \inf_x L(x, \alpha^*)$; since the two ends also satisfy $L(x', \alpha^*) \ge \inf_x L(x, \alpha^*)$, equality holds throughout. In particular $\inf_x L(x, \alpha^*) = L(x', \alpha^*) = \inf_x L(x, \alpha')$, and by the definition of $\alpha^*$ this is attained, so we may take $\alpha^* \leftarrow \alpha'$; the same argument applies to $x^*$ and $x'$.
Now $(x^*, \alpha^*)$ is a saddle point, i.e. $\forall x, \forall \alpha, \ L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$, and therefore $p^* = \sup_\alpha L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_x L(x, \alpha^*) = d^*$.
Proof of C → B, D:
If $p^* = d^*$, i.e. $\sup_\alpha L(x^*, \alpha) = \inf_x L(x, \alpha^*)$, and moreover $\sup_\alpha L(x^*, \alpha) \ge L(x^*, \alpha^*) \ge \inf_x L(x, \alpha^*)$, then all three quantities are equal; hence $\forall x, \forall \alpha$, $L(x^*, \alpha) \le \sup_\alpha L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_x L(x, \alpha^*) \le L(x, \alpha^*)$, so $(x^*, \alpha^*)$ is a saddle point.
This completes the proof. Nice!
KKT conditions
Lagrangian version
If the constrained optimization problem is convex, a single theorem settles it: KKT.
Th. Karush-Kuhn-Tucker's theorem
Assume $f, g_i : X \to \mathbb{R}$, $\forall i \in [m]$, are convex and differentiable, and that Slater's condition holds. Consider the constrained optimization problem
$$\min_{x \in X, \ g(x) \le 0} f(x)$$
with Lagrange function $L(x, \alpha) = f(x) + \alpha \cdot g(x)$, $\alpha \ge 0$.
Then $\bar{x}$ is a solution of the problem iff there exists $\bar{\alpha} \ge 0$ satisfying the KKT conditions:
$$\begin{aligned} \nabla_x L(\bar{x}, \bar{\alpha}) &= \nabla_x f(\bar{x}) + \bar{\alpha} \cdot \nabla_x g(\bar{x}) = 0 \\ \nabla_\alpha L(\bar{x}, \bar{\alpha}) &= g(\bar{x}) \le 0 \\ \bar{\alpha} \cdot g(\bar{x}) &= 0 \end{aligned}$$
The last two are called the complementarity conditions: for every $i \in [m]$, $\bar{\alpha}_i \ge 0$ and $g_i(\bar{x}) \le 0$, with $\bar{\alpha}_i g_i(\bar{x}) = 0$.
Proof of both directions:
Necessity: if $\bar{x}$ is a solution, then there exists $\bar{\alpha}$ such that $(\bar{x}, \bar{\alpha})$ is a saddle point, which yields the KKT conditions. The first condition is exactly the minimality of $L(\cdot, \bar{\alpha})$ at $\bar{x}$; for the second and third:
$$\forall \alpha \ge 0, \ L(\bar{x}, \alpha) \le L(\bar{x}, \bar{\alpha}) \implies \alpha \cdot g(\bar{x}) \le \bar{\alpha} \cdot g(\bar{x}); \quad \alpha \to +\infty \implies g(\bar{x}) \le 0, \quad \alpha \to 0 \implies \bar{\alpha} \cdot g(\bar{x}) = 0$$
Sufficiency: if the KKT conditions hold, then for any $x$ with $g(x) \le 0$:
$$\begin{aligned} f(x) - f(\bar{x}) &\ge \nabla_x f(\bar{x}) \cdot (x - \bar{x}) && \text{convexity of } f \\ &= -\bar{\alpha} \cdot \nabla_x g(\bar{x}) \cdot (x - \bar{x}) && \text{first condition} \\ &\ge -\bar{\alpha} \cdot (g(x) - g(\bar{x})) && \text{convexity of } g \\ &= -\bar{\alpha} \cdot g(x) \ge 0 && \text{third condition} \end{aligned}$$
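A small worked example (my own, not from the book): minimize $x_1^2 + x_2^2$ subject to $g(x) = 1 - x_1 - x_2 \le 0$. Stationarity $\nabla f(\bar{x}) + \bar{\alpha}\nabla g(\bar{x}) = 2\bar{x} - \bar{\alpha}(1, 1)^\top = 0$ together with complementarity gives $\bar{x} = (1/2, 1/2)$, $\bar{\alpha} = 1$, which can be checked directly:

```python
import numpy as np

x_bar = np.array([0.5, 0.5])
a_bar = 1.0

grad_f = 2 * x_bar                     # gradient of f(x) = x1^2 + x2^2
grad_g = np.array([-1.0, -1.0])        # gradient of g(x) = 1 - x1 - x2
g_val = 1 - x_bar.sum()

print(grad_f + a_bar * grad_g)         # stationarity: should be [0, 0]
print(g_val <= 0, a_bar >= 0)          # primal / dual feasibility
print(a_bar * g_val)                   # complementarity: should be 0
```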
Gradient descent version
The KKT theorem can also be stated from another angle, which also gives another way to solve constrained optimization problems: gradient descent.
Since all the $g_i$ are convex functions, $K = \{x : g(x) \le 0\}$ is clearly a convex set; the constraints simply restrict the variable to the convex set $K$, hence:
Th. Karush-Kuhn-Tucker's theorem, gradient descent version
Assume $f$ is convex and differentiable and $K$ is a convex set. Consider the constrained optimization problem
$$\min_{x \in K} f(x)$$
Then $x^*$ is a solution of the problem iff
$$\forall y \in K, \quad -\nabla f(x^*)^\top (y - x^*) \le 0$$
The idea: the negative gradient is the direction in which $f$ decreases; if it had a positive component along some $y - x^*$ (a positive inner product), we could move in that direction and decrease $f$, so $x^*$ would not be optimal.
This theorem is the foundation of our gradient descent algorithms.
Gradient descent
Unconstrained case
The gradient descent (GD) algorithm for unconstrained convex optimization.
Write $\nabla_t = \nabla f(x_t)$, $h_t = f(x_t) - f(x^*)$, $d_t = \|x_t - x^*\|$.
Algorithm. Gradient descent
Input: $T$, $x_0$, $\{\eta_t\}$
for $t = 0, \cdots, T-1$ do
  $x_{t+1} = x_t - \eta_t \nabla_t$
end for
return $\bar{x} = \operatorname*{argmin}_{x_t} \{f(x_t)\}$
Choosing $\eta_t$ well determines the efficiency of the algorithm. Taking the Polyak stepsize $\eta_t = \frac{h_t}{\|\nabla_t\|^2}$, we have:
Th. Bound for GD with Polyak stepsize
Assume $\|\nabla_t\| \le G$. Then
$$f(\bar{x}) - f(x^*) = \min_{0 \le t \le T} \{h_t\} \le \min\left\{ \frac{G d_0}{\sqrt{T}}, \ \frac{2\beta d_0^2}{T}, \ \frac{3G^2}{\alpha T}, \ \beta d_0^2 \left(1 - \frac{\gamma}{4}\right)^T \right\}$$
The proof of this theorem rests on $d_{t+1}^2 \le d_t^2 - h_t^2 / \|\nabla_t\|^2$, which bothers me, because a naive triangle-inequality estimate always comes out in the wrong direction...
To prove the efficiency of such algorithms one typically tracks some "potential" quantity: for example, the decrease of the potential $h_t = f(x_t) - f(x^*)$, via the difference $h_{t+1} - h_t$ and the gradient norm $\|\nabla_t\|$; or the decrease of the distance to the optimum, $d_t = \|x_t - x^*\|$, via $d_{t+1} - d_t$.
First, two lemmas, provable by plugging in smoothness and strong convexity; they are convenient for the later estimates:
$$\frac{\alpha}{2} d_t^2 \le h_t \le \frac{\beta}{2} d_t^2, \qquad \frac{1}{2\beta}\|\nabla_t\|^2 \le h_t \le \frac{1}{2\alpha}\|\nabla_t\|^2$$
For the update $x_{t+1} = x_t - \eta_t \nabla_t$: trying to bound $d_{t+1} - d_t$ by subtracting $x^*$ from both sides and taking norms only yields a triangle-inequality bound in the wrong direction, so that route stalls; but if one assumes $d_{t+1}^2 \le d_t^2 - h_t^2 / \|\nabla_t\|^2$, the bound above follows (note that $f(\bar{x}) - f(x^*) \le \frac{1}{T}\sum_t h_t$, which can then be bounded).
Instead, we can start from $h_{t+1} - h_t$:
$$\begin{aligned} h_{t+1} - h_t = f(x_{t+1}) - f(x_t) &\le \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 \\ &= -\eta_t\|\nabla_t\|^2 + \frac{\beta}{2}\eta_t^2\|\nabla_t\|^2 = -\frac{1}{2\beta}\|\nabla_t\|^2 && \text{taking } \eta_t = \tfrac{1}{\beta} \\ &\le -\frac{\alpha}{\beta} h_t = -\gamma h_t \end{aligned}$$
Hence $h_T \le (1-\gamma) h_{T-1} \le (1-\gamma)^T h_0 \le e^{-\gamma T} h_0$, which is a nice convergence guarantee, and we obtain:
Th. Bound for GD, unconstrained case
Assume $f$ is $\gamma$-well-conditioned and take $\eta_t = \frac{1}{\beta}$. Then
$$h_T \le e^{-\gamma T} h_0$$
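A minimal sketch of this guarantee on a toy quadratic (my own example; for $f(x) = \frac{1}{2}x^\top A x$ with $A \succ 0$, $\alpha$ and $\beta$ are the extreme eigenvalues of $A$ and $x^* = 0$):

```python
import numpy as np

rng = np.random.default_rng(4)
Q = rng.normal(size=(5, 5))
A = Q @ Q.T + 0.5 * np.eye(5)          # positive definite => f is gamma-well-conditioned

f = lambda x: 0.5 * x @ A @ x          # f(x) = 1/2 x'Ax, optimum x* = 0 with f(x*) = 0
grad = lambda x: A @ x

eig = np.linalg.eigvalsh(A)
alpha, beta = eig.min(), eig.max()
gamma = alpha / beta

x = rng.normal(size=5)
h0 = f(x)
T = 50
for _ in range(T):
    x = x - (1.0 / beta) * grad(x)     # x_{t+1} = x_t - eta * grad, eta = 1/beta
hT = f(x)
print(hT <= np.exp(-gamma * T) * h0)   # the bound h_T <= exp(-gamma T) h_0 holds
```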
Constrained case
For constrained optimization, gradient descent only needs one extra step: project back onto the convex set after each move.
Algorithm. Basic gradient descent
Input: $T$, $x_0 \in K$, $\{\eta_t\}$
for $t = 0, \cdots, T-1$ do
  $x_{t+1} = \Pi_K(x_t - \eta_t \nabla_t)$
end for
return $\bar{x} = \operatorname*{argmin}_{x_t} \{f(x_t)\}$
It admits a similar bound:
Th. Bound for GD, constrained case
Assume $f$ is $\gamma$-well-conditioned and take $\eta_t = \frac{1}{\beta}$. Then
$$h_T \le e^{-\gamma T / 4} h_0$$
Proof
First, by the definition of the projection,
$$x_{t+1} = \Pi_K(x_t - \eta_t \nabla_t) = \operatorname*{argmin}_{x \in K} \|x - x_t + \eta_t \nabla_t\| = \operatorname*{argmin}_{x \in K}\left( \nabla_t^\top (x - x_t) + \frac{1}{2\eta_t}\|x - x_t\|^2 \right)$$
Taking $\eta_t = \frac{1}{\beta}$ so that this matches the smoothness upper bound below, we get:
$$h_{t+1} - h_t = f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 = \min_{x \in K}\left( \nabla_t^\top (x - x_t) + \frac{\beta}{2}\|x - x_t\|^2 \right)$$
To get rid of the $\min$, substitute a particular $x$. Which one? Take a point on the segment between $x_t$ and $x^*$, i.e. $(1 - \mu)x_t + \mu x^*$ (which lies in $K$ by convexity):
$$\begin{aligned} h_{t+1} - h_t &\le \min_{x \in [x_t, x^*]}\left( \nabla_t^\top (x - x_t) + \frac{\beta}{2}\|x - x_t\|^2 \right) \le \mu \nabla_t^\top (x^* - x_t) + \frac{\mu^2 \beta}{2}\|x^* - x_t\|^2 \\ &\le -\mu h_t + \mu^2 \frac{\beta - \alpha}{2}\|x^* - x_t\|^2 && \alpha\text{-strong convexity} \\ &\le -\mu h_t + \mu^2 \frac{\beta - \alpha}{\alpha} h_t && \text{Lemma} \end{aligned}$$
Minimizing over $\mu \in [0, 1]$ gives
$$h_{t+1} \le h_t\left(1 - \frac{\alpha}{4(\beta - \alpha)}\right) \le h_t\left(1 - \frac{\gamma}{4}\right) \le h_t e^{-\gamma/4}$$
which completes the proof.
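A minimal sketch of the projected variant on a toy quadratic constrained to the unit Euclidean ball (my own example; `proj_ball` plays the role of $\Pi_K$):

```python
import numpy as np

def proj_ball(y, radius=1.0):
    n = np.linalg.norm(y)
    return y if n <= radius else y * (radius / n)

rng = np.random.default_rng(5)
Q = rng.normal(size=(5, 5))
A = Q @ Q.T + 0.5 * np.eye(5)
b = rng.normal(size=5) * 3.0

f = lambda x: 0.5 * x @ A @ x - b @ x           # strongly convex and smooth objective
grad = lambda x: A @ x - b
beta = np.linalg.eigvalsh(A).max()

x = np.zeros(5)                                  # x_0 in K
for _ in range(200):
    x = proj_ball(x - (1.0 / beta) * grad(x))    # x_{t+1} = Pi_K(x_t - eta * grad)
print(x, np.linalg.norm(x))                      # the iterate stays in the unit ball
```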
GD: Reductions to non-smooth and non-strongly convex functions
Now consider how to analyze gradient descent on convex functions that are not necessarily smooth or not necessarily strongly convex. The reduction approach below yields near-optimal convergence rates, and it is simple and widely applicable.
Case 1. reduction to smooth, non-strongly convex functions
Consider the case where $f$ is only $\beta$-smooth; note that, in fact, every convex function is $0$-strongly convex.
The trick: add a suitable strongly convex function, bending the original function into a more strongly convex one (see the code sketch after the analysis below).
Algorithm. Gradient descent, reduction to β-smooth functions
Input: $f$, $T$, $x_0 \in K$, parameter $\tilde{\alpha}$
Let $g(x) = f(x) + \frac{\tilde{\alpha}}{2}\|x - x_0\|^2$
Apply GD on $\left(g, T, \{\eta_t = \frac{1}{\beta}\}, x_0\right)$
Taking $\tilde{\alpha} = \frac{\beta \log T}{D^2 T}$, the algorithm achieves $h_T = O\left(\frac{\beta \log T}{T}\right)$; with extra work GD can be pushed to $O(\beta / T)$.
Since $f$ is $\beta$-smooth and $0$-strongly convex, and $\frac{\tilde{\alpha}}{2}\|x - x_0\|^2$ is $\tilde{\alpha}$-smooth and $\tilde{\alpha}$-strongly convex, the sum theorem above shows that $g$ is $(\beta + \tilde{\alpha})$-smooth and $\tilde{\alpha}$-strongly convex.
Therefore, for $f$:
$$\begin{aligned} h_t = f(x_t) - f(x^*) &= g(x_t) - g(x^*) + \frac{\tilde{\alpha}}{2}\left(\|x^* - x_0\|^2 - \|x_t - x_0\|^2\right) \\ &\le h_0^g \exp\left(-\frac{\tilde{\alpha} t}{4(\tilde{\alpha} + \beta)}\right) + \tilde{\alpha} D^2 && D \text{ is the diameter of the bounded set } K \\ &= O\left(\frac{\beta \log t}{t}\right) && \text{choosing } \tilde{\alpha} = \frac{\beta \log t}{D^2 t}, \text{ ignoring constants} \end{aligned}$$
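A sketch of this reduction on a toy $\beta$-smooth but not strongly convex least-squares objective (my own example; I use stepsize $1/(\beta + \tilde{\alpha})$, the smoothness constant of $g$, and nominally set $D = 1$):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(3, 5))                  # rank-deficient => f is smooth but not strongly convex
y = rng.normal(size=3)

f = lambda x: np.sum((M @ x - y) ** 2)       # f(x) = ||Mx - y||^2, minimum value 0 here
grad_f = lambda x: 2 * M.T @ (M @ x - y)
beta = 2 * np.linalg.eigvalsh(M.T @ M).max() # smoothness constant of f

x0 = np.zeros(5)
T = 500
alpha_t = beta * np.log(T) / T               # alpha_tilde = beta log T / (D^2 T), taking D = 1 (assumption)

grad_g = lambda x: grad_f(x) + alpha_t * (x - x0)   # g(x) = f(x) + alpha_tilde/2 * ||x - x0||^2

x = x0.copy()
for _ in range(T):
    x = x - grad_g(x) / (beta + alpha_t)     # GD on g with stepsize 1/(beta + alpha_tilde)
print(f(x), f(x0))                           # f decreases toward its minimum (0 here); the bias shrinks as T grows
```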
Case 2. reduction to strongly convex, non-smooth functions
Consider the case where $f$ is only $\alpha$-strongly convex; we make it smoother via a smoothing operation.
The simplest smoothing is local averaging. Denote the smoothed version of $f$ by $\hat{f}_\delta : \mathbb{R}^d \to \mathbb{R}$, let $B = \{v : \|v\| \le 1\}$, and average over a ball of radius $\delta$, written as an expectation:
$$\hat{f}_\delta(x) = \mathbb{E}_{v \sim U(B)}[f(x + \delta v)]$$
This smoothing has the following properties, assuming $f$ is $G$-Lipschitz:
If $f$ is $\alpha$-strongly convex, then $\hat{f}_\delta$ is also $\alpha$-strongly convex.
$\hat{f}_\delta$ is $(dG/\delta)$-smooth.
For any $x \in K$, $|\hat{f}_\delta(x) - f(x)| \le \delta G$.
Proofs:
The first follows from linear combinations of convex functions: writing $\hat{f}_\delta(x) = \int_v \Pr[v]\, f(x + \delta v)\, dv$, the function $f(x + \delta v)$ is $\alpha$-strongly convex in $x$ for every fixed $v$, so when checking strong convexity the constant factors out as $\alpha \int_v \Pr[v]\, dv = \alpha$; note that the same conclusion holds even for a non-uniform distribution.
The second uses Stokes' theorem: since the distribution is uniform, the integral can be converted to one over the sphere $S = \{v : \|v\| = 1\}$:
$$\mathbb{E}_{v \sim S}[f(x + \delta v)\, v] = \frac{\delta}{d}\, \nabla \hat{f}_\delta(x)$$
Smoothness then follows by bounding $\|\nabla \hat{f}_\delta(x) - \nabla \hat{f}_\delta(y)\|$ by $\frac{dG}{\delta}\|x - y\|$, with steps similar to the proof of the third property.
The third property:
$$\begin{aligned} |\hat{f}_\delta(x) - f(x)| &= \left|\mathbb{E}_{v \sim U(B)}[f(x + \delta v)] - f(x)\right| \\ &\le \mathbb{E}_{v \sim U(B)}\big[|f(x + \delta v) - f(x)|\big] && \text{Jensen} \\ &\le \mathbb{E}_{v \sim U(B)}[G\|\delta v\|] && \text{Lipschitz} \\ &\le G\delta \end{aligned}$$
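A Monte Carlo sketch of $\hat{f}_\delta$ and of the sphere identity above, on the (non-smooth) $\ell_1$ norm (my own sampling scheme; in practice a single-sample stochastic gradient estimate is used):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_sphere(d, n):
    v = rng.normal(size=(n, d))
    return v / np.linalg.norm(v, axis=1, keepdims=True)      # uniform on the unit sphere S

def sample_ball(d, n):
    r = rng.uniform(size=(n, 1)) ** (1.0 / d)                # radius law for the uniform ball distribution
    return sample_sphere(d, n) * r

f = lambda x: np.abs(x).sum(axis=-1)      # non-smooth convex test function: the l1 norm (sqrt(d)-Lipschitz)

d, delta, n = 4, 0.1, 100000
x = rng.normal(size=d)

f_hat = f(x + delta * sample_ball(d, n)).mean()              # hat f_delta(x) = E_{v ~ U(B)} f(x + delta v)
print(abs(f_hat - f(x)))                                     # bounded by delta * G

v = sample_sphere(d, n)
# grad hat f_delta(x) = (d / delta) * E_{v ~ S}[ f(x + delta v) v ]; subtracting the constant f(x)
# leaves the expectation unchanged (E[v] = 0) and reduces the variance of the estimate.
grad_est = (d / delta) * ((f(x + delta * v) - f(x))[:, None] * v).mean(axis=0)
print(grad_est, np.sign(x))                                  # approximately sign(x) when |x_i| > delta
```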
The resulting algorithm is:
Algorithm. Gradient descent, reduction to non-smooth functions
Input: $f$, $T$, $x_0 \in K$, parameter $\delta$
Let $\hat{f}_\delta(x) = \mathbb{E}_{v \sim U(B)}[f(x + \delta v)]$
Apply GD on $\left(\hat{f}_\delta, T, \{\eta_t = \delta\}, x_0\right)$
Taking $\delta = \frac{dG}{\alpha}\cdot\frac{\log t}{t}$, the algorithm achieves $h_T = O\left(\frac{G^2 d \log t}{\alpha t}\right)$.
We postpone the question of how to compute the gradient of $\hat{f}_\delta$; an estimation method is given later (see also the Monte Carlo sketch above).
First, $\hat{f}_\delta$ is $\frac{\alpha\delta}{dG}$-well-conditioned, so
$$\begin{aligned} h_t = f(x_t) - f(x^*) &\le \hat{f}_\delta(x_t) - \hat{f}_\delta(x^*) + 2\delta G && \text{since } |\hat{f}_\delta(x) - f(x)| \le \delta G \\ &\le h_0\, e^{-\frac{\alpha\delta t}{4dG}} + 2\delta G = O\left(\frac{d G^2 \log t}{\alpha t}\right) && \delta = \frac{dG}{\alpha}\cdot\frac{\log t}{t} \end{aligned}$$
Alternatively, running GD directly on the original $f$ still has a convergence guarantee, but we must take a weighted average of the iterates:
Take $\eta_t = \frac{2}{\alpha(t+1)}$ and let $x_1, \cdots, x_t$ be the resulting iterates; then
$$f\left(\frac{1}{t}\sum_{k=1}^{t} \frac{2k}{t+1} x_k\right) - f(x^*) \le \frac{2G^2}{\alpha(t+1)}$$
Proof omitted.
Case 3. reduction to general convex functions (non-smooth, non-strongly convex)
Applying the two reductions above together gives an $\tilde{O}(d/\sqrt{t})$ method, but it depends on the dimension $d$.
In the online convex optimization (OCO) setting, a more general $O(1/\sqrt{t})$ algorithm will be given.
Fenchel duality
Analysis of convex optimization problems where $f$ may be non-differentiable or take infinite values.