[Notes] Convex Optimization

Differentiation

Def. Gradient
$f: X \subseteq \mathbb{R}^N \to \mathbb{R}$ is differentiable. Then the gradient of $f$ at $x \in X$, denoted by $\nabla f(x)$, is defined by

$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}(x), \ldots, \frac{\partial f}{\partial x_N}(x) \right]^\top$$

Def. Hessian
$f: X \subseteq \mathbb{R}^N \to \mathbb{R}$ is twice differentiable. Then the Hessian of $f$ at $x \in X$, denoted by $\nabla^2 f(x)$, is defined by

$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N}(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_N \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_N^2}(x) \end{bmatrix}$$

Th. Fermat's theorem
$f: X \subseteq \mathbb{R}^N \to \mathbb{R}$ is differentiable. If $f$ admits a local extremum at $x^* \in X$, then $\nabla f(x^*) = 0$.

Convexity

Def. Convex set
A set $X \subseteq \mathbb{R}^N$ is said to be convex if for any $x, y \in X$ the segment $[x, y]$ lies in $X$, that is, $\{\alpha x + (1-\alpha) y : 0 \le \alpha \le 1\} \subseteq X$.

Th. Operations that preserve convexity

  • If $C_i$ is convex for every $i \in I$, then $\bigcap_{i \in I} C_i$ is also convex
  • If $C_1, C_2$ are convex, then their sum $C_1 + C_2 = \{x_1 + x_2 : x_1 \in C_1,\ x_2 \in C_2\}$ is also convex
  • If $C_1, C_2$ are convex, then their product $C_1 \times C_2 = \{(x_1, x_2) : x_1 \in C_1,\ x_2 \in C_2\}$ is also convex
  • Any projection of a convex set is also convex

Def. Convex hull
The convex hull $\operatorname{conv}(X)$ of a set $X \subseteq \mathbb{R}^N$ is the minimal convex set containing $X$, that is,

$$\operatorname{conv}(X) = \left\{ \sum_{i=1}^m \alpha_i x_i : m \ge 1,\ x_1, \ldots, x_m \in X,\ \alpha_i \ge 0,\ \sum_{i=1}^m \alpha_i = 1 \right\}$$

Def. Epigraph
The epigraph of $f: X \to \mathbb{R}$, denoted by $\operatorname{Epi} f$, is defined by $\{(x, y) : x \in X,\ y \ge f(x)\}$.

Def. Convex function
Let $X$ be a convex set. A function $f: X \to \mathbb{R}$ is said to be convex iff $\operatorname{Epi} f$ is convex, or equivalently, for all $x, y \in X$ and $\alpha \in [0, 1]$,

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y)$$

Moreover, $f$ is said to be strictly convex if the inequality is strict whenever $x \ne y$ and $\alpha \in (0, 1)$. $f$ is said to be (strictly) concave if $-f$ is (strictly) convex.

Th. Convex function characterized by first-order differential
$f: X \subseteq \mathbb{R}^N \to \mathbb{R}$ is differentiable. Then $f$ is convex iff $\operatorname{dom}(f)$ is convex and

$$\forall x, y \in \operatorname{dom}(f), \quad f(y) - f(x) \ge \nabla f(x)^\top (y - x)$$

Swapping $x$ and $y$ gives $f(x) - f(y) \ge \nabla f(y)^\top (x - y)$; adding the two inequalities yields

$$\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0$$

This says the gradient is monotone (the inner product above is nonnegative), which is another equivalent characterization of convexity.

Th. Convex function characterized by second-order differential
$f: X \subseteq \mathbb{R}^N \to \mathbb{R}$ is twice differentiable. Then $f$ is convex iff $\operatorname{dom}(f)$ is convex and its Hessian is positive semidefinite:

$$\forall x \in \operatorname{dom}(f), \quad \nabla^2 f(x) \succeq 0$$

  • A symmetric matrix is positive semidefinite iff all of its eigenvalues are nonnegative; $A \succeq B$ means that $A - B$ is positive semidefinite
  • If $f$ is scalar (e.g. $x \mapsto x^2$), then $f$ is convex iff $\forall x \in \operatorname{dom}(f),\ f''(x) \ge 0$
  • For example
    • Linear functions are both convex and concave
    • Any norm over a convex set $X$ is a convex function:
      $\|\alpha x + (1-\alpha) y\| \le \|\alpha x\| + \|(1-\alpha) y\| = \alpha \|x\| + (1-\alpha) \|y\|$
  • Composition rules can also be used to prove convexity (see the theorem below); a quick numerical check of the second-order condition is sketched right after this list
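
As a quick numerical illustration of the second-order test, here is a minimal sketch I added (the finite-difference Hessian, the test functions and the sampled points are all my own choices for the demo, not anything from the notes):

```python
import numpy as np

# Second-order test: f is convex iff its Hessian is positive semidefinite
# everywhere.  Here we only check a few random points numerically, so this
# is a sanity check, not a proof of convexity.
def hessian(f, x, eps=1e-4):
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + eps*I[i] + eps*I[j]) - f(x + eps*I[i] - eps*I[j])
                       - f(x - eps*I[i] + eps*I[j]) + f(x - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

f_convex = lambda x: np.sum(np.exp(x))      # sum of exponentials: Hessian diag(e^x) > 0
f_saddle = lambda x: x[0]**2 - x[1]**2      # indefinite Hessian: not convex

rng = np.random.default_rng(0)
for name, f in [("sum-exp", f_convex), ("saddle", f_saddle)]:
    x = rng.standard_normal(2)
    print(name, "min Hessian eigenvalue:", np.linalg.eigvalsh(hessian(f, x)).min())
```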

Th. Composition of convex/concave functions
Assume $h: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R}^N \to \mathbb{R}$ are twice differentiable. Define $f(x) = h(g(x))$ for $x \in \mathbb{R}^N$; then

  • $h$ is convex & non-decreasing, $g$ is convex $\Rightarrow$ $f$ is convex
  • $h$ is convex & non-increasing, $g$ is concave $\Rightarrow$ $f$ is convex
  • $h$ is concave & non-decreasing, $g$ is concave $\Rightarrow$ $f$ is concave
  • $h$ is concave & non-increasing, $g$ is convex $\Rightarrow$ $f$ is concave

Proof: it suffices to verify the case $N = 1$, since convexity (concavity) holds iff it holds along every line that intersects the domain.
Example: g could be any norm

Th. Pointwise maximum of convex functions
If $f_i$ is a convex function defined over a convex set $C$ for every $i \in I$, then $f(x) = \sup_{i \in I} f_i(x)$, $x \in C$, is a convex function.
Proof: $\operatorname{Epi} f = \bigcap_{i \in I} \operatorname{Epi} f_i$ is convex.

  • $f(x) = \max_{i \in I} (w_i^\top x + b_i)$ over a convex set is a convex function
  • The maximum eigenvalue $\lambda_{\max}(M)$ over the set of symmetric matrices is a convex function, since $\lambda_{\max}(M) = \sup_{\|x\|_2 = 1} x^\top M x$ is a supremum of linear functions $M \mapsto x^\top M x$
    More generally, let $\lambda_i(M)$ denote the $i$-th largest eigenvalue; then $M \mapsto \sum_{i=1}^k \lambda_i(M)$ and $M \mapsto -\sum_{i=n-k+1}^{n} \lambda_i(M) = \sum_{i=1}^{k} \lambda_i(-M)$ are both convex functions

Th. Partial infimum
Let $f$ be a convex function defined over a convex set $C \subseteq X \times Y$, and let $B \subseteq Y$ be a convex set. Then $A = \{x \in X : \exists y \in B,\ (x, y) \in C\}$ is a convex set if non-empty, and $g(x) = \inf_{y \in B} f(x, y)$, $x \in A$, is a convex function.
For example, the distance to a convex set $B$, $d(x) = \inf_{y \in B} \|x - y\|$, is a convex function.

Th. Jensen's inequality
Let $X$ be a random variable taking values in a convex set $C \subseteq \mathbb{R}^N$, and let $f$ be a convex function defined over $C$. Then $\mathbb{E}[X] \in C$, $\mathbb{E}[f(X)]$ is finite, and

$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$$

Sketch of proof: extend $f\!\left(\sum_i \alpha_i x_i\right) \le \sum_i \alpha_i f(x_i)$ with $\sum_i \alpha_i = 1$, where the $\alpha_i$ can be interpreted as probabilities, to arbitrary distributions.

Smoothness, strong convexity

Reference: [https://zhuanlan.zhihu.com/p/619288199]
Here we consider Lipschitz continuity of the gradient, i.e. bounds on the second-order behaviour of $f$.

Def. β-smooth
A function $f$ is said to be $\beta$-smooth if

$$\forall x, y \in \operatorname{dom}(f), \quad \|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\|$$

This is equivalent to each of the following statements:

  • $\frac{\beta}{2}\|x\|^2 - f(x)$ is a convex function
  • $\forall x, y \in \operatorname{dom}(f), \quad f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|y - x\|^2$
  • $\nabla^2 f(x) \preceq \beta I$

Proof / remarks
To show that $g(x) = \frac{\beta}{2}\|x\|^2 - f(x)$ is convex, consider $\langle \nabla g(x) - \nabla g(y),\ x - y \rangle \ge 0$; it follows from the Cauchy-Schwarz inequality together with the Lipschitz condition on $\nabla f$.
Intuitively, the fluctuations of $f$ cannot "overcome" the convexity of $\frac{\beta}{2}\|x\|^2$; that is, $f$ does not fluctuate too much, hence it is smooth.
For the second and third statements, substitute into $g(y) - g(x) \ge \nabla g(x)^\top (y - x)$ and $\nabla^2 g(x) \succeq 0$ respectively; their geometric meaning is discussed below.

Def. α-strongly convex
A function $f$ is said to be $\alpha$-strongly convex if any of the following equivalent statements holds:

  • $\forall x, y \in \operatorname{dom}(f), \quad \|\nabla f(x) - \nabla f(y)\| \ge \alpha \|x - y\|$
  • $f(x) - \frac{\alpha}{2}\|x\|^2$ is a convex function
  • $\forall x, y \in \operatorname{dom}(f), \quad f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}\|y - x\|^2$
  • $\nabla^2 f(x) \succeq \alpha I$

Def. γ-well-conditioned
A function $f$ is said to be $\gamma$-well-conditioned if it is both $\alpha$-strongly convex and $\beta$-smooth; the condition number of $f$ is defined as $\gamma = \alpha / \beta \le 1$.
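
For a concrete feel for these constants, here is a small numerical sketch I added (the quadratic, the norm and the random points are my own assumptions): for $f(x) = \frac{1}{2} x^\top A x$ with $A$ symmetric positive definite, $\nabla f(x) = Ax$ and $\nabla^2 f(x) = A$, so $\beta = \lambda_{\max}(A)$, $\alpha = \lambda_{\min}(A)$ and $\gamma = \lambda_{\min}(A)/\lambda_{\max}(A)$.

```python
import numpy as np

# For f(x) = 0.5 x^T A x with A symmetric positive definite:
#   beta  = largest eigenvalue of A   (smoothness)
#   alpha = smallest eigenvalue of A  (strong convexity)
#   gamma = alpha / beta              (condition number as defined above)
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 0.1 * np.eye(4)
eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs[0], eigs[-1]
print("alpha =", alpha, "beta =", beta, "gamma =", alpha / beta)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

# Spot-check the smoothness and strong-convexity inequalities at random pairs.
for _ in range(3):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    d = np.linalg.norm(x - y)
    assert np.linalg.norm(grad(x) - grad(y)) <= beta * d + 1e-9
    assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * alpha * d**2 - 1e-9
```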

Th. Linear combination of two convex functions

  • For the sum of two convex functions:
    • If $f$ is $\alpha_1$-strongly convex and $g$ is $\alpha_2$-strongly convex, then $f + g$ is $(\alpha_1 + \alpha_2)$-strongly convex
    • If $f$ is $\beta_1$-smooth and $g$ is $\beta_2$-smooth, then $f + g$ is $(\beta_1 + \beta_2)$-smooth
  • For multiplication by a scalar $k > 0$:
    • If $f$ is $\alpha$-strongly convex, then $kf$ is $(k\alpha)$-strongly convex
    • If $f$ is $\beta$-smooth, then $kf$ is $(k\beta)$-smooth

Proof: use the gradient monotonicity $\langle \nabla f(x) - \nabla f(y),\ x - y \rangle \ge 0$ of convex functions, together with the convexity of $\frac{\beta}{2}\|x\|^2 - f(x)$ and of $f(x) - \frac{\alpha}{2}\|x\|^2$.

Projections onto convex sets

The algorithms below involve projection onto a convex set; the projection of $y$ onto a convex set $K$ is defined as

$$\Pi_K(y) \triangleq \operatorname*{arg\,min}_{x \in K} \|x - y\|$$

It can be shown that the projection is always unique; the projection also has an important property:
Th. Pythagorean theorem
Let $K \subseteq \mathbb{R}^d$ be a convex set, $y \in \mathbb{R}^d$, and $x = \Pi_K(y)$. Then for any $z \in K$, $\|y - z\| \ge \|x - z\|$.
That is, for any point of the convex set, its distance to the projection of $y$ is no greater than its distance to $y$ itself.
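
A minimal sketch of a projection in code (my own example; I use the Euclidean ball, where $\Pi_K$ has a closed form, and spot-check the Pythagorean property at random points):

```python
import numpy as np

# Projection onto the Euclidean ball K = {x : ||x|| <= r}: if y is outside,
# scale it back to the boundary; otherwise y is already its own projection.
def project_ball(y, r=1.0):
    n = np.linalg.norm(y)
    return y if n <= r else (r / n) * y

rng = np.random.default_rng(0)
y = 3.0 * rng.standard_normal(5)          # typically lies outside the unit ball
x = project_ball(y)

# Pythagorean property: for every z in K, ||y - z|| >= ||Pi_K(y) - z||.
for _ in range(5):
    z = project_ball(rng.standard_normal(5))      # some point inside K
    assert np.linalg.norm(y - z) >= np.linalg.norm(x - z) - 1e-12
print("projection norm:", np.linalg.norm(x))
```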

Constrained optimization 带约束优化

Def. Constrained optimization problem
Let $X \subseteq \mathbb{R}^N$ and $f, g_i: X \to \mathbb{R}$, $i \in [m]$. A constrained optimization problem (also called the primal problem) has the form

$$\min_{x \in X} f(x) \quad \text{subject to: } g_i(x) \le 0,\ i \in [m]$$

Write $p^* = \inf_{x \in X,\, g(x) \le 0} f(x)$. Note that so far we assume no convexity at all; an equality constraint $g = 0$ can be encoded as $g \le 0,\ -g \le 0$.

Dual problem and saddle point

To solve such problems, we first introduce the Lagrange function, which incorporates the constraints as non-positive terms, and then pass to the dual problem.

Def. Lagrange function
For the constrained optimization problem, define the Lagrange function as

$$\forall x \in X,\ \forall \alpha \ge 0, \quad \mathcal{L}(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x)$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ are called the dual variables.
For an equality constraint $g = 0$, the corresponding multiplier $\alpha = \alpha^+ - \alpha^-$ need not be non-negative (but the theorems below require $g$ and $-g$ to be convex simultaneously, so $g$ must be affine, i.e. of the form $w^\top x + b$).
Note that $p^* = \inf_{x} \sup_{\alpha \ge 0} \mathcal{L}(x, \alpha)$: whenever $x$ violates some constraint, the supremum over $\alpha$ is $+\infty$, which is exactly how the constraints are encoded.
Now for the interesting part: we can construct a concave function, called the dual function.

Def. Dual function
For the constrained optimization problem, define the dual function as

$$\forall \alpha \ge 0, \quad F(\alpha) = \inf_{x \in X} \mathcal{L}(x, \alpha) = \inf_{x \in X} \left( f(x) + \sum_{i=1}^m \alpha_i g_i(x) \right)$$

It is concave, because $\mathcal{L}$ is an affine function of $\alpha$ and the pointwise infimum preserves concavity.
Also note that for any $\alpha \ge 0$, $F(\alpha) \le p^*$.
We now define the dual problem.

Def. Dual problem
For the constrained optimization problem, define the dual problem as

$$\max_{\alpha} F(\alpha) \quad \text{subject to: } \alpha \ge 0$$

The dual problem is a convex optimization problem (maximizing a concave function); denote its optimal value by $d^*$. From the above, $d^* \le p^*$, that is,

$$d^* = \sup_{\alpha \ge 0} \inf_{x} \mathcal{L}(x, \alpha) \le \inf_{x} \sup_{\alpha \ge 0} \mathcal{L}(x, \alpha) = p^*$$

This is called weak duality; when equality holds it is called strong duality.
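
As a concrete sanity check (a toy instance I added, not from the book): consider $\min_{x \in \mathbb{R}} x^2$ subject to $1 - x \le 0$. Then $\mathcal{L}(x, \alpha) = x^2 + \alpha(1 - x)$ and

$$F(\alpha) = \inf_x \left( x^2 + \alpha(1 - x) \right) = \alpha - \frac{\alpha^2}{4} \qquad (\text{the infimum is attained at } x = \alpha/2)$$

which is indeed concave; maximizing over $\alpha \ge 0$ gives $\alpha^* = 2$ and $d^* = 1$. The primal optimum is $x^* = 1$ with $p^* = 1$, so strong duality $d^* = p^*$ holds here, as expected since Slater's condition is satisfied (e.g. by $\bar{x} = 2$).
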
Next we will show:

When a convex optimization problem satisfies a constraint qualification (Slater's condition), which is a sufficient condition, we have $d^* = p^*$; moreover, $x^*$ is a solution iff there exists $\alpha^* \ge 0$ such that $(x^*, \alpha^*)$ is a saddle point of the Lagrange function.

Def. Constraint qualification (Slater's condition)
Assume the interior of $X$ is non-empty: $\operatorname{int}(X) \ne \emptyset$.

  • Strong constraint qualification (Slater's condition):

$$\exists\, \bar{x} \in \operatorname{int}(X): \ \forall i \in [m],\ g_i(\bar{x}) < 0$$

  • Weak constraint qualification (weak Slater's condition):

$$\exists\, \bar{x} \in \operatorname{int}(X): \ \forall i \in [m],\ \big( g_i(\bar{x}) < 0 \big) \vee \big( g_i(\bar{x}) = 0 \wedge g_i \text{ affine} \big)$$

(Is this condition asserting that a solution exists?)
Based on Slater's condition, we now state that a saddle point of the Lagrange function is a necessary and sufficient condition for a solution of the constrained optimization problem.

Th. Saddle point - sufficient condition
For the constrained optimization problem, if its Lagrange function admits a saddle point $(x^*, \alpha^*)$, i.e.

$$\forall x \in X,\ \forall \alpha \ge 0, \quad \mathcal{L}(x^*, \alpha) \le \mathcal{L}(x^*, \alpha^*) \le \mathcal{L}(x, \alpha^*)$$

then $x^*$ is a solution of the problem, i.e. $f(x^*) = p^*$.

Th. Saddle point - necessary condition
Assume $f$ and $g_i$, $i \in [m]$, are convex functions.

  • If Slater's condition holds, then for any solution $x^*$ of the constrained optimization problem there exists $\alpha^* \ge 0$ such that $(x^*, \alpha^*)$ is a saddle point of the Lagrange function
  • If the weak Slater's condition holds and $f, g_i$ are differentiable, then for any solution $x^*$ of the constrained optimization problem there exists $\alpha^* \ge 0$ such that $(x^*, \alpha^*)$ is a saddle point of the Lagrange function

Since the book does not provide a proof of necessity, and the sufficiency proof is not hard but not particularly elegant, I will not copy them here; instead I give my own line of reasoning (which may have flaws):

Return to the original inequality:

$$d^* = \sup_{\alpha \ge 0} \inf_{x} \mathcal{L}(x, \alpha) \le \inf_{x} \sup_{\alpha \ge 0} \mathcal{L}(x, \alpha) = p^*$$

Define $x^*$ to be a minimizer of $\sup_{\alpha \ge 0} \mathcal{L}(x, \alpha)$, i.e. $p^* = \sup_{\alpha \ge 0} \mathcal{L}(x^*, \alpha) \le \sup_{\alpha \ge 0} \mathcal{L}(x, \alpha)$ for all $x$ (there may be more than one such $x^*$).
Define $\alpha^*$ to be a maximizer of $\inf_{x} \mathcal{L}(x, \alpha)$, i.e. $d^* = \inf_{x} \mathcal{L}(x, \alpha^*) \ge \inf_{x} \mathcal{L}(x, \alpha)$ for all $\alpha \ge 0$ (there may be more than one such $\alpha^*$).
Uniqueness of the saddle point is related to strict convexity (without strict convexity there may be a "flat" region of optima); I leave this for later.
Claim:

The following four statements are equivalent: $\mathcal{L}$ has a saddle point; $(x^*, \alpha^*)$ is a saddle point; $p^* = d^*$; $p^* = d^* = \mathcal{L}(x^*, \alpha^*)$.

Denote them by A, B, C, D respectively; clearly B $\Rightarrow$ A and D $\Rightarrow$ C.

Proof of A $\Rightarrow$ B and B $\Rightarrow$ C $\wedge$ D:
Suppose a saddle point exists and denote it by $(x', \alpha')$; then $p^* \le \sup_{\alpha \ge 0} \mathcal{L}(x', \alpha) = \mathcal{L}(x', \alpha') = \inf_x \mathcal{L}(x, \alpha') \le d^* \le p^*$, so every inequality in this chain is an equality. In particular $\inf_x \mathcal{L}(x, \alpha') = d^*$, so by the definition of $\alpha^*$ we may take $\alpha^* \leftarrow \alpha'$; likewise $\sup_{\alpha \ge 0} \mathcal{L}(x', \alpha) = p^*$, so we may take $x^* \leftarrow x'$.
Now $(x^*, \alpha^*)$ is a saddle point, i.e. $\forall x, \alpha \ge 0$, $\mathcal{L}(x^*, \alpha) \le \mathcal{L}(x^*, \alpha^*) \le \mathcal{L}(x, \alpha^*)$; hence $p^* \le \sup_{\alpha \ge 0} \mathcal{L}(x^*, \alpha) = \mathcal{L}(x^*, \alpha^*) = \inf_x \mathcal{L}(x, \alpha^*) \le d^*$, which together with weak duality $d^* \le p^*$ gives C and D.

Proof of C $\Rightarrow$ B $\wedge$ D:
If $p^* = d^*$, i.e. $\sup_{\alpha \ge 0} \mathcal{L}(x^*, \alpha) = \inf_x \mathcal{L}(x, \alpha^*)$, then since $\sup_{\alpha \ge 0} \mathcal{L}(x^*, \alpha) \ge \mathcal{L}(x^*, \alpha^*) \ge \inf_x \mathcal{L}(x, \alpha^*)$, all three quantities are equal. Hence $\forall x, \alpha \ge 0$, $\mathcal{L}(x^*, \alpha) \le \sup_{\alpha \ge 0} \mathcal{L}(x^*, \alpha) = \mathcal{L}(x^*, \alpha^*) = \inf_x \mathcal{L}(x, \alpha^*) \le \mathcal{L}(x, \alpha^*)$, so $(x^*, \alpha^*)$ is a saddle point, which also gives D.

This completes the proof. Satisfying!

KKT conditions

Lagrangian version

If the constrained optimization problem additionally satisfies convexity, we can handle it with a single theorem: KKT.

Th. Karush-Kuhn-Tucker's theorem
Assume $f, g_i: X \to \mathbb{R}$, $i \in [m]$, are convex and differentiable, and that Slater's condition holds. Consider the constrained optimization problem

$$\min_{x \in X,\ g(x) \le 0} f(x)$$

with Lagrange function $\mathcal{L}(x, \alpha) = f(x) + \alpha^\top g(x)$, $\alpha \ge 0$.
Then $\bar{x}$ is a solution of the problem iff there exists $\bar{\alpha} \ge 0$ such that

$$\begin{aligned} \nabla_x \mathcal{L}(\bar{x}, \bar{\alpha}) &= \nabla_x f(\bar{x}) + \bar{\alpha}^\top \nabla_x g(\bar{x}) = 0 \\ \nabla_\alpha \mathcal{L}(\bar{x}, \bar{\alpha}) &= g(\bar{x}) \le 0 \\ \bar{\alpha}^\top g(\bar{x}) &= 0 \end{aligned} \qquad \text{(KKT conditions)}$$

The last two are called the complementarity conditions: for every $i \in [m]$, $\bar{\alpha}_i \ge 0$, $g_i(\bar{x}) \le 0$, and $\bar{\alpha}_i g_i(\bar{x}) = 0$.

Proof of both directions:

Necessity: if $\bar{x}$ is a solution, then there exists $\bar{\alpha} \ge 0$ such that $(\bar{x}, \bar{\alpha})$ is a saddle point, from which the KKT conditions follow: the first condition is the saddle-point condition $\bar{x} \in \arg\min_x \mathcal{L}(x, \bar{\alpha})$; for the second and third,

$$\forall \alpha \ge 0,\ \mathcal{L}(\bar{x}, \alpha) \le \mathcal{L}(\bar{x}, \bar{\alpha}) \iff \alpha^\top g(\bar{x}) \le \bar{\alpha}^\top g(\bar{x}) \implies \begin{cases} g(\bar{x}) \le 0 & (\alpha \to +\infty) \\ \bar{\alpha}^\top g(\bar{x}) \ge 0 & (\alpha = 0) \end{cases} \implies \bar{\alpha}^\top g(\bar{x}) = 0$$

Sufficiency: if the KKT conditions hold, then for any $x$ with $g(x) \le 0$,

$$\begin{aligned} f(x) - f(\bar{x}) &\ge \nabla_x f(\bar{x})^\top (x - \bar{x}) && \text{(convexity of } f) \\ &= -\bar{\alpha}^\top \nabla_x g(\bar{x})(x - \bar{x}) && \text{(first condition)} \\ &\ge -\bar{\alpha}^\top \big( g(x) - g(\bar{x}) \big) && \text{(convexity of } g_i,\ \bar{\alpha} \ge 0) \\ &= -\bar{\alpha}^\top g(x) \ge 0 && \text{(third condition, then } g(x) \le 0) \end{aligned}$$
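
A quick worked instance of these conditions (my own toy example): minimize $f(x) = x_1^2 + x_2^2$ subject to $g(x) = 1 - x_1 - x_2 \le 0$. The KKT system reads

$$\begin{pmatrix} 2x_1 - \alpha \\ 2x_2 - \alpha \end{pmatrix} = 0, \qquad 1 - x_1 - x_2 \le 0, \qquad \alpha(1 - x_1 - x_2) = 0$$

If $\alpha = 0$ then $x = 0$, which is infeasible; so $\alpha > 0$, the constraint is active, $x_1 = x_2 = \alpha/2$ and $x_1 + x_2 = 1$, giving $\bar{x} = (\tfrac{1}{2}, \tfrac{1}{2})$, $\bar{\alpha} = 1$ and $f(\bar{x}) = \tfrac{1}{2}$.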

Gradient descent version

We can also present the KKT theorem from another angle, which also gives another way of solving constrained optimization problems: gradient descent.
Since every $g_i$ is convex, the set $K = \{x : g(x) \le 0\}$ is clearly convex; the constraints therefore simply restrict $x$ to the convex set $K$, and we have:

Th. Karush-Kuhn-Tucker's theorem, gradient descent version
Assume $f$ is convex and differentiable and $K$ is a convex set. For the constrained optimization problem

$$\min_{x \in K} f(x)$$

$x^*$ is a solution iff

$$\forall y \in K, \quad \nabla f(x^*)^\top (y - x^*) \ge 0$$

The idea: the negative gradient is the direction in which $f$ decreases; if it has a positive component along some feasible direction $y - x^*$ (i.e. $\nabla f(x^*)^\top (y - x^*) < 0$), then moving along that direction decreases $f$, so $x^*$ cannot be optimal.
This theorem provides the foundation for our gradient descent algorithms.
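
For example (a one-dimensional check I added): for $f(x) = x^2$ over $K = [1, 2]$, at $x^* = 1$ we have $f'(x^*) = 2 > 0$ and $y - x^* \ge 0$ for every $y \in K$, so $f'(x^*)(y - x^*) \ge 0$ and the condition holds; at an interior point the condition would force $f'(x) = 0$, which is impossible on $[1, 2]$, so the minimizer must lie on the boundary.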

Gradient descent

Unconstrained case

The gradient descent (GD) algorithm for unconstrained convex optimization.
Notation: $\nabla_t = \nabla f(x_t)$, $h_t = f(x_t) - f(x^*)$, $d_t = \|x_t - x^*\|$.

Algorithm. Gradient descent
  Input: $T$, $x_0$, $\{\eta_t\}$
  for $t = 0, \ldots, T-1$ do
    $x_{t+1} = x_t - \eta_t \nabla_t$
  end for
  return $\bar{x} = \operatorname*{arg\,min}_{x_t} \{f(x_t)\}$

A sensible choice of $\eta_t$ determines the efficiency of the algorithm; taking the Polyak stepsize $\eta_t = h_t / \|\nabla_t\|^2$, we have:

Th. Bound for GD with Polyak stepsize
Assume $\|\nabla_t\| \le G$; then

$$f(\bar{x}) - f(x^*) = \min_{0 \le t \le T} \{h_t\} \le \min\left\{ \frac{G d_0}{\sqrt{T}},\ \frac{2\beta d_0^2}{T},\ \frac{3G^2}{\alpha T},\ \beta d_0^2 \left(1 - \frac{\gamma}{4}\right)^{T} \right\}$$

The proof of this theorem rests on the inequality $d_{t+1}^2 \le d_t^2 - h_t^2 / \|\nabla_t\|^2$, which I find a bit unsatisfying, because bounding via the triangle inequality always goes in the wrong direction...
To prove the efficiency of such algorithms we can track "potential" functions: for instance, the decrease of the potential $h_t = f(x_t) - f(x^*)$ via the difference $h_{t+1} - h_t$ and the gradient norm $\|\nabla_t\|$, or the decrease of the distance to the optimum $d_t = \|x_t - x^*\|$ via $d_{t+1} - d_t$.
We first record a lemma, obtained by substituting into the smoothness and strong convexity inequalities; these bounds will be used repeatedly below:

$$\frac{\alpha}{2} d_t^2 \le h_t \le \frac{\beta}{2} d_t^2, \qquad \frac{1}{2\beta}\|\nabla_t\|^2 \le h_t \le \frac{1}{2\alpha}\|\nabla_t\|^2$$

For the update $x_{t+1} = x_t - \eta_t \nabla_t$: if we work with $d_{t+1} - d_t$ (subtract $x^*$ from both sides and take norms), the triangle inequality bounds things in the wrong direction and the proof stalls there; if instead we assume $d_{t+1}^2 \le d_t^2 - h_t^2 / \|\nabla_t\|^2$, the bound above follows (note that $f(\bar{x}) - f(x^*) \le \frac{1}{T}\sum_t h_t$ can then be bounded further).
However, we can instead start from $h_{t+1} - h_t$:

$$\begin{aligned} h_{t+1} - h_t = f(x_{t+1}) - f(x_t) &\le \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 \\ &= -\eta_t \|\nabla_t\|^2 + \frac{\beta}{2}\eta_t^2 \|\nabla_t\|^2 = -\frac{1}{2\beta}\|\nabla_t\|^2 && \left( \text{taking } \eta_t = \tfrac{1}{\beta} \right) \\ &\le -\frac{\alpha}{\beta} h_t = -\gamma h_t && \text{(by the Lemma)} \end{aligned}$$

Hence $h_T \le (1 - \gamma) h_{T-1} \le \cdots \le (1 - \gamma)^T h_0 \le e^{-\gamma T} h_0$, which is a rather nice convergence guarantee; therefore:

Th. Bound for GD, unconstrained case
Assume $f$ is $\gamma$-well-conditioned and set $\eta_t = \frac{1}{\beta}$; then

$$h_T \le e^{-\gamma T} h_0$$
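
A minimal sketch of this guarantee in code (my own toy setup: a well-conditioned quadratic with minimizer $x^* = 0$, fixed step $1/\beta$):

```python
import numpy as np

# Gradient descent with fixed step 1/beta on f(x) = 0.5 x^T A x, whose minimizer
# is x* = 0 and optimal value 0, so h_t = f(x_t).  We check h_T <= exp(-gamma*T)*h_0.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
A = M @ M.T + np.eye(10)
eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs[0], eigs[-1]
gamma = alpha / beta

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(10)
h0, T = f(x), 200
for t in range(T):
    x = x - (1.0 / beta) * grad(x)

print("h_T =", f(x), " theoretical bound =", np.exp(-gamma * T) * h0)
assert f(x) <= np.exp(-gamma * T) * h0
```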

Constrained case

For constrained optimization, gradient descent simply projects back onto the convex set after each step.

Algorithm. Basic gradient descent
  Input: $T$, $x_0 \in K$, $\{\eta_t\}$
  for $t = 0, \ldots, T-1$ do
    $x_{t+1} = \Pi_K(x_t - \eta_t \nabla_t)$
  end for
  return $\bar{x} = \operatorname*{arg\,min}_{x_t} \{f(x_t)\}$

It enjoys a similar bound:
Th. Bound for GD, constrained case
Assume $f$ is $\gamma$-well-conditioned and set $\eta_t = \frac{1}{\beta}$; then

$$h_T \le e^{-\gamma T / 4} h_0$$

Proof
First, by the definition of the projection,

$$x_{t+1} = \Pi_K(x_t - \eta_t \nabla_t) = \operatorname*{arg\,min}_{x \in K} \|x - x_t + \eta_t \nabla_t\| = \operatorname*{arg\,min}_{x \in K} \left( \nabla_t^\top (x - x_t) + \frac{1}{2\eta_t}\|x - x_t\|^2 \right)$$

In view of the following inequality we set $\eta_t = \frac{1}{\beta}$, so that:

$$h_{t+1} - h_t = f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \frac{\beta}{2}\|x_{t+1} - x_t\|^2 = \min_{x \in K} \left( \nabla_t^\top (x - x_t) + \frac{\beta}{2}\|x - x_t\|^2 \right)$$

To get rid of the min we may plug in any particular $x \in K$; which one? Consider a point on the segment between $x_t$ and $x^*$, i.e. $x = (1 - \mu) x_t + \mu x^*$ with $\mu \in [0, 1]$:

$$\begin{aligned} h_{t+1} - h_t &\le \min_{x \in [x_t, x^*]} \left( \nabla_t^\top (x - x_t) + \frac{\beta}{2}\|x - x_t\|^2 \right) \\ &\le \mu \nabla_t^\top (x^* - x_t) + \frac{\mu^2 \beta}{2}\|x^* - x_t\|^2 \\ &\le -\mu h_t + \frac{\mu^2 \beta}{\alpha} \cdot \frac{\alpha}{2}\|x^* - x_t\|^2 && \text{(convexity of } f) \\ &\le -\mu h_t + \frac{\mu^2 \beta}{\alpha} h_t && \text{(Lemma, using $\alpha$-strong convexity)} \end{aligned}$$

Optimizing over $\mu \in [0, 1]$ (the minimum is attained at $\mu = \frac{\alpha}{2\beta}$) gives

$$h_{t+1} - h_t \le -\frac{\gamma}{4} h_t, \qquad \text{i.e.}\quad h_{t+1} \le \left(1 - \frac{\gamma}{4}\right) h_t \le e^{-\gamma/4}\, h_t$$

which completes the proof.
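
A minimal sketch of projected gradient descent in code (again my own toy setup: a quadratic whose unconstrained minimizer lies outside the feasible ball, so that the constraint actually binds):

```python
import numpy as np

# Projected gradient descent: take a gradient step, then project back onto K.
# K is the Euclidean ball of radius 1 around (2,...,2); the unconstrained
# minimizer of f (the origin) lies outside K, so the solution sits on the boundary.
def project_ball(y, center, r=1.0):
    d = y - center
    n = np.linalg.norm(d)
    return y if n <= r else center + (r / n) * d

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)
beta = np.linalg.eigvalsh(A)[-1]
center = 2.0 * np.ones(5)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = center.copy()                         # feasible starting point
for t in range(500):
    x = project_ball(x - (1.0 / beta) * grad(x), center)

# The iterate should (approximately) satisfy the optimality condition of the
# previous section: grad f(x)^T (y - x) >= 0 for feasible y.
y = project_ball(rng.standard_normal(5), center)
print("f(x_T) =", f(x), " grad(x_T)^T (y - x_T) =", grad(x) @ (y - x))
```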

GD: Reductions to non-smooth and non-strongly convex functions

We now consider how to analyze gradient descent for convex functions that are not necessarily smooth or not necessarily strongly convex. The reduction approach below yields near-optimal convergence rates and is both simple and general.

Case 1. reduction to smooth, non-strongly convex functions

Consider the case where $f$ is only $\beta$-smooth; note that every convex function is 0-strongly convex.
The trick is to add a suitable strongly convex function, bending the original function into a (slightly) strongly convex one.

Algorithm. Gradient descent, reduction to $\beta$-smooth functions
  Input: $f$, $T$, $x_0 \in K$, parameter $\tilde{\alpha}$
  Let $g(x) = f(x) + \frac{\tilde{\alpha}}{2}\|x - x_0\|^2$
  Apply GD on $(g, T, \{\eta_t = \frac{1}{\beta}\}, x_0)$

Taking $\tilde{\alpha} = \frac{\beta \log T}{D^2 T}$, this algorithm achieves $h_T = O\!\left(\frac{\beta \log T}{T}\right)$; with further modifications GD can achieve $O(\beta / T)$.

Since $f$ is $\beta$-smooth and 0-strongly convex, and the added term $\frac{\tilde{\alpha}}{2}\|x - x_0\|^2$ is $\tilde{\alpha}$-smooth and $\tilde{\alpha}$-strongly convex, the summation theorem above shows that $g$ is $(\beta + \tilde{\alpha})$-smooth and $\tilde{\alpha}$-strongly convex.
Therefore, for $f$ we have:

$$\begin{aligned} h_t = f(x_t) - f(x^*) &= g(x_t) - g(x^*) + \frac{\tilde{\alpha}}{2}\left( \|x^* - x_0\|^2 - \|x_t - x_0\|^2 \right) \\ &\le h_0^g \exp\!\left( -\frac{\tilde{\alpha}\, t}{4(\tilde{\alpha} + \beta)} \right) + \tilde{\alpha} D^2 && (D \text{ is the diameter of the bounded set } K) \\ &= O\!\left( \frac{\beta \log t}{t} \right) && \left( \text{choosing } \tilde{\alpha} = \frac{\beta \log t}{D^2 t},\ \text{ignoring constants} \right) \end{aligned}$$
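
A rough sketch of this reduction in code (my own toy instance: a logistic-type loss, which is smooth but not strongly convex; the constants are not tuned and I drop the projection onto a bounded set for brevity):

```python
import numpy as np

# Reduction: regularize f with a small strongly convex term centered at x0,
# then run plain GD on the regularized objective g.
rng = np.random.default_rng(3)
w = rng.standard_normal(5)

f = lambda x: np.log1p(np.exp(-w @ x))                 # smooth, convex, not strongly convex
grad_f = lambda x: -w / (1.0 + np.exp(w @ x))          # derivative of log(1 + e^{-w.x})

x0 = np.zeros(5)
T = 1000
beta = 0.25 * (w @ w)                                  # smoothness constant of the logistic loss
alpha_t = beta * np.log(T) / T                         # ~ beta*log(T)/(D^2 T), with D treated as 1

grad_g = lambda x: grad_f(x) + alpha_t * (x - x0)      # gradient of g(x) = f(x) + alpha_t/2 ||x - x0||^2

x = x0.copy()
for t in range(T):
    x = x - grad_g(x) / (beta + alpha_t)               # step size 1 / (smoothness of g)

print("f(x_0) =", f(x0), " f(x_T) =", f(x))
```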

Case 2. reduction to strongly convex, non-smooth functions

Now consider the case where $f$ is only $\alpha$-strongly convex; we make it smoother via a smoothing operation.
The simplest smoothing operation is local averaging: write the smoothed function as $\hat{f}_\delta: \mathbb{R}^d \to \mathbb{R}$, let $B = \{v : \|v\| \le 1\}$ be the unit ball, and average over a ball of radius $\delta$, expressed as an expectation:

$$\hat{f}_\delta(x) = \mathbb{E}_{v \sim U(B)}\left[ f(x + \delta v) \right]$$

Assuming $f$ is $G$-Lipschitz, this smoothing has the following properties:

  • If $f$ is $\alpha$-strongly convex, then $\hat{f}_\delta$ is also $\alpha$-strongly convex
  • $\hat{f}_\delta$ is $\frac{dG}{\delta}$-smooth
  • For any $x \in K$, $|\hat{f}_\delta(x) - f(x)| \le \delta G$

Proof:
The first property uses linear combinations of convex functions: writing $\hat{f}_\delta(x) = \int_v \Pr[v]\, f(x + \delta v)\, dv$, since for every $v$ the function $x \mapsto f(x + \delta v)$ is $\alpha$-strongly convex, the strong convexity constants can be pulled out to give $\alpha \int_v \Pr[v]\, dv = \alpha$; note that the conclusion still holds even if the distribution is not uniform.
The second property uses Stokes' theorem: because the distribution is uniform, the gradient can be written as an integral over the sphere $S = \{v : \|v\| = 1\}$,

$$\mathbb{E}_{v \sim U(S)}\left[ f(x + \delta v)\, v \right] = \frac{\delta}{d} \nabla \hat{f}_\delta(x)$$

Then, using the Lipschitz property $|f(x) - f(y)| \le G\|x - y\|$, smoothness follows; the steps are similar to the proof of the third property.
The proof of the third property:

$$|\hat{f}_\delta(x) - f(x)| = \left| \mathbb{E}_{v \sim U(B)}[f(x + \delta v)] - f(x) \right| \le \mathbb{E}_{v \sim U(B)}\left[ |f(x + \delta v) - f(x)| \right] \le \mathbb{E}_{v \sim U(B)}\left[ G \delta \|v\| \right] \le G\delta \qquad \text{(Jensen, then Lipschitz)}$$

The algorithm is then:

Algorithm. Gradient descent, reduction to non-smooth functions
  Input: $f$, $T$, $x_0 \in K$, parameter $\delta$
  Let $\hat{f}_\delta(x) = \mathbb{E}_{v \sim U(B)}[f(x + \delta v)]$
  Apply GD on $(\hat{f}_\delta, T, \{\eta_t = \frac{\delta}{dG}\}, x_0)$

Taking $\delta = \frac{dG}{\alpha} \cdot \frac{\log t}{t}$, the algorithm achieves $h_T = O\!\left( \frac{G^2 d \log t}{\alpha t} \right)$.

For now we set aside the question of how to compute the gradient of $\hat{f}_\delta$; an estimation method will be given later.
First, $\hat{f}_\delta$ is $\frac{\alpha \delta}{dG}$-well-conditioned, so

$$\begin{aligned} h_t = f(x_t) - f(x^*) &\le \hat{f}_\delta(x_t) - \hat{f}_\delta(x^*) + 2\delta G && \left( \text{since } |\hat{f}_\delta(x) - f(x)| \le \delta G \right) \\ &\le h_0\, e^{-\frac{\alpha \delta t}{4 d G}} + 2\delta G \\ &= O\!\left( \frac{d G^2 \log t}{\alpha t} \right) && \left( \delta = \frac{dG}{\alpha} \cdot \frac{\log t}{t} \right) \end{aligned}$$
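
Since the notes postpone the actual gradient computation, here is a rough Monte Carlo sketch I added of how $\hat{f}_\delta$ and its gradient could be estimated by sampling (the test function $f(x) = \|x\|_1$ and all constants are my own choices; this only illustrates the two identities above, not the estimator referred to later):

```python
import numpy as np

# Monte Carlo estimates of ball smoothing:
#   f_hat_delta(x)      ~ mean of f(x + delta*v),               v uniform in the unit ball B
#   grad f_hat_delta(x) ~ (d/delta) * mean of f(x + delta*u)*u, u uniform on the sphere S
# Subtracting the constant f(x) inside the gradient estimate leaves its
# expectation unchanged (E[u] = 0) but greatly reduces the variance.
rng = np.random.default_rng(4)
d, delta, n = 5, 0.1, 100_000
f = lambda z: np.abs(z).sum()                      # non-smooth, Lipschitz test function

x = rng.standard_normal(d)
u = rng.standard_normal((n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)      # uniform on the sphere S
v = u * rng.random((n, 1)) ** (1.0 / d)            # uniform in the ball B

f_hat = np.abs(x + delta * v).sum(axis=1).mean()
grad_hat = (d / delta) * ((np.abs(x + delta * u).sum(axis=1) - f(x))[:, None] * u).mean(axis=0)

print("|f_hat - f(x)| =", abs(f_hat - f(x)), " (at most delta*G)")
print("grad estimate  ~", np.round(grad_hat, 2), " compare sign(x) =", np.sign(x))
```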

Alternatively, if we run GD directly on the original $f$, there is still a convergence guarantee, but we must take a weighted average of the iterates:
with $\eta_t = \frac{2}{\alpha(t+1)}$ and iterates $x_1, \ldots, x_t$,

$$f\!\left( \frac{1}{t} \sum_{k=1}^{t} \frac{2k}{t+1}\, x_k \right) - f(x^*) \le \frac{2G^2}{\alpha(t+1)}$$

Proof omitted.

Case 3. reduction to general convex functions (non-smooth, non-strongly convex)

Applying both reductions at once yields an $\tilde{O}(d/\sqrt{t})$ method, though it depends on the dimension $d$.
In the OCO setting a more general $O(1/\sqrt{t})$ algorithm will be given.

Fenchel duality

For convex optimization problems, we analyze the case where $f$ is non-differentiable or takes infinite values.
