Newton's method:

Lipschitz continuous:


\[\forall \bm x,\forall\bm y, |f(\bm x)-f(\bm y)|\leqslant\mathcal L\|\bm x-\bm y\|_2 \]

Cosine similarity, cosine distance:

\[\cos(\bm a,\bm b)=\frac{\bm a\cdot\bm b}{\|\bm a\|_2\|\bm b\|_2} \in [-1,1] \]
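A minimal sketch of the cosine formula in plain Python, assuming both vectors are non-zero sequences of equal length. The function name `cosine_similarity` and the sample vectors are illustrative choices, not part of the notes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # parallel vectors
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])    # anti-parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])   # perpendicular vectors
print(same, opposite, orthogonal)
```

The three sample pairs cover the two extreme values and the midpoint of the cosine.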

Hamming distance:

\[\operatorname{ham\_dist}(\bm a,\bm b)=\sum_i I(a_i\ne b_i) \in \{0,1,2,\dots\} \]

\(I(x\ne y)\) equals 1 when \(x\ne y\) and 0 otherwise.

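Levenshtein distance (edit distance):

A simple example of the difference between the Hamming distance and the Levenshtein distance: for 'flaw' and 'lawn', the Hamming distance is 4 and the edit distance is 2 (delete 'f', then append 'n').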
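The 'flaw' / 'lawn' example can be reproduced with a short sketch. The function names are illustrative; the Levenshtein routine is the standard row-by-row dynamic program with unit costs for insertion, deletion and substitution.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)

def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    # prev[j] is the edit distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]

print(hamming_distance("flaw", "lawn"))      # 4
print(levenshtein_distance("flaw", "lawn"))  # 2
```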

Jaro-Winkler distance:

Euclidean distance:

\[d(\bm a,\bm b)=\|\bm a-\bm b\| \]

Chebyshev approximation problem:

For \(x_1,x_2\in\mathbb R^n\) and \(\theta\in\mathbb R\) , the points \(y=\theta x_1+(1-\theta)x_2=x_2+\theta(x_1-x_2)\) form the line through \(x_1\) and \(x_2\) . (\(\theta\) is not restricted to \([0,1]\).)

Line segment:
\(\theta x_1+(1-\theta)x_2, \theta\in[0,1]\) (i.e. \(x_2+\theta(x_1-x_2)\) ) is the line segment between the points \(x_1, x_2\in\mathbb R^n\) .

Affine set:
A set \(C\subseteq\mathbb R^n\) is affine if for any \(x_1, x_2\in C\) and any \(\theta\in\mathbb R\) , we have \(x_1+\theta(x_2-x_1)\in C\) . (A convex set only requires this for \(\theta\in[0,1]\).)


Affine combination:

\[\sum_{i=1}^k\theta_i x_i, \text { with } \sum_{i=1}^k\theta_i=1 \]

Affine hull:
The set of all affine combinations of points in a set \(C\subseteq\mathbb R^n\) , denoted \(\mathrm{aff}\, C\) . (The affine hull is an affine set.)

\[\mathrm{aff}\, C= \left\{\sum_{i=1}^k\theta_i x_i \,\middle|\, k\in\mathbb N^+,\ x_i\in C,\ \sum_{i=1}^k\theta_i=1\right\} \]


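Convex combination:

\[\sum_{i=1}^k\theta_i x_i, \text { with } \theta_i\ge0, \sum_{i=1}^k\theta_i=1 \]

Convex set:
A set \(C\) is convex if \(\forall x_1,x_2\in C, \theta\in [0,1]\) , we have \(\theta x_1+(1-\theta)x_2\in C\) . (In the definition of an affine set, \(\theta\) ranges over all of \(\mathbb R\).)

Convex hull:
The convex hull of a set \(C\) , denoted \(\mathrm{conv}\, C\) , is the set of all convex combinations of points in \(C\) ; it is the smallest convex set containing \(C\) (the convex hull is a convex set). That is,

\[\mathrm{conv}\, C=\left\{\sum_{i=1}^k\theta_i x_i \,\middle|\, k\in\mathbb N^+,\ x_i\in C,\ \theta_i\in[0,1],\ \sum_{i=1}^k\theta_i=1\right\} \]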



Cone (nonnegative homogeneous):
A set \(C\) is called a cone if for every \(x\in C\) and \(\theta\ge0\) , we have \(\theta x\in C\) .


Convex cone:
A set \(C\) is a convex cone if it contains every nonnegative combination of any two of its points:

\[\forall x_1,x_2\in C,\ \forall \theta_1,\theta_2\ge 0:\quad \theta_1 x_1+\theta_2 x_2\in C \]

Geometrically, the points \(\theta_1x_1+\theta_2x_2\) with \(\theta_1,\theta_2\ge 0\) fill the region enclosed by the two rays from the origin 0 through \(x_1\) and \(x_2\) .

proper cone:
A cone \(K\subseteq \R^n\) is called a proper cone if it satisfies the following:

  • convex.
  • closed.
  • solid (has nonempty interior).
  • pointed (contains no line, or equivalently, \(x\in K\wedge -x\in K\implies x=0\) , or equivalently \(K\cap -K = \{0\}\) ).

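Pointed:
Intuitive representation: the set contains no full line, only rays. Both figures below show cones. The upper one is not pointed, because it is made up of whole lines rather than rays emanating from an apex; there one can also see that \(K\cap -K=K=-K\) . The lower cone is pointed.

(The definition of a proper cone varies with context and author.)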

Generalized inequality (a partial ordering on \(\R^n\) ):
A proper cone \(K\subseteq \R^n\) can be used to define a generalized inequality, which is a partial ordering on \(\R^n\) . We associate with the proper cone \(K\) the partial ordering on \(\R^n\) defined by

\[x\preceq_K y \iff y-x\in K \]

and write \(y\succeq_K x\) for \(x\preceq_K y\) .

strict partial ordering:

\[x\prec_K y \iff y-x\in K^\circ \]

where \(K\) is a proper cone, and \(K^\circ\) denotes the interior of \(K\) .

When \(K=\R_+=[0, +\infty)\) , the partial ordering \(\preceq_K\) is the usual ordering \(\le\) on \(\R\) , and the strict partial ordering \(\prec_K\) is the usual strict ordering \(<\) on \(\R\) .

properties of generalized inequality:

  • preserved under addition.

\[x\preceq_K y \wedge u\preceq_K v \implies x+u \preceq_K y+v \]

  • transitive.

\[x\preceq_K y \wedge y\preceq_K z \implies x\preceq_K z \]

  • preserved under nonnegative scaling.

\[x\preceq_K y \wedge \alpha\ge 0 \implies \alpha x \preceq_K \alpha y \]

  • reflexive. \(x \preceq_K x\) .
  • antisymmetric.

\[x \preceq_K y \wedge y \preceq_K x \implies x=y \]

  • preserved under limits.

\[(\forall i\in\N^+: x_i \preceq_K y_i)\wedge \lim_{i\to \infty}x_i = x \wedge \lim_{i\to\infty}y_i = y \implies x \preceq_K y \]
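A minimal sketch of a generalized inequality, assuming \(K\) is the nonnegative orthant \(\R^n_+\) (a proper cone chosen here only for illustration), so that \(x\preceq_K y\) reduces to componentwise \(\le\). The helper names `leq_orthant` and `add` are hypothetical.

```python
def leq_orthant(x, y):
    """Return True if y - x lies in the nonnegative orthant, i.e. x <=_K y."""
    return all(yi - xi >= 0 for xi, yi in zip(x, y))

def add(u, v):
    """Componentwise sum of two vectors given as tuples."""
    return tuple(ui + vi for ui, vi in zip(u, v))

x, y = (1, 2), (2, 3)
u, v = (0, -1), (1, 0)

print(leq_orthant(x, y) and leq_orthant(u, v))   # True: both premises hold
print(leq_orthant(add(x, u), add(y, v)))         # True: preserved under addition

# The ordering is only partial: these two points are not comparable.
p, q = (1, 2), (3, 1)
print(leq_orthant(p, q), leq_orthant(q, p))      # False False
```

The last line shows why this is a partial rather than a total ordering: neither \(q-p\) nor \(p-q\) lies in \(K\).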

Conic combination (nonnegative linear combination):

\[\sum_i\theta_ix_i, \theta_i\ge0 \]

Conic hull:

\[\left\{\sum_i\theta_ix_i|x_i\in C,\theta_i\ge0\right\} \]



  • The empty set ∅, any single point (i.e., singleton) \(\{x_0\}\) , and the whole space \(\mathbb R^n\) are affine (hence, convex) subsets of \(\mathbb R^n\) .
  • Any line is affine. If it passes through zero, it is a subspace, hence also a convex cone.
  • A line segment is convex, but not affine (unless it reduces to a point).
  • A ray, which has the form \(\{x_0+\theta v|\theta\ge0\}\) , where \(v\ne0\) , is convex, but not affine. It is a convex cone if its base \(x_0\) is 0.
  • Any subspace is affine, and a convex cone (hence convex).

A linear function satisfies the equalities:

\[f(\alpha x)=\alpha f(x),\ \forall\alpha\in\mathbb R; \qquad f(x+y)=f(x)+f(y) \]

A linear function \(f:\mathbb R^n\mapsto \mathbb R^m\) has the form \(\bm f(\bm x)=\bm A\bm x\) .

Affine function: A function \(f: \mathbb R^n\mapsto\mathbb R^m\) is affine if it is a sum of a linear function and a constant, i.e. it has the form \(\bm f(\bm x)=\bm A\bm x+\bm b\) . Equivalently, a function \(f(x)\) is affine if and only if \(g(x)=f(x)-f(0)\) is linear.

Convex function

In short: a function \(f\) whose domain is a convex set and which satisfies \(f[\theta x+(1-\theta)y]\le \theta f(x)+(1-\theta)f(y), \forall\theta\in[0,1]\) for all \(x, y\) in the domain.

A function \(f:\mathbb R^n\mapsto\mathbb R\) is convex if domain of \(f\) , i.e. \(\mathrm{dom} f\) , is convex, and \(\forall x,y\in \bold{dom} f\) and \(\forall \theta\in[0,1]\) , we have

\[f[\theta x+(1-\theta)y]\le \theta f(x)+(1-\theta)f(y). \]

If a function \(f:\R^n\mapsto \R\) is differentiable( \(\nabla f\) exists), then \(f\) is convex if and only if \(\bold{dom} f\) is convex and

\[f(\bm y)\ge f(\bm x)+\nabla f(\bm x)^T (\bm y-\bm x), \forall \bm x, \bm y\in\bold{dom}f \]


If a function \(f: \R^n \mapsto \R\) is twice differentiable ( \(\nabla^2 f\) exists), then \(f\) is convex if and only if \(\bold{dom} f\) is convex and the Hessian matrix is positive semidefinite, i.e.

\[\nabla^2 f(\bm x)\succeq 0, \forall \bm x \in \bold{dom} f \]


strictly convex:
A function \(f: \R^n\mapsto \R\) is strictly convex if and only if \(\mathrm{dom} f\) (domain of \(f\) ) is convex, and

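\[\forall x,y\in \mathrm{dom} f, x\neq y, \forall \theta\in (0,1): f(\theta x+(1-\theta)y)\lt \theta f(x)+(1-\theta) f(y) \]


Geometric meaning: for any two points \((x, f(x))\) and \((y, f(y))\) on the graph of a convex function, the chord (the line segment between the two points) lies on or above the graph. (The graph between the two points is \(f(\theta x+(1-\theta)y), \theta\in[0,1]\) , while the chord is \(\theta f(x)+(1-\theta)f(y), \theta\in[0,1]\) .) Note that the graph of a convex function looks like a bowl or a valley, which is why convex functions are also called "convex downward".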

concave function:

\[f(\alpha x+\beta y)\ge \alpha f(x)+\beta f(y), \forall \alpha,\beta\in\mathbb R \text{ with } \alpha+\beta=1,\alpha\ge0,\beta\ge0 \\ \Big\Updownarrow \\ f[\theta x+(1-\theta)y]\ge\theta f(x)+(1-\theta)f(y), \forall \theta\in[0,1] \]

Or, \(f\) is concave if \(-f\) is convex.


Affine functions are the only functions which are both convex and concave.

Note that functions which are both convex and concave are affine; they need not be linear. In contrast, functions which are both quasiconvex and quasiconcave are called quasilinear, and there is currently no standard term "quasiaffine function".

The image of a convex set under an affine function is convex, and so is its inverse image. If \(S\subseteq\mathbb R^n\) is convex and \(f:\mathbb R^n\mapsto\mathbb R^m\) is an affine function, then the image of \(S\) under \(f\) , \(f(S)=\{f(x)|x\in S\}\) , is convex.

Operations that preserve convexity:

  1. Non-negative weighted sum.
  2. Composition with an affine mapping.
  3. Pointwise maximum and supremum.
  4. Composition with functions satisfying some properties.

\[f = h\circ \bm g \]

\(f: \R^n\mapsto \R; h: \R^p\mapsto \R; \bm g: \R^n\mapsto\R^p, \bm g(\bm x)=[g_1(\bm x), g_2(\bm x),...,g_p(\bm x)]^T\)
\(f\) is convex if \(h\) is convex and non-decreasing in each argument, and each \(g_i\) is convex.
5. Perspective of a function.
Let \(f: \R^n\mapsto \R\) , then the perspective of \(f\) is the function \(g: \R^{n+1}\mapsto\R\) given by

\[g(\bm x, t)= t\cdot f\left(\frac{\bm x}{t}\right) \]

with domain

\[\operatorname{\bold{dom}} g = \{(\bm x, t)| \frac{\bm x}{t}\in \operatorname{\bold{dom}} f, t\gt 0\} \]

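The perspective operation preserves convexity.

  • A nonnegative linear combination of convex functions is convex.
  • The composition of a convex function with an affine mapping is convex.

Examples of convex functions: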

Componentwise-max of a vector: \(f(\bm x) = \max_i x_i\) is convex


\[\forall \theta\in[0,1]: \\ \begin{aligned} f(\theta\bm x+(1-\theta)\bm y) & =\max_i (\theta x_i+(1-\theta) y_i) \\ & \le \max_i \theta x_i + \max_i (1-\theta)y_i \\ & = \theta \max_i x_i + (1-\theta) \max_i y_i \\ & = \theta f(\bm x) + (1-\theta) f(\bm y) \end{aligned} \]
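The inequality just derived can also be sanity-checked numerically. The sketch below samples random points and weights and records the largest value of \(f(\theta\bm x+(1-\theta)\bm y)-[\theta f(\bm x)+(1-\theta)f(\bm y)]\) . The dimension, sampling range and seed are arbitrary choices, and a random check is evidence, not a proof.

```python
import random

def f(x):
    """Componentwise maximum of a vector."""
    return max(x)

random.seed(0)
worst_gap = float("-inf")
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(4)]
    y = [random.uniform(-5, 5) for _ in range(4)]
    t = random.random()
    z = [t * xi + (1 - t) * yi for xi, yi in zip(x, y)]
    # Convexity requires this gap to be <= 0 for every sample.
    gap = f(z) - (t * f(x) + (1 - t) * f(y))
    worst_gap = max(worst_gap, gap)

print(worst_gap <= 1e-12)  # True
```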

Quasiconvex function


[Def] Let \(f:\R^n\mapsto \R\) be a function whose domain \(\bold{dom} f\) is a convex set. \(f\) is quasiconvex if it satisfies:

\[f(\theta x+(1-\theta)y)\le \max\{f(x), f(y)\}, \forall x,y\in\bold{dom} f\subseteq\R^n, \forall \theta\in[0,1] \]


Equivalent definition:
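[Def] A function \(f: \R^n \mapsto \R\) is called a quasiconvex function if its domain and all its sublevel sets are convex, i.e. \(\bold{dom} f\) is convex and \(S_\alpha=\{x\in \bold{dom} f: f(x)\le \alpha\}\) is convex for every \(\alpha\in \R\) .

To read this definition, picture the graph of the function: the x-axis carries the independent variable (which may be multi-dimensional) and the y-axis carries the real-valued dependent variable. For any horizontal line \(y=\alpha\) (parallel to the x-axis), the set of x values at which the graph lies on or below the line must be a convex set (for a one-dimensional x, this means a single interval).


A convex function is always quasiconvex, but a quasiconvex function may or may not be convex.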
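As a concrete illustration, the sketch below uses \(f(x)=\sqrt{|x|}\) , a standard example that is quasiconvex but not convex (this particular function is an assumption of the example, not taken from the notes). It shows one violation of the convexity inequality and then checks the quasiconvexity inequality on random samples.

```python
import math
import random

def f(x):
    return math.sqrt(abs(x))

# Convexity fails: at the midpoint of x=0 and y=4 the graph lies above the chord.
x, y, t = 0.0, 4.0, 0.5
mid_value = f(t * x + (1 - t) * y)        # f(2) = sqrt(2)
chord_value = t * f(x) + (1 - t) * f(y)   # 0.5*0 + 0.5*2 = 1
print(mid_value > chord_value)            # True, so f is not convex

# Quasiconvexity holds on random samples: f(z) never exceeds max(f(x), f(y)).
random.seed(0)
violations = 0
for _ in range(10_000):
    x, y, t = random.uniform(-10, 10), random.uniform(-10, 10), random.random()
    z = t * x + (1 - t) * y
    if f(z) > max(f(x), f(y)) + 1e-9:     # small tolerance for rounding
        violations += 1
print(violations)                         # 0
```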

An example of a quasiconvex function which is not convex.

An example of a non-quasiconvex function.

strictly quasiconvex function:

\[f(\theta x+(1-\theta)y)< \max\{f(x), f(y)\}, \forall x,y\in\bold{dom}f, x\ne y, \forall \theta\in(0,1) \]



[Def] quasiconcave: The negative of a quasiconvex function.

\[f(\theta x+(1-\theta)y)\ge \min\{f(x), f(y)\}, \forall x,y\in\bold{dom} f, \forall\theta\in[0,1] \]

strictly quasiconcave:

\[f(\theta x+(1-\theta)y) > \min\{f(x), f(y)\},\forall x,y\in\bold{dom}f\subseteq\R^n, x\ne y, \forall \theta\in(0,1) \]

the (joint) density function of the normal distribution is quasiconcave but not concave.

rank of positive semidefinite matrix is quasiconcave:

\[\mathrm{rank}(X+Y)\ge \min\{\mathrm{rank} X, \mathrm{rank} Y\}; X,Y\in \R^{n\times n}_+ \]


[Def] quasilinear: To be both quasiconvex and quasiconcave.

An example of quasilinear functions.

Other quasilinear functions:

  • logarithm. \(\log x\)
  • ceiling function \(\mathrm{ceil}(x)=\inf\{n\in \Z: n\ge x\}\)

Quasiconvex functions can be concave or discontinuous.

Differentiable quasiconvex functions

Let \(f: \R^n\mapsto \R\) be a differentiable function with convex domain. \(f\) is quasiconvex if and only if

\[f(x)\ge f(y) \implies \left[\nabla f(x)\right]^T(y-x)\le 0, \quad \forall x,y\in \bold{dom} f \]


In more detail, for a scalar variable: take any two points and call the one with the larger function value \((x, f(x))\) and the other \((y, f(y))\) . If \(x\le y\) , the trend of the curve from the larger value \(f(x)\) towards the smaller value \(f(y)\) is given by \(\nabla f(x)\) , and it must be "downhill", i.e. \(\nabla f(x)\le 0\) . If \(x\ge y\) , the trend is given by \(-\nabla f(x)\) , which must again be downhill, i.e. \(-\nabla f(x)\le 0\) . Combining the two cases gives the first-order condition above.


Let \(f: \R^n\mapsto \R\) be a twice differentiable function. If \(f\) is quasiconvex, then

\[y^T \nabla f(x)=0 \implies y^T \nabla^2 f(x) y \ge 0, \forall x\in \bold{dom} f, \forall y\in \R^n \]


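Comparison of the first-order and second-order derivative conditions for convex and quasiconvex functions on a scalar domain (i.e. \(\R\) , not the multi-dimensional \(\R^n\) ):

  • | convex | quasiconvex


  • For a convex function: \(f'(x)=0\) implies that \(f(x)\) is a minimum.
  • For a quasiconvex function: ……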

Convex Conjugate:
(Legendre-Fenchel transform)
The convex conjugate of a function \(f: \R^d \mapsto \R\) is a function \(f^*\) defined by

\[f^*(\bm s) = \sup_{\bm x\in \R^d}(\langle \bm s, \bm x \rangle - f(\bm x)) \]

where \(\langle \cdot, \cdot \rangle\) is the general inner product.
The definition of the convex conjugate requires \(f\) to be neither convex nor differentiable.

The conjugate of the conjugate, \(f^{**}\) , is the original function when \(f\) is convex and closed (lower semicontinuous); in general \(f^{**}\le f\) .
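As a small worked example (chosen here for illustration, not taken from the notes), consider the scalar function \(f(x)=\frac12 x^2\) . The supremum in the definition is attained where the derivative of \(sx-\frac12 x^2\) with respect to \(x\) vanishes:

\[\begin{aligned} f^*(s) &= \sup_{x\in\R}\left(sx-\tfrac12 x^2\right) \\ \frac{\mathrm d}{\mathrm d x}\left(sx-\tfrac12 x^2\right) &= s-x=0 \implies x=s \\ f^*(s) &= s\cdot s-\tfrac12 s^2=\tfrac12 s^2 \end{aligned} \]

So this particular \(f\) is its own conjugate.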

Log-convex, Log-concave

[Def] A function f is called logarithmically convex, log-convex, if \(\log f(x)\) is convex. And log-concave in a similar way.

\[f \text{ is log-convex} \iff \frac1f \text{ is log-concave} \]

Jensen's inequality

Finite form:

Let \(f: \R^d\mapsto \R\) be a convex function, \(\bm x_i\in \mathrm{\bold{ dom}} f \subseteq \R^d\) , then

\[\forall n\in\N^+, \forall \theta_i\ge 0 \text{ with } \sum_{i=1}^n\theta_i>0: \\ f\left(\frac{\sum_{i=1}^n \theta_i \bm x_i}{\sum_{i=1}^n \theta_i}\right) \le \frac{\sum_{i=1}^n \theta_i f(\bm x_i)}{\sum_{i=1}^n \theta_i} \]

If \(f\) is strictly convex and all \(\theta_i>0\) , the equality holds if and only if \(\bm x_1 = \bm x_2= ... =\bm x_n\) .

And the inequality is reversed if \(f\) is concave, which is

\[f\left(\frac{\sum_{i=1}^n \theta_i \bm x_i}{\sum_{i=1}^n \theta_i}\right) \ge \frac{\sum_{i=1}^n \theta_i f(\bm x_i)}{\sum_{i=1}^n \theta_i} \]

As a special case, with \(\sum_{i=1}^n \theta_i = 1\) , the inequality is

\[f\left(\sum_{i=1}^n \theta_i \bm x_i\right) \le \sum_{i=1}^n \theta_i f(\bm x_i) \]

Probabilistic form:

Let \(f: \R^d\mapsto \R\) be a convex function, \(X\) a random variable, then

\[f(\mathbb E[X]) \le \mathbb E[f(X)] \]

where \(\mathbb E[\cdot]\) denotes the expectation of a random variable.
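A quick numerical illustration of the probabilistic form, assuming the convex function \(f(x)=e^x\) and a sample drawn from a standard normal distribution (both are arbitrary choices for this sketch). Expectations are replaced by sample means, for which the finite form of the inequality applies directly.

```python
import math
import random

random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

mean = sum(samples) / len(samples)
lhs = math.exp(mean)                                      # f(E[X])
rhs = sum(math.exp(s) for s in samples) / len(samples)    # E[f(X)]

print(lhs <= rhs)  # True
```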

Optimization Problems

General optimization problems

can be expressed in the form:

\[\begin{aligned} \min &&& f(x) \\ \text{s.t. } &&& g_i(x)\le 0, i=1,...,m \\ &&& h_i(x)=0, i=1,...,p \end{aligned} \]

\(f:\R^n\mapsto\R\) is known as the objective function, \(x\in\R^n\) the optimization variable(s), \(g_i(x)\le 0\) the inequality constraints, \(g_i:\R^n\mapsto\R\) the inequality constraint functions, \(h_i(x)=0\) the equality constraints, and \(h_i:\R^n\mapsto\R\) the equality constraint functions.

The domain of the optimization problem:

\[\mathcal D = \bold{dom} f \cap \bigcap_i \bold{dom} g_i \cap \bigcap_i \bold{dom} h_i \]

A point \(x\in \mathcal D\) is said to be feasible if it satisfies the inequality constraints and the equality constraints. The set of all feasible points is called the feasible set, feasible region, solution space, or search space.

Scaling the objective and the constraint functions leaves the optimal points unchanged:

\[\begin{aligned} \text{min } & \theta f(x) \\ \text{s.t. } & \alpha_i g_i(x)\le 0, i=1,...,m \\ & \beta_i h_i(x) =0, i=1,...,p \end{aligned} \]

with \(\theta>0, \alpha_i>0, \beta_i\ne 0\) .

For an unconstrained problem with a differentiable objective function, a necessary condition for a point to be optimal (maximum or minimum) is that the gradient with respect to the optimization variable is zero there; the condition is also sufficient when \(f\) is convex (for minimization) or concave (for maximization):

\[\nabla_x f(x)=0 \]

Substitution of variables

Let \(\phi: \R^n\mapsto\R^n, z=\phi(x)\) be a bijective function, and \(\tilde f(z)=\tilde f(\phi(x))=f(x), \tilde g_i(z)=g_i(x), \tilde h_i(z)=h_i(x)\) , then a standard optimization problem on variable \(x\) can be transformed to

\[\begin{aligned} \min &&& \tilde f(z) \\ \text{s.t. } &&& \tilde g_i(z)\le 0 \\ &&& \tilde h_i(z)=0 \end{aligned} \]

Slack variables

By introducing slack variables, we can transform inequality constraints on \(g_i(x)\) to equality constraints:

\[\begin{aligned} \min &&& f(x) \\ \text{s.t. } &&& s_i \ge 0\\ &&& g_i(x)+s_i = 0 \\ &&& h_i(x)=0 \end{aligned} \]

Optimizing over some variables

\[\inf_{x,y} f(x,y)\equiv\inf_x \inf_y f(x,y) = \inf_y \inf_x f(x,y), x\in\R^n, y\in\R^m \]

In other words, we can always minimize a function by first minimizing over some of the variables, then minimizing over the remaining ones.
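The identity can be checked on a finite grid, where each infimum becomes a minimum over grid points. The objective `f` below is a hypothetical example chosen for this sketch.

```python
grid = [i / 10 for i in range(-30, 31)]

def f(x, y):
    return (x - 1) ** 2 + (x - y) ** 2 + y ** 2

joint = min(f(x, y) for x in grid for y in grid)                 # min over (x, y)
nested_xy = min(min(f(x, y) for y in grid) for x in grid)        # min_x min_y
nested_yx = min(min(f(x, y) for x in grid) for y in grid)        # min_y min_x

print(joint == nested_xy == nested_yx)  # True
```

All three expressions take the minimum over the same finite set of values, so they agree exactly.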

Convex Optimization Problems

Compared with a general optimization problem, a convex optimization problem has three additional requirements:

  • The objective function is convex.
  • The inequality constraint functions are convex.
  • The equality constraint functions are affine.

A convex problem is one of the form:

\[\begin{aligned} \text{min } & f(x) \\ \text{s.t. } & g_i(x)\le 0, i=1,...,m \\ & h_i(x)=a_i^Tx+b_i=0 , i=1,...,p \end{aligned} \]

where \(f, g_i\) are convex.

Abstract Convex Optimization Problems

Minimizing a convex function over a convex set is sometimes called the abstract convex optimization problem. (This form does not require the inequality constraint functions to be convex or the equality constraint functions to be affine.)

An optimality criterion for a differentiable objective function

If the objective function \(f\) in a convex optimization problem is differentiable, and let \(S\) be the feasible set, i.e. \(S=\{x|g_i(x)\le 0 \wedge h_j(x)=0\}\) , then \(x^*\) is optimal if and only if

\[\nabla f(x^*)^T(y-x^*)\ge 0, \forall y\in S \]
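A one-dimensional sketch of this criterion, assuming the hypothetical problem of minimizing \(f(x)=(x-2)^2\) over the feasible set \(S=[0,1]\) , whose minimizer is \(x^*=1\) . The feasible set is sampled on a grid, so the check is only approximate.

```python
def grad_f(x):
    """Gradient of f(x) = (x - 2)**2."""
    return 2 * (x - 2)

feasible = [i / 100 for i in range(101)]   # grid over S = [0, 1]

def satisfies_criterion(x_candidate):
    # grad f(x)^T (y - x) >= 0 must hold for every feasible y.
    return all(grad_f(x_candidate) * (y - x_candidate) >= 0 for y in feasible)

print(satisfies_criterion(1.0))   # True: x* = 1 is optimal
print(satisfies_criterion(0.5))   # False: moving towards y = 1 decreases f
```

At \(x^*=1\) the gradient is \(-2\) and every feasible direction \(y-x^*\) is nonpositive, so the product is nonnegative even though the gradient itself is not zero.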

Geometrical interpretation: ...

Linear optimization problems (Linear Programming)

objective function: affine
constraint functions: affine

A linear optimization problem, also called a linear program, is a convex optimization problem whose objective and constraint (inequality and equality) functions are all affine.

\[\begin{aligned} \text{min } & c^T x+d \\ \text{s.t. } & Gx\le h \\ & Px=q \end{aligned} \]

Quadratic optimization problems (Quadratic programming)

objective function: quadratic
constraint functions: affine

A convex optimization problem is called a quadratic program if the objective function is a convex quadratic and the constraint functions are affine.

\[\begin{aligned} \text{min } & x^T Ax+c^T x+d \\ \text{s.t. } & Gx\le h \\ & Px=q \end{aligned} \]

where \(A\) is symmetric positive semidefinite, so that the objective is convex.

Quadratically constrained quadratic program:
objective function: quadratic
inequality constraint function: quadratic
equality constraint function: affine

Quasiconvex optimization problems

Compared with a general optimization problem, a quasiconvex optimization problem has the following additional requirements:

  • The objective function is quasiconvex.
  • The inequality constraint functions are convex. (A quasiconvex constraint function can be replaced by a convex constraint function with the same 0-sublevel set.)
  • The equality constraint functions are affine.

Solving a quasiconvex optimization problem can be reduced to solving a sequence of convex optimization problems.

Duality

Lagrangian Multipliers

A general optimization problem:

\[\begin{aligned} \min &&& f(x) \\ \text{s.t. } &&& g_i(x)\le 0, i=1,...,m \\ &&& h_i(x)=0, i=1,...,p \end{aligned} \]

where \(x\in\R^n, f:\R^n\mapsto\R, g_i: \R^n\mapsto\R, h_i: \R^n\mapsto\R\) . (no convexity assumption)

Define the Lagrangian \(\mathcal L: \R^n\times\R^m\times\R^p\mapsto\R\) as

\[\mathcal L(x,\lambda,\nu)=f(x)+\sum_{i=1}^m\lambda_i g_i(x)+\sum_{i=1}^p\nu_i h_i(x) \]

with Inequality Constraints

Lagrange multipliers: The idea of Lagrange multipliers is to replace the constraints with a linear term in the objective, which converts the constrained problem (the primal problem) into the dual problem, whose inner minimization over \(\bm x\) is unconstrained.

the primal problem:

\[\begin{aligned} \min_\bm x &&& f(\bm x) \\ \text{ s.t. } &&& g_i(\bm x)\leqslant 0, \forall i = 1,2,...,m. \end{aligned} \]


applying Lagrange multipliers:

\[\begin{aligned} \mathfrak L(\bm x, \bm \lambda) &= f(\bm x) + \sum_i^m \lambda_i g_i(\bm x) \\ &= f(\bm x) + \bm \lambda^T\bm g(\bm x) \end{aligned} \]

the associated Lagrangian dual problem:

\[\begin{aligned} \max_{\bm\lambda} &&& \mathfrak D(\bm \lambda) \\ \text{ s.t. } &&& \bm \lambda \geqslant \bm 0 \end{aligned} \]

where \(\mathfrak D(\bm\lambda)= \min_{\bm x} \mathfrak L(\bm x, \bm \lambda)\) .

In contrast to the original optimization problem, which has constraints, \(\min_{\bm x}\mathfrak L(\bm x, \bm \lambda)\) is an unconstrained optimization problem for a given value of \(\bm \lambda\) . If solving \(\min_{\bm x}\mathfrak L(\bm x,\bm \lambda)\) is easy, then the overall problem is easy to solve. The reason is that, for each fixed \(\bm x\) , \(\mathfrak L(\bm x,\bm\lambda)\) is affine in \(\bm\lambda\) , so \(\mathfrak D(\bm\lambda)\) is a pointwise minimum of affine functions and hence concave, even though \(f(\cdot)\) and \(g_i(\cdot)\) may be nonconvex. The maximum of a concave function can be efficiently computed.

with Equality Constraints

the primal problem:

\[\min_{\bm x} f(\bm x) \\ \text{ s.t. } h_i(\bm x)=0, i=1,2,...,m. \]


the Lagrangian

\[\mathfrak L(\bm x, \bm \lambda)=f(\bm x)+\bm\lambda^T \bm h(\bm x) \]

Taking the partial derivatives of \(\mathfrak L(\bm x,\bm \lambda)\) with respect to \(\bm x, \bm \lambda\) and setting them to zero

\[\nabla_{\bm x} \mathfrak L(\bm x, \bm\lambda)= 0 \\ \nabla_{\bm \lambda}\mathfrak L(\bm x, \bm\lambda)=0 \]

The pairs \(\bm x, \bm \lambda\) that solve this system are the candidate solutions of the primal problem (the stationary points of the Lagrangian).
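As a small worked example (the problem is chosen here for illustration and is not part of the notes), minimize \(f(\bm x)=x_1^2+x_2^2\) subject to \(h(\bm x)=x_1+x_2-1=0\) :

\[\begin{aligned} \mathfrak L(\bm x,\lambda) &= x_1^2+x_2^2+\lambda(x_1+x_2-1) \\ \nabla_{\bm x}\mathfrak L &= \begin{bmatrix} 2x_1+\lambda \\ 2x_2+\lambda \end{bmatrix}=\bm 0 \implies x_1=x_2=-\frac{\lambda}{2} \\ \nabla_{\lambda}\mathfrak L &= x_1+x_2-1=0 \implies -\lambda-1=0 \implies \lambda=-1 \\ x_1 &= x_2=\frac12, \quad f(\bm x)=\frac12 \end{aligned} \]

Since \(f\) is convex and the constraint is affine, this stationary point is the minimizer.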

Notation for the supremum and the infimum:

\[\sup_{x\in X} f(x), \inf_{x\in X} f(x) \]

Max-Min inequality

For any function with two arguments, \(f: X \times Y \mapsto \R, (\bm x, \bm y) \mapsto f(\bm x, \bm y) \in \R\) , the max-min inequality states:

\[\max_{\bm x \in X} \min_{\bm y \in Y} f(\bm x, \bm y) \leqslant \min_{\bm y\in Y} \max_{\bm x\in X} f(\bm x, \bm y) \\ \Big\Updownarrow \\ \sup_{\bm x \in X} \inf_{\bm y \in Y} f(\bm x, \bm y) \leqslant \inf_{\bm y\in Y} \sup_{\bm x\in X} f(\bm x, \bm y) \]
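For finite sets \(X\) and \(Y\) the function \(f\) is just a table, and the inequality can be checked directly. In the sketch below the rows of the hypothetical matrix `payoff` index \(\bm x\) and the columns index \(\bm y\) .

```python
payoff = [
    [3, 1, 4],
    [1, 5, 9],
    [2, 6, 5],
]

# max over rows (x) of the min over columns (y)
max_min = max(min(row) for row in payoff)
# min over columns (y) of the max over rows (x)
min_max = min(max(col) for col in zip(*payoff))

print(max_min, min_max)      # 2 3
print(max_min <= min_max)    # True
```

Here the inequality is strict, which shows that equality needs extra assumptions on \(f\) , \(X\) and \(Y\) .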

MiniMax theorem

The equality holds if \(f, X, Y\) satisfy a strong max-min property (or a saddle-point property). Formally:
Let \(X\subset \R^n, Y\subset \R^m\) be compact convex sets. If \(f: X\times Y \mapsto \R\) is a continuous function that is concave-convex, i.e. (1) \(f(\cdot, \bm y): X \mapsto \R\) is concave for fixed \(\bm y\) , and (2) \(f(\bm x, \cdot): Y \mapsto \R\) is convex for fixed \(\bm x\) , then the equality in the max-min inequality holds, i.e.

\[\max_{\bm x \in X} \min_{\bm y \in Y} f(\bm x, \bm y) = \min_{\bm y\in Y} \max_{\bm x\in X} f(\bm x, \bm y) \]

Linear Programming

As a special case of convex optimization:

\[\min_{\bm x} \bm c^T \bm x \\ \text{ s.t. } \bm A \bm x \leqslant \bm b \]

The Lagrangian is given by

\[\mathfrak L(\bm x, \bm \lambda)=\bm c^T \bm x+\bm \lambda^T(\bm A \bm x-\bm b) \]

rewriting to

\[\mathfrak L(\bm x, \bm \lambda)=(\bm c^T+\bm \lambda^T \bm A)\bm x - \bm \lambda ^T \bm b \]

Taking the derivative with respect to x

\[\begin{aligned} \frac{\mathrm d}{\mathrm d \bm x} \mathfrak L(\bm x,\bm \lambda) &= \frac{\mathrm d}{\mathrm d \bm x} \left[ \bm c^T \bm x+\bm \lambda^T(\bm A \bm x-\bm b) \right] \\ &= \frac{\mathrm d}{\mathrm d \bm x}\left[ (\bm c^T+\bm \lambda^T \bm A)\bm x - \bm \lambda ^T \bm b \right] \\ &= \bm c+ \bm A^T\bm\lambda \end{aligned} \]

and setting it to 0

\[\bm c+\bm A^T \bm\lambda=\bm 0 \]

With the above condition, we can evaluate \(\min_{\bm x} \mathfrak L(\bm x,\bm \lambda)\) , i.e. the Lagrangian dual function. The Lagrangian is linear in \(\bm x\) , so the minimum is \(-\infty\) unless \(\bm c+\bm A^T\bm\lambda=\bm 0\) ; when the condition holds,

\[\begin{aligned} \mathfrak D(\bm\lambda) &= \min_{\bm x} \mathfrak L(\bm x, \bm \lambda) \\ &= \min_{\bm x} \left[ (\bm c^T+\bm \lambda^T \bm A)\bm x - \bm \lambda ^T \bm b \right] \\ &= \bm 0^T\bm x - \bm \lambda ^T \bm b && \text{(using } \bm c+\bm A^T\bm\lambda=\bm 0 \text{)} \\ &= -\bm \lambda^T \bm b \\ &= -\bm b^T \bm\lambda \end{aligned} \]

the dual optimization problem:

\[\begin{aligned} \max_{\bm \lambda} \quad & \mathfrak D(\bm \lambda) = -\bm b^T\bm\lambda \\ \text{s.t.} \quad & \bm c+\bm A^T\bm \lambda=\bm 0 && \text{(constraint introduced when solving } \min_{\bm x}\mathfrak L \text{)} \\ & \bm \lambda \geqslant \bm 0 && \text{(constraint carried by the Lagrange multipliers)} \end{aligned} \]
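A tiny numerical check of the primal and dual linear programs, assuming a hypothetical one-variable instance: minimize \(x\) subject to \(1\le x\le 3\) , written as \(\bm A\bm x\leqslant\bm b\) with \(\bm A=[-1, 1]^T\) and \(\bm b=[-1, 3]^T\) . Both problems are solved by brute-force grid search, which is only workable because the example is so small.

```python
# Primal: min c*x  s.t.  A x <= b, with a scalar variable x.
c = 1.0
A = [-1.0, 1.0]   # one row per constraint, single column
b = [-1.0, 3.0]   # encodes 1 <= x <= 3

xs = [i / 100 for i in range(-500, 501)]
primal = min(c * x for x in xs if all(a * x <= bi for a, bi in zip(A, b)))

# Dual: max -b^T lam  s.t.  c + A^T lam = 0, lam >= 0.
# The equality reads c - lam1 + lam2 = 0, so lam1 = c + lam2; scan lam2 >= 0.
lam2s = [i / 100 for i in range(0, 501)]
dual = max(-(b[0] * (c + l2) + b[1] * l2) for l2 in lam2s)

print(primal, dual)  # 1.0 1.0
```

The two optimal values coincide, as expected for a feasible and bounded linear program.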

Quadratic Programming

objective function: convex quadratic
constraints: affine

The primal problem:

\[\min_{\bm x} \frac12 \bm x^T \bm Q \bm x + \bm c^T \bm x \\ \text{ s.t. } \\ \bm A \bm x \leqslant \bm b \]

where \(\bm Q\) is a square symmetric positive definite matrix (positive definiteness guarantees that \(\bm Q^{-1}\) , used below, exists).

The Lagrangian is given by

\[\mathfrak L (\bm x, \bm \lambda)=\frac12 \bm x^T\bm Q\bm x+\bm c^T\bm x +\bm \lambda^T (\bm A\bm x-\bm b) \]

Taking the derivative with respect to x and setting it to 0

\[\frac{\mathrm d}{\mathrm d \bm x} \mathfrak L(\bm x, \bm \lambda)=\bm Q\bm x+\bm c+ \bm A^T \bm \lambda = \bm 0 \\ \implies \\ \bm x= -\bm Q^{-1}(\bm A^T\bm \lambda+\bm c) \]

Substituting into the Lagrangian, we get the dual function

\[\begin{aligned} \mathfrak D(\bm \lambda) &= \frac12 (\bm A^T\bm \lambda+\bm c)^T\bm Q^{-T}\bm Q\bm Q^{-1}(\bm A^T\bm \lambda+\bm c)-(\bm A^T\bm \lambda+\bm c)^T \bm Q^{-1} (\bm A^T\bm \lambda+\bm c) - \bm \lambda^T \bm b \\ &= -\frac12 (\bm A^T\bm \lambda+\bm c)^T \bm Q^{-1} (\bm A^T\bm \lambda+\bm c) - \bm \lambda^T \bm b \end{aligned} \]

The dual optimization problem

\[\begin{aligned} \max_{\bm\lambda} \quad & \mathfrak D(\bm \lambda) = -\frac12 (\bm A^T\bm \lambda+\bm c)^T \bm Q^{-1} (\bm A^T\bm \lambda+\bm c) - \bm \lambda^T \bm b \\ \text{s.t.} \quad & \bm\lambda \geqslant \bm 0 \end{aligned} \]
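The dual function can be sanity-checked on a scalar instance. The sketch assumes the hypothetical problem of minimizing \(\frac12 x^2\) subject to \(x\ge 1\) , i.e. \(Q=1\) , \(c=0\) , \(A=-1\) , \(b=-1\) in the notation above, and solves both problems by grid search.

```python
q, c = 1.0, 0.0      # objective 0.5*q*x**2 + c*x, with q > 0
a, b = -1.0, -1.0    # constraint a*x <= b, i.e. x >= 1

xs = [i / 1000 for i in range(-3000, 3001)]
primal = min(0.5 * q * x * x + c * x for x in xs if a * x <= b)

def dual_fn(lam):
    """Scalar version of D(lam) = -0.5*(A^T lam + c)^T Q^{-1} (A^T lam + c) - lam^T b."""
    r = a * lam + c
    return -0.5 * r * r / q - lam * b

lams = [i / 1000 for i in range(0, 3001)]   # lam >= 0
dual = max(dual_fn(lam) for lam in lams)

print(primal, dual)  # 0.5 0.5
```

The primal optimum is attained at \(x=1\) and the dual optimum at \(\lambda=1\) , with equal optimal values.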





\[\mathbb S=\{ \bm x \mid \forall i,\ g^{(i)}(\bm x)=0 \ \wedge\ \forall j,\ h^{(j)}(\bm x)\leqslant0 \} \]

\[L(\bm x, \bm\lambda,\bm\alpha)=f(\bm x)+\sum_i\lambda_ig^{(i)}(\bm x)+\sum_j\alpha_j h^{(j)}(\bm x) \]


\[\min_{\bm x}\max_{\bm\lambda}\max_{\bm\alpha,\bm\alpha\geq0} \; L(\bm x,\bm\lambda,\bm\alpha) \sim \min_{\bm x\in\mathbb S}f(\bm x) \]


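KKT conditions (necessary conditions for a point to be optimal, not always sufficient): TODO

The function being minimized or maximized is called the objective function. In a minimization problem, the objective function is also called the cost function, loss function, or error function.

Least Squares

Partial Least Squares, PLS


  1. Applicable when there are fewer samples than features (the rank of the data matrix is less than the number of features).
  2. Applicable when the features are strongly linearly correlated.
  3. ……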


