Support Vector Machine

Two Margins

Functional Margin \(\gamma\) and Geometric Margin \(\hat{\gamma}\)

\[\gamma^{(i)}=y^{(i)}(w^Tx^{(i)}+b), \qquad \hat{\gamma}^{(i)}=\frac{\gamma^{(i)}}{||w||} \]

In fact, the functional margin \(\gamma\) alone is not a good measure of confidence: \(w\) and \(b\) can be multiplied by any factor \(k\) without changing the decision boundary, so \(\gamma\) can be made arbitrarily large. The geometric margin does not suffer from this, because the scaling cancels in the division by \(||w||\).

Basic Knowledge for SVM

The goal is to maximize the geometric margin \(\hat{\gamma}\):

\[\max_{\hat{\gamma},w,b} \hat{\gamma} \\ s.t.\ \frac{y^{(i)}(w^Tx^{(i)}+b)}{||w||} \ge \hat{\gamma}\ \ \ i=1,...,m \]

Because of the scaling freedom above, we can fix the functional margin to \(1\), and the problem can be transformed into an equivalent, nicer one:

\[\min_{w,b} \frac{1}{2}{||w||}^{2} \\ s.t.\, y^{(i)}(w^Tx^{(i)}+b) \ge 1,\ \ \ i=1,...,m \]

This optimization problem has a convex quadratic objective and linear constraints, so it can be solved with standard quadratic programming (QP) software.
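
For concreteness, here is a minimal sketch that feeds this primal QP to the cvxopt solver on a hypothetical toy dataset (the data, and the tiny regularization term on \(b\) added only for numerical stability, are illustrative assumptions):

import numpy as np
from cvxopt import matrix, solvers

# Hypothetical, linearly separable toy data: rows of X are points, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

# Variable z = [w; b].  Objective (1/2)||w||^2 -> (1/2) z^T P z; the 1e-8 entry for b
# keeps P positive definite for the solver and barely changes the problem.
P = matrix(np.diag(np.r_[np.ones(n), 1e-8]))
q = matrix(np.zeros(n + 1))

# Constraints y_i (w^T x_i + b) >= 1, rewritten as G z <= h
G = matrix(-y[:, None] * np.c_[X, np.ones(m)])
h = matrix(-np.ones(m))

z = np.array(solvers.qp(P, q, G, h)["x"]).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b)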

Lagrange duality

Primal optimization problem

\[\min_{w} f(w) \\ s.t.\, g_{i}(w) \le 0,\, i=1,...,k \\ h_{i}(w)=0,\, i=1,...,l \]

We introduce the generalized Lagrangian \(\mathcal{L}(w,\alpha,\beta)=f(w)+\sum_{i=1}^{k}\alpha_i g_i(w)+\sum_{i=1}^{l}\beta_i h_i(w)\) and define \(\theta_{\mathcal{P}}(w)=\max_{\alpha,\beta:\alpha_i\ge 0}\mathcal {L}(w,\alpha,\beta)\)

Then we get

\[\min_w \theta_{\mathcal{P}}(w)=\min_w \max_{\alpha,\beta:\alpha_i\ge 0} \mathcal {L}(w,\alpha,\beta)=p^* \]

It has the same solution as our original primal problem: \(\theta_{\mathcal{P}}(w)=f(w)\) when \(w\) satisfies all the constraints, and \(\theta_{\mathcal{P}}(w)=+\infty\) otherwise.

The dual problem is

\[\max_{\alpha,\beta:\alpha_i\ge 0} \theta_{\mathcal{D}}(\alpha,\beta)=\max_{\alpha,\beta:\alpha_i\ge 0} \min_w \mathcal {L}(w,\alpha,\beta)=d^* \]

and \(d^* \le p^*\) (keep in mind: \(\max\min \le \min\max\)). Under certain conditions \(d^* = p^*\), so we can solve the dual problem in place of the primal.
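
A tiny numerical illustration of \(\max\min \le \min\max\) on an arbitrary made-up 2x2 table (nothing SVM-specific):

import numpy as np

# For any function of two arguments, max_y min_x f <= min_x max_y f.
# Rows index x, columns index y.
f = np.array([[3.0, 1.0],
              [2.0, 4.0]])

max_min = np.max(np.min(f, axis=0))   # max over y of (min over x)
min_max = np.min(np.max(f, axis=1))   # min over x of (max over y)
print(max_min, min_max, max_min <= min_max)   # 2.0 3.0 True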

For the SVM problem above (a convex optimization problem with only inequality constraints), the Lagrangian is

\[\mathcal{L}(w,b,\alpha)=\frac{1}{2}{||w||}^2-\sum_{i=1}^{m}\alpha_i[y^{(i)}(w^Tx^{(i)}+b)-1] \]

First, we fix \(\alpha\) and minimize the Lagrangian with respect to \(w\) and \(b\). Setting \(\nabla_w\mathcal{L}=0\) and \(\partial\mathcal{L}/\partial b=0\), we finally get:

\[w(\alpha)=\sum_{i=1}^{m}\alpha_i y^{(i)}x^{(i)} \\ \sum_{i=1}^m \alpha_i y^{(i)} = 0 \]

Substituting these back into the Lagrangian, we obtain

\[\mathcal{W}(\alpha)=\mathcal{L}(w,b,\alpha)=\sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m y^{(i)}y^{(j)}\alpha_i \alpha_j {(x^{(i)})}^Tx^{(j)} \]

Then we maximize \(\mathcal{W}(\alpha)\) subject to the constraints \(\alpha_i \ge 0\) and \(\sum_{i=1}^m \alpha_i y^{(i)} = 0\). Note that \({(x^{(i)})}^Tx^{(j)}\) is simply an inner product, which is what will allow the kernel trick later.

When a new input \(x\) arrives and we want to predict its label, \(w^Tx+b=\sum_{i=1}^m\alpha_iy^{(i)}\text{InnerProduct}(x^{(i)},x)+b\)

Only the \(\alpha_i\) of the support vectors are nonzero, so at prediction time we only need to compute the inner products between the new \(x\) and the support vectors.
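
Putting the pieces together, a minimal sketch (again with cvxopt and a hypothetical toy dataset) that solves this dual QP and then classifies a new point using only the support vectors:

import numpy as np
from cvxopt import matrix, solvers

# Hypothetical toy data, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# Dual: max sum(alpha) - 1/2 sum y_i y_j alpha_i alpha_j <x_i, x_j>,
# written in the standard QP form  min 1/2 a^T P a + q^T a
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(m))
G = matrix(-np.eye(m))            # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))      # equality constraint sum_i alpha_i y^(i) = 0
b_eq = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

sv = alpha > 1e-6                                 # support vectors have alpha_i > 0
w = ((alpha * y)[sv, None] * X[sv]).sum(axis=0)   # w = sum_i alpha_i y^(i) x^(i)
b = float(np.mean(y[sv] - X[sv] @ w))             # recover b from the support vectors

x_new = np.array([1.0, 1.5])
score = np.sum(alpha[sv] * y[sv] * (X[sv] @ x_new)) + b   # only support vectors enter
print("prediction:", np.sign(score))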

Kernel

For example, the Gaussian kernel is:

\[K(x,z)=\exp({-\frac{||x-z||^2}{2\sigma^2}}) \]

The Gaussian kernel implicitly maps the input into an infinite-dimensional feature space, and its values always lie in \((0,1]\).
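
A direct sketch of this formula in Python (the default sigma is an illustrative choice):

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); equals 1 when x == z and decays toward 0
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

print(gaussian_kernel(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 1.0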

A kernel function is valid \(\iff\) there exists a feature mapping \(\phi\) such that the kernel can be written as \(K(x,z)=\phi(x)^T\phi(z)\)

Kernel Matrix K

For any \(m\) sample points, build the \(m\times m\) matrix whose \((i,j)\) element is \(K_{ij}=K(x^{(i)},x^{(j)})\); this is the kernel matrix of the kernel \(K\).

A kernel function is valid \(\iff\) its kernel matrix is positive semi-definite (\(K \succeq 0\)) for every choice of sample points (Mercer's theorem).
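
A quick numerical check of this property for the Gaussian kernel on random sample points (a sketch; the eigenvalues of a PSD matrix are non-negative up to round-off):

import numpy as np

def kernel_matrix(X, kernel):
    # K[i, j] = kernel(x_i, x_j) over all pairs of sample points
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)   # Gaussian kernel with sigma = 1
X = np.random.randn(6, 3)                                # 6 random sample points
K = kernel_matrix(X, rbf)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))           # True: K is positive semi-definite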

Support Vector Expansion

\[w^Tx+b=\sum_{i=1}^{m}\alpha_iy^{(i)}K(x^{(i)},x)+b \]

This expansion depends only on the support vectors.
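
A minimal sketch of this decision function, assuming the support vectors, their labels, their multipliers \(\alpha_i\), the bias \(b\), and a kernel function have already been obtained from training:

def svm_decision(x, support_X, support_y, support_alpha, b, kernel):
    # w^T phi(x) + b = sum over support vectors of alpha_i y^(i) K(x^(i), x) + b
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(support_alpha, support_y, support_X)) + b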

Select Kernel

Linear kernel, Gaussian kernel, Polynomial kernel

L1 regularization soft margin SVM

Slack variables \(\xi_i\) allow some points to violate the margin, which makes the classifier tolerant of outliers and non-separable data and helps get rid of over-fitting.

Formally,

\[\min \frac{1}{2}{||w||}^2+C\sum_{i=1}^{m}\xi_{i} \\ s.t.\, y^{(i)}(w^Tx^{(i)}+b) \ge 1-\xi_{i}\ \ \ i=1,...,m \\ \xi_i \ge 0 \]

Carrying out the same dual derivation as before, we get:

\[\max_{\alpha} W(\alpha) = \sum_{i=1}^{m}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{m}y^{(i)}y^{(j)}\alpha_i\alpha_j \text{InnerProduct}(x^{(i)},x^{(j)}) \\ s.t.\,0\le\alpha_i\le C,\ \ \ i=1,...,m \\ \sum_{i=1}^{m}\alpha_iy^{(i)}=0 \]
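
Compared with the hard-margin dual, only the box constraint on \(\alpha\) changes. A sketch using cvxopt (the function name and the default linear kernel are illustrative assumptions):

import numpy as np
from cvxopt import matrix, solvers

def soft_margin_dual(X, y, C=1.0, kernel=lambda a, b: a @ b):
    # Solve max W(alpha) s.t. 0 <= alpha_i <= C and sum_i alpha_i y^(i) = 0
    m = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(m))
    # Stack -alpha <= 0 and alpha <= C into a single inequality G alpha <= h
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(np.asarray(y, dtype=float).reshape(1, -1))
    b_eq = matrix(0.0)
    return np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()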

SMO Algorithm

SMO stands for Sequential Minimal Optimization.

Coordinate ascent

loop until convergence {
	for i = 1 to m {
		alpha_i <- argmax_{alpha_i} W(alpha_1, ..., alpha_i, ..., alpha_m)   // all other alpha_j held fixed
	}
}

That is, on each inner step we treat a single \(\alpha_i\) as the only variable, hold the others fixed, and repeat until convergence.
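
A runnable sketch of coordinate ascent on a toy concave quadratic objective (a hypothetical \(W\), chosen only because each coordinate update has a closed form; it is not the SVM dual):

import numpy as np

# Maximize W(a) = -1/2 a^T Q a + c^T a, with Q symmetric positive definite
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([1.0, 1.0])
a = np.zeros(2)

for _ in range(100):                   # "loop until convergence"
    for i in range(len(a)):
        # dW/da_i = c_i - (Q a)_i = 0, solved for a_i with the other coordinates fixed
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a, np.linalg.solve(Q, c))        # both approach the exact maximizer Q^{-1} c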

In SMO we cannot update a single \(\alpha_i\) by itself, because the constraint \(\sum_{i=1}^{m}\alpha_iy^{(i)}=0\) would be violated; instead we choose a pair \(\alpha_i, \alpha_j\) at every step and optimize them jointly, which still has a closed-form solution.
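
A sketch of that inner pair update, following the standard simplified-SMO formulas; it assumes the kernel matrix K and the error cache E[k] = f(x_k) - y_k are maintained elsewhere, and it omits the update of the threshold \(b\) and the heuristics for choosing the pair:

import numpy as np

def smo_pair_update(alpha, y, K, i, j, E, C):
    # Jointly re-optimize (alpha_i, alpha_j) while keeping sum_k alpha_k y^(k) = 0
    # and 0 <= alpha_k <= C.  Returns True if alpha actually changed.
    if i == j:
        return False
    # Feasible segment [L, H] for alpha_j implied by the equality and box constraints
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    if L >= H:
        return False
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the constraint line
    if eta <= 0:
        return False
    a_j_new = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)
    if abs(a_j_new - alpha[j]) < 1e-8:
        return False
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j_new)   # preserve sum_k alpha_k y^(k) = 0
    alpha[j] = a_j_new
    return True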
