Support Vector Machine
Two Margins
Functional Margin \(\gamma\) and Geometrical Margin \(\hat{\gamma}\)
In fact, the functional margin \(\gamma\) is not a good measure of confidence on its own, because of one property: \(w\) and \(b\) can be multiplied by any factor \(k > 0\) without changing the decision boundary, which makes \(\gamma\) arbitrarily large.
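For reference, for a training example \((x^{(i)}, y^{(i)})\) with \(y^{(i)} \in \{-1, +1\}\), the two margins are usually defined as follows (using this document's notation, where \(\gamma\) is the functional and \(\hat{\gamma}\) the geometrical margin):
\[ \gamma^{(i)} = y^{(i)}\left(w^T x^{(i)} + b\right), \qquad \hat{\gamma}^{(i)} = y^{(i)}\left(\frac{w^T x^{(i)} + b}{\|w\|}\right) = \frac{\gamma^{(i)}}{\|w\|}. \]
The geometrical margin is invariant to rescaling of \((w, b)\), which is why it is the quantity we actually want to maximize.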
Basic Knowledge for SVM
The goal is to maximize the geometrical margin \(\hat{\gamma}\):
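A standard way to write this, over training data \(\{(x^{(i)}, y^{(i)})\}_{i=1}^m\), is
\[ \max_{\hat{\gamma},\,w,\,b} \ \hat{\gamma} \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma}, \ \ i = 1, \dots, m, \quad \|w\| = 1, \]
where the constraint \(\|w\| = 1\) makes the functional and geometrical margins coincide.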
This problem can be transformed into a nicer, convex one:
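Because \((w, b)\) can be rescaled freely, we can fix the functional margin of the closest points to 1; maximizing \(\hat{\gamma} = 1/\|w\|\) is then equivalent to
\[ \min_{w,\,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \ \ i = 1, \dots, m. \]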
This optimization problem can be solved with standard quadratic programming (QP) methods.
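As an illustration only (a minimal sketch, not part of the original notes), the primal above can be handed to a generic QP solver; this assumes the cvxopt package and a tiny, linearly separable toy dataset:

```python
# Hard-margin SVM primal as a QP: minimize 0.5*||w||^2
# s.t. y_i (w^T x_i + b) >= 1, using cvxopt's qp solver.
import numpy as np
from cvxopt import matrix, solvers

# Toy 2-D data: two separable clusters, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

# Decision variable z = [w_1, ..., w_n, b].
# Objective 0.5*||w||^2 -> 0.5 * z^T P z with P = diag(1, ..., 1, eps);
# the tiny eps on the b entry just keeps the KKT system numerically non-singular.
P = matrix(np.diag([1.0] * n + [1e-8]))
q = matrix(np.zeros((n + 1, 1)))

# Constraints y_i (w^T x_i + b) >= 1, rewritten as G z <= h.
G = matrix(-y[:, None] * np.hstack([X, np.ones((m, 1))]))
h = matrix(-np.ones((m, 1)))

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h)
z = np.array(sol["x"]).ravel()
w, b = z[:n], z[n]
print("w =", w, "b =", b)
```

In practice the dual problem derived below is solved instead, because it only involves inner products and therefore admits kernels.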
Lagrange duality
Primal optimization problem
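In its general form (restating the standard setup so that \(\mathcal{L}\) below is defined), the primal problem and its Lagrangian are
\[ \min_w f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \ i = 1, \dots, k, \qquad h_i(w) = 0, \ i = 1, \dots, l, \]
\[ \mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w), \]
with Lagrange multipliers \(\alpha_i \ge 0\) and \(\beta_i\).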
We introduce \(\theta_{\mathcal{P}}(w)=\max_{\alpha,\beta:\alpha_i\ge 0}\mathcal{L}(w,\alpha,\beta)\).
Then we get \(p^* = \min_w \theta_{\mathcal{P}}(w) = \min_w \max_{\alpha,\beta:\alpha_i\ge 0}\mathcal{L}(w,\alpha,\beta)\), which has the same solution as our original primal problem: \(\theta_{\mathcal{P}}(w)\) equals \(f(w)\) when \(w\) satisfies all the constraints and \(+\infty\) otherwise.
The dual problem swaps the order of max and min: \(d^* = \max_{\alpha,\beta:\alpha_i\ge 0}\theta_{\mathcal{D}}(\alpha,\beta) = \max_{\alpha,\beta:\alpha_i\ge 0}\min_w \mathcal{L}(w,\alpha,\beta)\), and \(d^* \le p^*\) (keep in mind: max min \(\le\) min max). For the SVM problem below, which is convex with affine constraints, strong duality holds, so \(d^* = p^*\) and we can solve the dual instead.
The optimization problem for the SVM (which is a convex optimization problem) is:
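Writing the constraints as \(g_i(w, b) = 1 - y^{(i)}(w^T x^{(i)} + b) \le 0\) and forming the Lagrangian (only \(\alpha\) appears, since there are no equality constraints):
\[ \mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right]. \]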
First, we fix \(\alpha\) and minimize the formula above with respect to \(w\) and \(b\). Setting the derivatives to zero, we get:
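\[ \nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \ \Rightarrow \ w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0. \]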
Substituting these back into the Lagrangian, we obtain:
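\[ \mathcal{W}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \, {(x^{(i)})}^T x^{(j)}. \]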
Then we maximize \(\mathcal{W}(\alpha)\) subject to the constraints \(\alpha_i \ge 0\) and \(\sum_{i=1}^m \alpha_i y^{(i)} = 0\). Note that \({(x^{(i)})}^Tx^{(j)}\) can be seen as an inner product \(\langle x^{(i)}, x^{(j)}\rangle\).
When a new \(x\) arrives as input for prediction, \(w^Tx+b=\sum_{i=1}^m\alpha_iy^{(i)}\langle x^{(i)},x\rangle+b\), and we predict its class from the sign of this value.
Only the \(\alpha_i\) for support vectors are nonzero, so when we use the SVM to predict, we only need to compute the inner products between the new \(x\) and the support vectors.
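For illustration (an assumed sketch with made-up numbers, not the notes' own code), prediction from the dual expansion can be written as:

```python
# Predict sign( sum_i alpha_i y_i <x_i, x> + b ), summing only over the
# support vectors (the points with alpha_i > 0).
import numpy as np

def predict(x_new, support_X, support_y, support_alpha, b, kernel):
    """support_* contain only the training points with alpha_i > 0."""
    score = sum(a * y * kernel(x_i, x_new)
                for a, y, x_i in zip(support_alpha, support_y, support_X))
    return np.sign(score + b)

linear_kernel = lambda u, v: float(np.dot(u, v))   # plain inner product

# Tiny made-up example: one support vector on each side of the boundary.
support_X = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [1.0, -1.0]
support_alpha = [0.5, 0.5]
b = 0.0
print(predict(np.array([2.0, 0.5]), support_X, support_y, support_alpha, b, linear_kernel))  # 1.0
```

Swapping linear_kernel for any valid kernel gives the kernelized classifier discussed next.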
Kernel
A kernel replaces the inner product \(\langle x^{(i)}, x^{(j)}\rangle\) with \(K(x^{(i)},x^{(j)})=\phi(x^{(i)})^T\phi(x^{(j)})\) for some feature mapping \(\phi\). For example, the Gaussian kernel is:
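\[ K(x, z) = \exp\!\left(-\frac{\|x - z\|^2}{2\sigma^2}\right). \]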
The Gaussian kernel corresponds to a feature mapping \(\phi\) into an infinite-dimensional space, and its values always lie in \((0,1]\).
\(K\) is a valid kernel function \(\iff\) there exists a feature mapping \(\phi\) such that the kernel can be represented as \(K(x,z)=\phi(x)^T\phi(z)\).
Kernel Matrix K
For any \(m\) sample points, define the \(m \times m\) matrix whose \((i,j)\) element is \(K_{ij}=K(x^{(i)},x^{(j)})\); this is the kernel matrix for the kernel \(K\).
\(K\) is a valid kernel function \(\iff\) its kernel matrix is symmetric positive semi-definite (\(K \succeq 0\)) for every finite set of sample points (Mercer's theorem).
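As a quick check of this property (a minimal sketch, not from the original notes), the following builds a Gaussian kernel matrix for random points and verifies that its eigenvalues are non-negative:

```python
# Build the Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
# for random points and confirm it is positive semi-definite.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))             # 20 sample points in R^3
K = gaussian_kernel_matrix(X)

eigvals = np.linalg.eigvalsh(K)          # eigenvalues of a symmetric matrix
print("min eigenvalue:", eigvals.min())  # >= 0 up to floating-point error
```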
Support Vector Expansion
This expansion expression depends only on the support vectors.
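Concretely, restating the prediction formula above in kernelized form, the decision function is
\[ f(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i y^{(i)} K(x^{(i)}, x) + b, \]
where \(SV\) is the set of indices with \(\alpha_i > 0\).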
Select Kernel
Common choices: the linear kernel \(K(x,z)=x^Tz\), the Gaussian kernel \(K(x,z)=\exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)\), and the polynomial kernel \(K(x,z)=(x^Tz+c)^d\).
L1 regularization soft margin SVM
The soft margin lets some examples violate the margin, which makes the model less sensitive to outliers and non-separable data and helps avoid over-fitting.
Formally, the primal problem becomes:
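With slack variables \(\xi_i \ge 0\) and a regularization parameter \(C > 0\), the standard form is
\[ \min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ \ i = 1, \dots, m. \]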
After forming the Lagrangian and repeating the derivation, we get the dual:
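\[ \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \ \ \sum_{i=1}^{m} \alpha_i y^{(i)} = 0. \]
The only change from the hard-margin dual is the upper bound \(C\) on each \(\alpha_i\).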
SMO Algorithm
SMO stands for Sequential Minimal Optimization.
Coordinate ascent
loop until convergence {
    for i = 1 to m {
        alpha_i <- argmax_{alpha_i} W(alpha_1, ..., alpha_{i-1}, alpha_i, alpha_{i+1}, ..., alpha_m)
    }
}
That is, at each inner step we hold all the other variables fixed and optimize \(\mathcal{W}\) with respect to a single \(\alpha_i\), repeating until convergence.
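As a toy illustration (an assumed example, not part of the original notes), here is coordinate ascent maximizing a simple concave quadratic \(W(\alpha) = -\frac{1}{2}\alpha^T A \alpha + c^T\alpha\), where each one-dimensional \(\arg\max\) has a closed form:

```python
# Toy coordinate ascent on W(alpha) = -0.5 * alpha^T A alpha + c^T alpha
# (A symmetric positive definite), maximizing one coordinate at a time.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])    # symmetric positive definite (made up)
c = np.array([1.0, 1.0])

def W(alpha):
    return -0.5 * alpha @ A @ alpha + c @ alpha

alpha = np.zeros(2)
for _ in range(50):                       # "loop until convergence"
    for i in range(len(alpha)):           # "for i = 1 to m"
        # dW/dalpha_i = c_i - (A alpha)_i = 0 with the other coordinates fixed
        # => alpha_i = (c_i - sum_{j != i} A_ij alpha_j) / A_ii
        rest = A[i] @ alpha - A[i, i] * alpha[i]
        alpha[i] = (c[i] - rest) / A[i, i]

print("coordinate ascent:", alpha, " W =", W(alpha))
print("closed-form optimum:", np.linalg.solve(A, c))  # should match
```

Each sweep applies the closed-form one-variable update in turn, matching the pseudocode above.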
In SMO, we instead update a pair \((\alpha_i, \alpha_j)\) at every step, because the equality constraint \(\sum_{i=1}^m \alpha_i y^{(i)} = 0\) means a single \(\alpha_i\) cannot be changed while the others stay fixed.
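Sketching the standard pair update: with all the other multipliers held fixed, the equality constraint pins the pair to a line,
\[ \alpha_i y^{(i)} + \alpha_j y^{(j)} = \zeta \quad (\text{a constant}), \]
so \(\mathcal{W}\) becomes a quadratic function of \(\alpha_j\) alone; it is maximized in closed form and the result is clipped back to the box \(L \le \alpha_j \le H\) implied by \(0 \le \alpha_i, \alpha_j \le C\).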