Notes : <Hands-on ML with Sklearn & TF> Chapter 5
- capable of performing linear and nonlinear classification, regression, and even outlier detection
- particularly well suited for classification of complex but small or medium-sized datasets
Linear SVM Classification
- instead of just fitting any line that separates the two classes, the goal is to fit the widest possible street between them (large margin classification)
- instances located off the street do not affect the decision boundary at all; it is fully determined by the instances on the edge of the street, which are called support vectors
- SVMs are sensitive to feature scales; apply feature scaling first (e.g. StandardScaler)
Soft Margin Classification
- strictly requiring all instances to be off the street and on the correct side is called hard margin classification: it only works if the data is linearly separable, and it is sensitive to outliers
- in Scikit-Learn's SVM classes you can control margin violations with the C hyperparameter: a smaller C value leads to a wider street but more margin violations
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target']==2).astype(np.float64)  # Iris-Virginica
svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])  # unlike LogisticRegression, does not output probabilities
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
svc_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),
])
m = len(X)
C = 1
sgd_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C))),
])
svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))
- SGDClassifier converges more slowly than LinearSVC, but it can handle huge datasets and online classification tasks; SVC(kernel='linear') is slower still, especially on large training sets
- the LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean; this is done automatically if you scale the data with StandardScaler
- set loss='hinge' explicitly, since it is not the default value
- for better performance set dual=False, unless there are more features than training instances (see the sketch below)
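- A minimal sketch of the two tips above (hyperparameter values are illustrative, not from the book); note that in scikit-learn the hinge loss is only available with the dual formulation, so dual=False pairs with the default squared hinge loss:
from sklearn.svm import LinearSVC
# loss='hinge' must be requested explicitly (the default is 'squared_hinge')
lin_clf_hinge = LinearSVC(C=1, loss='hinge')   # dual formulation (default dual=True)
# primal form: usually faster when there are many more instances than features
lin_clf_primal = LinearSVC(C=1, dual=False)    # keeps the default squared hinge loss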
Nonlinear SVM Classification
- one approach to handling nonlinear datasets is to add more features, such as polynomial features; the result may then be linearly separable
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge')),
])
polynomial_svm_clf.fit(X, y)
import matplotlib.pyplot as plt
def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)
def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)
plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
- you can use grid search to find good hyperparameter values; a common approach is to do a coarse search first, then a finer search around the best values found (see the sketch below)
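- A minimal sketch of coarse-then-fine grid search over the pipeline above (the parameter grids are illustrative, not from the book):
from sklearn.model_selection import GridSearchCV
# coarse search over a wide range of C values
coarse_search = GridSearchCV(polynomial_svm_clf, {'svm_clf__C': [0.1, 1, 10, 100]}, cv=3)
coarse_search.fit(X, y)
# finer search around the best coarse value
fine_search = GridSearchCV(polynomial_svm_clf, {'svm_clf__C': [5, 8, 10, 12, 15]}, cv=3)
fine_search.fit(X, y)
print(fine_search.best_params_)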
Adding Similarity Features
- add similarity features, computed with a similarity function that measures how much each instance resembles a particular landmark
- Gaussian Radial Basis Function: $$\phi_{\gamma}(x, \mathscr{l})=\exp(-\gamma \left \| x-\mathscr{l} \right \|^2)$$ where $\left \| x-\mathscr{l} \right \|$ is the distance from the instance to the landmark; the new features computed this way can replace the original ones
- how to select landmarks? the simplest way is to create a landmark at the location of each and every instance in the dataset (see the sketch below)
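- A minimal sketch of computing the similarity features by hand, using every training instance as a landmark (the gamma value is just for illustration; it reuses X, y and the classes imported above):
from sklearn.metrics.pairwise import rbf_kernel
gamma = 0.3
landmarks = X                                         # one landmark per training instance
X_similarity = rbf_kernel(X, landmarks, gamma=gamma)  # shape (m, m): one feature per landmark
landmark_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=10, loss='hinge')),
])
landmark_svm_clf.fit(X_similarity, y)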
Gaussian RBF Kernel
- with a large training set, computing all these additional features is very expensive
- the magic of the kernel trick is that you get the same result as if you had actually added the features, without having to add them
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X,y)
%matplotlib inline
from sklearn.svm import SVC
gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)
svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ])
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)
plt.figure(figsize=(14, 9))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)
plt.show()
- increasing $\gamma$ makes the bell-shaped curve narrower, so each instance's range of influence is smaller and the decision boundary becomes more irregular
- decreasing $\gamma$ makes the bell-shaped curve wider, so each instance's range of influence is larger and the decision boundary becomes smoother
- so $\gamma$ acts like a regularization hyperparameter, similar to $C$: reduce it if the model is overfitting, increase it if it is underfitting
- other kernels are used much more rarely or are specialized for particular data structures
- string kernels are often used for text documents or DNA sequences, e.g. the string subsequence kernel, or kernels based on the Levenshtein distance
- how to choose a kernel:
- try the linear kernel first; LinearSVC is fast
- or SVC(kernel='linear'), which is equivalent but slower
- if the training set is not too large, try the Gaussian RBF kernel
- then try other kernels, comparing them with cross-validation and grid search (see the sketch below)
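- A minimal sketch of comparing kernels with cross-validation (the kernel list and cv value are illustrative; it reuses X, y and the classes imported above):
from sklearn.model_selection import cross_val_score
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    kernel_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel=kernel)),
    ])
    scores = cross_val_score(kernel_clf, X, y, cv=5)
    print(kernel, scores.mean())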
Computational Complexity
Class | Time Complexity | Out-of-core Support | Scaling required | Kernel Trick |
---|---|---|---|---|
LinearSVC | $O(m\times n)$ | No | Yes | No |
SGDClassifier | $O(m\times n)$ | Yes | Yes | No |
SVC | $O(m^2 \times n)$ to $O(m^3 \times n)$ | No | Yes | Yes |
SVM Regression
- unlike the classifier, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street); the width of the street is controlled by the hyperparameter $\epsilon$
from sklearn.svm import LinearSVR
import numpy.random as rnd
rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X,y)
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)
def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices where the condition holds (nonzero)
svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # actually the instances outside the margin
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)
eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # used below to annotate epsilon at this x position
def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)
plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
    '', xy=(eps_x1, eps_y_pred), xycoords='data',
    xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
    textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5},
)
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
Under the Hood
Decision Function and Predictions
- $$ w^T \cdot x + b = w_1 x_1 + \cdots + w_n x_n + b \ \ ;\ \ bias\ term=b,\ feature\ weights\ vector=w \\ \widehat{y}=\left\{\begin{matrix} 0\ \ if\ w^T \cdot x +b <0\\ 1\ \ if\ w^T \cdot x +b \geq 0 \end{matrix}\right.$$
- Training a linear SVM classifier means finding the value of $w$ and $b$ that make this margin as wide as possible while avoiding margin violations or limiting them.
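- A minimal sketch (not from the book) evaluating $h(x)=w^T \cdot x + b$ by hand from a fitted LinearSVC and comparing it with decision_function (the data loading repeats the iris setup used just below):
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X_iris = iris['data'][:, (2, 3)]                    # petal length, petal width
y_iris = (iris['target'] == 2).astype(np.float64)   # Iris-Virginica
h_clf = LinearSVC(C=1, loss='hinge', max_iter=10000).fit(X_iris, y_iris)
w, b = h_clf.coef_[0], h_clf.intercept_[0]
x_new = np.array([5.5, 1.7])
print(w.dot(x_new) + b)                             # h(x): predict class 1 if >= 0
print(h_clf.decision_function([x_new]))             # should give the same value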
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target']==2).astype(np.float64)
from mpl_toolkits.mplot3d import Axes3D
def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1, x2, np.zeros_like(x1), color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^T \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)
svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)
fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])
plt.show()
Training Objective
- requiring all instances to be classified correctly (hard margin), with $t^{(i)}=-1$ for negative instances and $t^{(i)}=+1$ for positive instances, gives the constraint: $$t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1$$
- this becomes the constrained optimization problem $$ \underset{w,b}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w = \frac{1}{2}\left \| w \right \|^2 \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1,\ \ for\ i=1,2,3,...,m $$ We minimize $\frac{1}{2}\left \| w \right \|^2$ rather than $\left \| w \right \|$ because the latter is not differentiable at $w=0$ (see the note after this list for why a smaller $\left \| w \right \|$ means a wider margin)
- to get a soft margin, introduce a slack variable $\varsigma^{(i)} \geq 0$ for each instance, which measures how much that instance is allowed to violate the margin
- making $\left \| w \right \|$ small widens the margin; making the slack variables $\varsigma^{(i)}$ small reduces margin violations, but tends to shrink the margin, so the two objectives conflict
- the hyperparameter $C$ defines the trade-off between these two objectives
- $$ \underset{w,b,\varsigma}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m} \varsigma^{(i)} = \frac{1}{2}\left \| w \right \|^2 + C\sum_{i=1}^{m} \varsigma^{(i)} \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1-\varsigma^{(i)},\ \ and\ \varsigma^{(i)} \geq 0\ ,\ for\ i=1,2,3,...,m $$
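- Why minimizing $\left \| w \right \|$ widens the street (a short derivation, not spelled out in the notes above): the margins are the hyperplanes $w^T \cdot x + b = +1$ and $w^T \cdot x + b = -1$, and the distance between two parallel hyperplanes $w^T \cdot x = c_1$ and $w^T \cdot x = c_2$ is $\left | c_1 - c_2 \right | / \left \| w \right \|$, so $$ margin\ width = \frac{\left | (1-b)-(-1-b) \right |}{\left \| w \right \|} = \frac{2}{\left \| w \right \|} $$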
Quadratic Programming
- $$ \underset{p}{Minimize}\ \ \frac{1}{2} p^T \cdot H \cdot p + f^T \cdot p \\ subject\ to\ \ \ A \cdot p \leq b \\ where\ \left\{\begin{matrix} p\ \ is\ an\ n_p\ dimensional\ vector(=number\ of\ parameters)\\ H\ \ is\ an\ n_p\ \times\ n_p\ matrix\\ f\ \ is\ an\ n_p\ dimensional\ vector\\ A\ \ is\ an\ n_c\ \times\ n_p\ matrix(n_c=number\ of\ constraints)\\ b\ \ is\ an\ n_c\ dimensional\ vector \end{matrix}\right. $$
- use an off-the-shelf QP solver by passing it the preceding parameters to train a hard margin linear SVM classifier (one possible mapping is sketched below)
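- One way to map the hard margin problem onto these QP parameters (here $\dot{x}^{(i)}$ denotes $x^{(i)}$ with an extra bias feature $\dot{x}_0 = 1$, and the QP vector $b$ is distinct from the bias term): $n_p = n+1$ and $n_c = m$; $H$ is the $n_p \times n_p$ identity matrix except for a zero in the top-left cell, so the bias term is not regularized; $f = 0$ (an $n_p$-dimensional zero vector); $b = -1$ (an $n_c$-dimensional vector of $-1$s); and the $i^{th}$ row of $A$ is $a^{(i)} = -t^{(i)} \dot{x}^{(i)}$. The solution $p$ then contains the bias term followed by the feature weights.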
The Dual Problem
- a constrained optimization problem is known as the primal problem; it is possible to express a different but closely related problem, called its dual problem
- for the SVM problem, the primal and the dual have the same solution
- Dual form of the linear SVM objective: $$ \underset{\alpha }{minimize}\ \ \frac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} {x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}\\ subject\ to\ \ \alpha^{(i)} \geq 0\ \ for\ i=1,2,...,m $$
- from the dual solution to the primal ($n_s$ is the number of support vectors): $$ \widehat{w} = \sum_{i=1}^{m} {\widehat{\alpha}}^{(i)} t^{(i)} x^{(i)} \\ \widehat{b} = \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}({\widehat{w}}^T \cdot x^{(i)})) $$
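- A minimal sketch (not from the book) showing that, for a linear-kernel SVC, the primal $\widehat{w}$ can be recovered from the dual coefficients, since dual_coef_ stores ${\widehat{\alpha}}^{(i)} t^{(i)}$ for the support vectors (it reuses the iris X, y loaded above):
from sklearn.svm import SVC
linear_svc = SVC(kernel='linear', C=1).fit(X, y)
w_from_dual = linear_svc.dual_coef_[0].dot(linear_svc.support_vectors_)  # sum_i alpha_i t_i x_i
print(w_from_dual, linear_svc.coef_[0])  # the two should match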
Kernelized SVM
A kernel is a function capable of computing the dot product $\phi (a)^T \cdot \phi (b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$
$$
\begin{align*}
Linear &:\ \ \ K(a,b)=a^T \cdot b \\
Polynomial &:\ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\
Gaussian\ RBF &:\ \ \ K(a,b)=\exp(-\gamma \left \| a-b \right \|^2) \\
Sigmoid &:\ \ \ K(a,b)=\tanh(\gamma a^T \cdot b + r) \\
\end{align*}
$$
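- A minimal sketch (not from the book) of the kernel trick for the 2nd-degree polynomial kernel with $\gamma=1,\ r=0$: for the mapping $\phi(x)=(x_1^2,\ \sqrt{2}x_1x_2,\ x_2^2)$, the dot product $\phi(a)^T \cdot \phi(b)$ equals $(a^T \cdot b)^2$, so it can be computed from $a$ and $b$ directly:
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
a = np.array([2.0, 3.0])
b = np.array([-1.0, 0.5])
print(phi(a).dot(phi(b)))  # dot product in the transformed feature space
print(a.dot(b) ** 2)       # kernel K(a, b) = (a^T b)^2 gives the same value, no phi needed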
- as long as $K$ satisfies Mercer's conditions (e.g. it is continuous and symmetric in its arguments), a mapping $\phi$ with $K(a,b)=\phi (a)^T \cdot \phi (b)$ is guaranteed to exist; even if we do not know what $\phi$ is, we can still use $K$ as a kernel
- computing $\widehat w$ directly involves $\phi (x)$, which may be impossible to work with (it can be huge or even infinite-dimensional); plugging $\widehat w$ into the decision function instead gives an expression that only needs the kernel (a check of this is sketched below): $$ \begin{align*} h_{\widehat{w},\widehat{b}}(\phi (x^{(n)})) &= {\widehat{w}}^T \cdot \phi(x^{(n)})+\widehat b \\ &= (\sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \phi(x^{(i)}))^T \cdot \phi(x^{(n)}) +\widehat b \\ &= \sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} (\phi(x^{(i)})^T \cdot \phi(x^{(n)})) +\widehat b \\ &= \sum_{i=1,{\widehat{\alpha}}^{(i)} > 0}^{m}\widehat {\alpha}^{(i)} t^{(i)} K(x^{(i)},x^{(n)}) + \widehat b \\ \widehat{b} &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}({\widehat{w}}^T \cdot \phi(x^{(i)}))) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}(\sum_{j=1}^{m} \widehat {\alpha}^{(j)} t^{(j)} \phi(x^{(j)}))^T \cdot \phi(x^{(i)}))\\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}\sum_{j=1,{\widehat{\alpha}}^{(j)} > 0}^{m}\widehat {\alpha}^{(j)} t^{(j)} K(x^{(i)},x^{(j)})) \end{align*} $$
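- A minimal sketch (not from the book) checking the formula above against an RBF-kernel SVC: its decision function is $\sum_i {\widehat{\alpha}}^{(i)} t^{(i)} K(x^{(i)}, x^{(n)}) + \widehat{b}$, with dual_coef_ holding ${\widehat{\alpha}}^{(i)} t^{(i)}$, support_vectors_ holding the $x^{(i)}$, and intercept_ holding $\widehat{b}$ (the moons variables below are local to this sketch):
from sklearn.metrics.pairwise import rbf_kernel
X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)
rbf_svc = SVC(kernel='rbf', gamma=5, C=1).fit(X_moons, y_moons)
x_new = np.array([[0.5, 0.0]])
K = rbf_kernel(rbf_svc.support_vectors_, x_new, gamma=5)  # K(x^(i), x_new) for each support vector
manual = rbf_svc.dual_coef_.dot(K) + rbf_svc.intercept_   # sum_i alpha_i t_i K(x^(i), x_new) + b
print(manual.ravel(), rbf_svc.decision_function(x_new))   # the two should match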
Online SVMs
- online learning means learning incrementally, typically as new instances arrive (see the sketch after this list)
- for Linear SVM Classifier: $$ J(w,b)=\frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m}max(0,1-t^{(i)}(w^T\cdot x^{(i)}+b)) $$
- the hinge loss function is $max(0,\ 1-t)$: it equals $0$ when $t \geq 1$, and its slope is $-1$ when $t < 1$
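- A minimal sketch (not from the book) of incremental learning with SGDClassifier's partial_fit and hinge loss; the stream, batch size, and alpha value are illustrative:
from sklearn.linear_model import SGDClassifier
X_stream, y_stream = make_moons(n_samples=1000, noise=0.15, random_state=42)
online_clf = SGDClassifier(loss='hinge', alpha=0.001, random_state=42)
for start in range(0, len(X_stream), 100):          # feed the data in mini-batches
    online_clf.partial_fit(X_stream[start:start + 100],
                           y_stream[start:start + 100],
                           classes=[0, 1])          # classes must be given on the first call
print(online_clf.score(X_stream, y_stream))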