Python Machine Learning: Linear Models
http://www.dataguru.cn/portal.php?mod=view&aid=3514
Abstract: Recently I have been picking up Python on and off. Following my usual habit, I start from the application level to get up to speed quickly and fill in the technical details later. After working through some references and getting the basic syntax and data structures down, my attention turned to the scikit-learn package.
Recently I have been picking up Python on and off. Following my usual habit, I start from the application level to get up to speed quickly and fill in the technical details later. After working through some references and getting the basic syntax and data structures down, my attention turned to the scikit-learn package. It is a statistical learning package built on SciPy, its coverage of algorithms is very comprehensive, and, even better, its user manual is extremely well written. Unfortunately, the site is blocked for me (or not, or sometimes blocked and sometimes fine). I don't know how to get around the firewall (laugh if you must), so I resort to proxies and put up with endless pop-ups. I have therefore decided to keep a record and carry at least some of the user manual's content over to my blog, for convenient lookup. Hence this series.

Disclaimer: how to install Python, an IDE, the required modules, and so on is outside the scope of this series. These posts only attempt to record code examples that may come in handy.

1. Generalized Linear Models

"Generalized linear models" here means linear models and their simple extensions, including ridge regression, the lasso, LAR, logistic regression, the perceptron, and so on. Below we introduce the basic idea behind each of these models and how to implement them in Python.

1.1. Ordinary Least Squares

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])  # fit the model
clf.coef_  # retrieve the fitted coefficients

The code above shows the two main calls for fitting a linear regression and retrieving its parameters. The script below gives a more complete example.

Script:

print __doc__

import pylab as pl
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()  # load the dataset
diabetes_x = diabetes.data[:, np.newaxis]
diabetes_x_temp = diabetes_x[:, :, 2]  # keep only the third feature

diabetes_x_train = diabetes_x_temp[:-20]  # training samples
diabetes_x_test = diabetes_x_temp[-20:]   # test samples
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

regr = linear_model.LinearRegression()
regr.fit(diabetes_x_train, diabetes_y_train)

print 'Coefficients:\n', regr.coef_
print ("Residual sum of squares: %.2f"
       % np.mean((regr.predict(diabetes_x_test) - diabetes_y_test) ** 2))
print ("Variance score: %.2f" % regr.score(diabetes_x_test, diabetes_y_test))

pl.scatter(diabetes_x_test, diabetes_y_test, color='black')
pl.plot(diabetes_x_test, regr.predict(diabetes_x_test), color='blue', linewidth=3)
pl.xticks(())
pl.yticks(())
pl.show()

1.2. Ridge Regression

Ridge regression is a regularization method: it adds an L2-norm penalty to the loss function to control the complexity of the linear model, which makes the model more robust. Ridge regression is called as follows:

from sklearn import linear_model
clf = linear_model.Ridge(alpha=.5)
clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
clf.coef_

The script below plots the ridge coefficient estimates as a function of the penalty parameter:

print __doc__

import numpy as np
import pylab as pl
from sklearn import linear_model

# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

# Compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
clf = linear_model.Ridge(fit_intercept=False)  # create a ridge regression object

coefs = []
for a in alphas:  # loop: refit once for each alpha
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)  # collect the coefficients in coefs

# Display results
ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])

ax.plot(alphas, coefs)
ax.set_xscale('log')  # note this step: alpha is on a log scale
ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()

To set the regularization parameter by generalized cross-validation (GCV), use the following (note that the keyword is alphas, a list of candidate values):

clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
clf.alpha_
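To make the effect of the L2 penalty concrete, here is a minimal sketch (my own addition, not from the user guide) that computes the ridge solution in closed form, w = (X'X + alpha*I)^(-1) X'y, and checks it against Ridge on the same toy data, with fit_intercept=False so the two are directly comparable:

import numpy as np
from sklearn import linear_model

X = np.array([[0., 0.], [0., 0.], [1., 1.]])
y = np.array([0., .1, 1.])
alpha = 0.5

# Closed-form ridge solution: solve (X'X + alpha*I) w = X'y
w = np.linalg.solve(X.T.dot(X) + alpha * np.eye(X.shape[1]), X.T.dot(y))

clf = linear_model.Ridge(alpha=alpha, fit_intercept=False)
clf.fit(X, y)
print(w)          # closed-form coefficients
print(clf.coef_)  # should agree up to numerical precision

As alpha grows, the alpha*I term dominates and the coefficients shrink toward zero, which is exactly the behavior traced out by the path plot above.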
1.3. Lasso

The lasso differs from the ridge estimator in that its penalty is based on the L1 norm. It can therefore shrink coefficients exactly to 0 and thereby perform variable selection; it is a very popular variable selection method. There are two main algorithms for computing the lasso estimate. One is the coordinate descent used by the Lasso class introduced below; the other is least angle regression, covered later (one of the papers I admired most as a student is Efron's LARS paper; it was a complete revelation, and I recommend that everyone read it).

clf = linear_model.Lasso(alpha=0.1)
clf.fit([[0, 0], [1, 1]], [0, 1])
clf.predict([[1, 1]])

The script below compares the Lasso and the Elastic Net on recovering a sparse signal:

print __doc__

import numpy as np
import pylab as pl
from sklearn.metrics import r2_score

# generate some sparse data to play with
np.random.seed(42)
n_samples, n_features = 50, 200
X = np.random.randn(n_samples, n_features)
coef = 3 * np.random.randn(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)  # shuffle the feature indices
coef[inds[10:]] = 0  # sparsify coef: only 10 features are truly relevant
y = np.dot(X, coef)

# add noise
y += 0.01 * np.random.normal(size=(n_samples,))

# Split data into a train set and a test set
n_samples = X.shape[0]
X_train, y_train = X[:n_samples / 2], y[:n_samples / 2]
X_test, y_test = X[n_samples / 2:], y[n_samples / 2:]

# Lasso
from sklearn.linear_model import Lasso

alpha = 0.1
lasso = Lasso(alpha=alpha)  # a Lasso object
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)  # fit, then predict
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print lasso
print "r^2 on test data : %f" % r2_score_lasso

# ElasticNet
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=alpha, l1_ratio=0.7)
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, y_pred_enet)
print enet
print "r^2 on test data : %f" % r2_score_enet

pl.plot(enet.coef_, label='Elastic net coefficients')
pl.plot(lasso.coef_, label='Lasso coefficients')
pl.plot(coef, '--', label='original coefficients')
pl.legend(loc='best')
pl.title("Lasso R^2: %f, Elastic Net R^2: %f" % (r2_score_lasso, r2_score_enet))
pl.show()

1.3.1. Setting the regularization parameter

1.3.1.1. Using cross-validation

Two estimators set the parameter by cross-validation: LassoCV (based on coordinate descent) and LassoLarsCV (based on least angle regression); both appear in the comparison script below.

1.3.1.2. Using information criteria

AIC and BIC can be used instead. These criteria are cheaper to compute than cross-validation. However, using them presupposes a proper estimate of the model's degrees of freedom, and it assumes that our probability model is correct. In practice we often face exactly this difficulty, and we would rather compute something directly from the data than start from assumptions about a probability model.
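For reference, here are the standard definitions (my addition; the original does not spell them out), where \hat{L} is the maximized likelihood, k the effective number of parameters, and n the sample size:

\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln(n) - 2\ln\hat{L}

The candidate alpha whose fitted model minimizes the criterion is selected; since \ln(n) > 2 once n \ge 8, BIC penalizes model complexity more heavily than AIC on all but the smallest samples.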
The following script compares several ways of setting the regularization parameter: AIC, BIC, and cross-validation.

import time

import numpy as np
import pylab as pl

from sklearn.linear_model import LassoCV, LassoLarsCV, LassoLarsIC
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

rng = np.random.RandomState(42)
X = np.c_[X, rng.randn(X.shape[0], 14)]  # add some bad features

# normalize data as done by Lars to allow for comparison
X /= np.sqrt(np.sum(X ** 2, axis=0))

# LassoLarsIC: least angle regression with BIC/AIC criterion
model_bic = LassoLarsIC(criterion='bic')  # BIC criterion
t1 = time.time()
model_bic.fit(X, y)
t_bic = time.time() - t1
alpha_bic_ = model_bic.alpha_

model_aic = LassoLarsIC(criterion='aic')  # AIC criterion
model_aic.fit(X, y)
alpha_aic_ = model_aic.alpha_

def plot_ic_criterion(model, name, color):
    alpha_ = model.alpha_
    alphas_ = model.alphas_
    criterion_ = model.criterion_
    pl.plot(-np.log10(alphas_), criterion_, '--', color=color,
            linewidth=3, label='%s criterion' % name)
    pl.axvline(-np.log10(alpha_), color=color, linewidth=3,
               label='alpha: %s estimate' % name)
    pl.xlabel('-log(alpha)')
    pl.ylabel('criterion')

pl.figure()
plot_ic_criterion(model_aic, 'AIC', 'b')
plot_ic_criterion(model_bic, 'BIC', 'r')
pl.legend()
pl.title('Information-criterion for model selection (training time %.3fs)' % t_bic)

# LassoCV: coordinate descent

# Compute paths
print "Computing regularization path using the coordinate descent lasso..."
t1 = time.time()
model = LassoCV(cv=20).fit(X, y)  # create the object and fit
t_lasso_cv = time.time() - t1

# Display results
m_log_alphas = -np.log10(model.alphas_)

pl.figure()
ymin, ymax = 2300, 3800
pl.plot(m_log_alphas, model.mse_path_, ':')
pl.plot(m_log_alphas, model.mse_path_.mean(axis=-1), 'k',
        label='Average across the folds', linewidth=2)
pl.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
           label='alpha: CV estimate')
pl.legend()
pl.xlabel('-log(alpha)')
pl.ylabel('Mean square error')
pl.title('Mean square error on each fold: coordinate descent '
         '(train time: %.2fs)' % t_lasso_cv)
pl.axis('tight')
pl.ylim(ymin, ymax)

# LassoLarsCV: least angle regression

# Compute paths
print "Computing regularization path using the Lars lasso..."
t1 = time.time()
model = LassoLarsCV(cv=20).fit(X, y)
t_lasso_lars_cv = time.time() - t1

# Display results
m_log_alphas = -np.log10(model.cv_alphas_)

pl.figure()
pl.plot(m_log_alphas, model.cv_mse_path_, ':')
pl.plot(m_log_alphas, model.cv_mse_path_.mean(axis=-1), 'k',
        label='Average across the folds', linewidth=2)
pl.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
           label='alpha CV')
pl.legend()
pl.xlabel('-log(alpha)')
pl.ylabel('Mean square error')
pl.title('Mean square error on each fold: Lars (train time: %.2fs)'
         % t_lasso_lars_cv)
pl.axis('tight')
pl.ylim(ymin, ymax)

pl.show()

1.4. Elastic Net

The Elastic Net blends the Lasso with ridge regression: its penalty is a trade-off between the L1 norm and the L2 norm.
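As I understand scikit-learn's parameterization (with \rho standing for the l1_ratio argument used in the script below), the Elastic Net solves

\min_{w}\; \frac{1}{2n}\|y - Xw\|_2^2 + \alpha\rho\|w\|_1 + \frac{\alpha(1-\rho)}{2}\|w\|_2^2

so that \rho = 1 recovers the Lasso and \rho = 0 leaves a pure ridge-type penalty.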
The script below compares the regularization paths of the Lasso and the Elastic Net and plots them.

print __doc__

# Author: Alexandre Gramfort
# License: BSD Style.

import numpy as np
import pylab as pl

from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

X /= X.std(0)  # Standardize data (easier to set the l1_ratio parameter)

# Compute paths

eps = 5e-3  # the smaller it is the longer is the path

print "Computing regularization path using the lasso..."
models = lasso_path(X, y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])

print "Computing regularization path using the positive lasso..."
models = lasso_path(X, y, eps=eps, positive=True)  # lasso path with a positivity constraint
alphas_positive_lasso = np.array([model.alpha for model in models])
coefs_positive_lasso = np.array([model.coef_ for model in models])

print "Computing regularization path using the elastic net..."
models = enet_path(X, y, eps=eps, l1_ratio=0.8)
alphas_enet = np.array([model.alpha for model in models])
coefs_enet = np.array([model.coef_ for model in models])

print "Computing regularization path using the positive elastic net..."
models = enet_path(X, y, eps=eps, l1_ratio=0.8, positive=True)
alphas_positive_enet = np.array([model.alpha for model in models])
coefs_positive_enet = np.array([model.coef_ for model in models])

# Display results

pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_lasso)
l2 = pl.plot(coefs_enet, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Lasso and Elastic-Net Paths')
pl.legend((l1[-1], l2[-1]), ('Lasso', 'Elastic-Net'), loc='lower left')
pl.axis('tight')

pl.figure(2)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_lasso)
l2 = pl.plot(coefs_positive_lasso, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Lasso and positive Lasso')
pl.legend((l1[-1], l2[-1]), ('Lasso', 'positive Lasso'), loc='lower left')
pl.axis('tight')

pl.figure(3)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.plot(coefs_enet)
l2 = pl.plot(coefs_positive_enet, linestyle='--')
pl.xlabel('-Log(lambda)')
pl.ylabel('weights')
pl.title('Elastic-Net and positive Elastic-Net')
pl.legend((l1[-1], l2[-1]), ('Elastic-Net', 'positive Elastic-Net'),
          loc='lower left')
pl.axis('tight')
pl.show()

1.5. Multi-task Lasso

The multi-task Lasso is simply the Lasso for multiple outputs. The tricky part of extending the Lasso to multivariate regression lies in how to set the penalty: a mixed L1/L2 norm over the coefficient matrix, so that all tasks select the same set of features. The details are omitted here, but see the sketch below.
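Although the details are skipped, a minimal usage sketch may still help; it assumes the scikit-learn estimator MultiTaskLasso, which implements the mixed-norm penalty just mentioned:

import numpy as np
from sklearn.linear_model import MultiTaskLasso

# Two regression tasks sharing the same design matrix X.
rng = np.random.RandomState(0)
X = rng.randn(20, 5)
W = np.zeros((5, 2))
W[:2, :] = rng.randn(2, 2)  # only the first two features are active, in both tasks
Y = np.dot(X, W) + 0.01 * rng.randn(20, 2)

clf = MultiTaskLasso(alpha=0.1)
clf.fit(X, Y)     # Y is two-dimensional: one column per task
print(clf.coef_)  # shape (n_tasks, n_features); rows share one sparsity pattern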
1.6. Least Angle Regression

The script below computes the Lasso solution path with the LARS algorithm and plots it.

print __doc__

# Author: Fabian Pedregosa
#         Alexandre Gramfort
# License: BSD Style.

import numpy as np
import pylab as pl

from sklearn import linear_model
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

print "Computing regularization path using the LARS ..."
alphas, _, coefs = linear_model.lars_path(X, y, method='lasso', verbose=True)  # solution path from the LARS algorithm

xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

pl.plot(xx, coefs.T)
ymin, ymax = pl.ylim()
pl.vlines(xx, ymin, ymax, linestyle='dashed')
pl.xlabel('|coef| / max|coef|')
pl.ylabel('Coefficients')
pl.title('LASSO Path')
pl.axis('tight')
pl.show()

1.7. Logistic Regression

Logistic regression is a linear classifier, implemented by the LogisticRegression class. The script below uses it to compare L1 and L2 penalties on the digits data.

print __doc__

# Authors: Alexandre Gramfort
#          Mathieu Blondel
#          Andreas Mueller
# License: BSD Style.

import numpy as np
import pylab as pl

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()

X, y = digits.data, digits.target
X = StandardScaler().fit_transform(X)

# classify small against large digits
y = (y > 4).astype(np.int)

# Set regularization parameter
for i, C in enumerate(10. ** np.arange(1, 4)):
    # turn down tolerance for short training time
    clf_l1_LR = LogisticRegression(C=C, penalty='l1', tol=0.01)
    clf_l2_LR = LogisticRegression(C=C, penalty='l2', tol=0.01)
    clf_l1_LR.fit(X, y)
    clf_l2_LR.fit(X, y)

    coef_l1_LR = clf_l1_LR.coef_.ravel()
    coef_l2_LR = clf_l2_LR.coef_.ravel()

    # coef_l1_LR contains zeros due to the
    # L1 sparsity inducing norm
    sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
    sparsity_l2_LR = np.mean(coef_l2_LR == 0) * 100

    print "C=%d" % C
    print "Sparsity with L1 penalty: %.2f%%" % sparsity_l1_LR
    print "score with L1 penalty: %.4f" % clf_l1_LR.score(X, y)
    print "Sparsity with L2 penalty: %.2f%%" % sparsity_l2_LR
    print "score with L2 penalty: %.4f" % clf_l2_LR.score(X, y)

    l1_plot = pl.subplot(3, 2, 2 * i + 1)
    l2_plot = pl.subplot(3, 2, 2 * (i + 1))
    if i == 0:
        l1_plot.set_title("L1 penalty")
        l2_plot.set_title("L2 penalty")

    l1_plot.imshow(np.abs(coef_l1_LR.reshape(8, 8)), interpolation='nearest',
                   cmap='binary', vmax=1, vmin=0)
    l2_plot.imshow(np.abs(coef_l2_LR.reshape(8, 8)), interpolation='nearest',
                   cmap='binary', vmax=1, vmin=0)
    pl.text(-8, 3, "C = %d" % C)

    l1_plot.set_xticks(())
    l1_plot.set_yticks(())
    l2_plot.set_xticks(())
    l2_plot.set_yticks(())

pl.show()

Others

The official user guide also covers the perceptron, the Passive Aggressive algorithms, and more; those are omitted from this post.