Linear Regression
Linear regression is one of the most basic methods in machine learning. There is plenty of theory behind it, which I won't go into here; this post looks at the algorithm from the perspective of using the scikit-learn library.
First, let's randomly generate a dataset and add some random noise to it.
import numpy as np
# Note: in recent scikit-learn versions train_test_split lives in sklearn.model_selection.
from sklearn.cross_validation import train_test_split

def f(x):
    return np.sin(2 * np.pi * x)

x_plot = np.linspace(0, 1, 100)

n_samples = 100
X = np.random.uniform(0, 1, size=n_samples)[:, np.newaxis]
y = f(X) + np.random.normal(scale=0.3, size=n_samples)[:, np.newaxis]  # add random noise to the dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
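The plotting snippets below call a plot_approximation helper that the post never defines, and they assume matplotlib has been imported. A minimal sketch of such a helper, assuming it draws the true function, the noisy training points, and the fitted curve on [0, 1] (the exact styling is hypothetical), might look like this:

import matplotlib.pyplot as plt

def plot_approximation(est, ax, label=None, xlabel=None):
    # Draw the ground-truth sine curve, the training points, and the model's prediction.
    ax.plot(x_plot, f(x_plot), color='green', label='ground truth')
    ax.scatter(X_train.ravel(), y_train.ravel(), s=10, alpha=0.5)
    ax.plot(x_plot, est.predict(x_plot[:, np.newaxis]), color='red', label=label)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    if xlabel is not None:
        ax.set_xlabel(xlabel)
    if label is not None:
        ax.legend(loc='upper right', fontsize='small')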
First, fit the data without any regularization term:
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit polynomials of degree 0..9 without regularization and record train/test error.
fig, axes = plt.subplots(5, 2, figsize=(8, 5))
train_error = np.empty(10)
test_error = np.empty(10)

for ax, degree in zip(axes.ravel(), range(10)):
    est = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    est.fit(X_train, y_train)
    train_error[degree] = mean_squared_error(y_train, est.predict(X_train))
    test_error[degree] = mean_squared_error(y_test, est.predict(X_test))
    plot_approximation(est, ax, label='degree=%d' % degree)
plt.show()

plt.plot(np.arange(10), train_error, color='green', label='train')
plt.plot(np.arange(10), test_error, color='red', label='test')
plt.ylim(0.0, 1e0)
plt.ylabel('mean squared error')
plt.xlabel('degree')
plt.legend(loc='upper left')
plt.show()
The resulting train/test errors:
Once the maximum polynomial degree exceeds 6, the training error stays small but the test error becomes very large: the model is overfitting. Next, let's add an L2 regularization term:
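For reference, scikit-learn's Ridge fits ordinary least squares with an L2 penalty, minimizing ||y - Xw||^2_2 + alpha * ||w||^2_2, so a larger alpha shrinks the coefficients more strongly.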
from sklearn.linear_model import Ridge

alphas = [0.0, 1e-8, 1e-5, 1e-1]
degree = 9
fig, ax_rows = plt.subplots(3, 4, figsize=(8, 5))
for degree, ax_row in zip(range(7, 10), ax_rows):
    for alpha, ax in zip(alphas, ax_row):
        est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
        est.fit(X_train, y_train)
        plot_approximation(est, ax, xlabel='degree=%d alpha=%r' % (degree, alpha))
# plt.tight_layout()
plt.show()
Let's look more closely at how different values of alpha affect the polynomial coefficients:
def plot_coefficients(est, ax, label=None, yscale='log'):
    # Plot the absolute values of the fitted polynomial coefficients.
    coef = est.steps[-1][1].coef_.ravel()
    if yscale == 'log':
        ax.semilogy(np.abs(coef), marker='o', label=label)
        ax.set_ylim((1e-1, 1e8))
    else:
        ax.plot(np.abs(coef), marker='o', label=label)
    ax.set_ylabel('abs(coefficient)')
    ax.set_xlabel('coefficients')
    ax.set_xlim((1, 9))

fig, ax_rows = plt.subplots(4, 2, figsize=(8, 5))
alphas = [0.0, 1e-8, 1e-5, 1e-1]
for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Ridge(alpha=%r) coefficients' % alpha)

plt.show()
The larger alpha is, the smaller the coefficients become and the smoother the fitted curve gets. Ridge adds an L2 penalty; to add an L1 penalty instead, use Lasso:
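For reference, scikit-learn's Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1; the L1 penalty tends to drive some coefficients exactly to zero, producing sparse models.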
from sklearn.linear_model import Lasso

fig, ax_rows = plt.subplots(2, 2, figsize=(8, 5))

degree = 9
alphas = [1e-3, 1e-2]

for alpha, ax_row in zip(alphas, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r' % alpha)
    plot_coefficients(est, ax_right, label='Lasso(alpha=%r) coefficients' % alpha, yscale=None)

plt.tight_layout()
plt.show()
Besides these two, scikit-learn also supports combining the L1 and L2 penalties, using ElasticNet for training:
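For reference, ElasticNet minimizes (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2, so alpha controls the overall penalty strength and l1_ratio controls the mix between the L1 and L2 terms.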
from sklearn.linear_model import ElasticNet

fig, ax_rows = plt.subplots(8, 2, figsize=(8, 5))
alphas = [1e-2, 1e-2, 1e-2, 1e-3, 1e-3, 1e-3, 1e-4, 1e-4]
ratios = [0.05, 0.85, 0.50, 0.05, 0.85, 0.50, 0.03, 0.95]
for alpha, ratio, ax_row in zip(alphas, ratios, ax_rows):
    ax_left, ax_right = ax_row
    est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=alpha, l1_ratio=ratio))
    est.fit(X_train, y_train)
    plot_approximation(est, ax_left, label='alpha=%r ratio=%r' % (alpha, ratio))
    plot_coefficients(est, ax_right, label='ElasticNet(alpha=%r ratio=%r) coefficients' % (alpha, ratio), yscale=None)

plt.show()
When alpha is held fixed, the shape of the fitted curve does not change much; alpha constrains the magnitude of the parameters, and the smaller alpha is, the larger the range the parameters can take, just as with pure L2 or L1 regularization. The l1_ratio determines how the parameter values are distributed: when the ratio is large, the parameters are relatively sparse (only a few have large values while the rest are close to zero); when the ratio is small, the differences between parameters are smaller and their values are spread more evenly.
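To check the sparsity claim directly, something along the following lines can be run (a sketch reusing X_train, degree, and the imports from the snippets above; the particular alpha and ratios are only illustrative):

for ratio in (0.05, 0.95):
    est = make_pipeline(PolynomialFeatures(degree), ElasticNet(alpha=1e-3, l1_ratio=ratio))
    est.fit(X_train, y_train)
    coef = est.steps[-1][1].coef_
    # Count how many polynomial coefficients were driven (almost) to zero.
    print("l1_ratio=%.2f: %d of %d coefficients are near zero"
          % (ratio, np.sum(np.abs(coef) < 1e-6), coef.size))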
Dataset 1: Test Scores for General Psychology
Each record is a 4-tuple <x1, x2, x3, x4>, where x1, x2, x3 are the scores on the first three exams and x4 is the final score. The task is to predict x4 from (x1, x2, x3). The dataset contains 25 records in total, the first row being a header. Below we compare simple linear regression models trained with no regularization, with L2 regularization, and with L1 regularization.
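The readData function below expects a tab-separated file of integers whose first line is a header, roughly like the following (the header names here are hypothetical; the score rows are taken from the test output further down):

EXAM1	EXAM2	EXAM3	FINAL
70	73	78	148
78	75	68	147
93	89	96	192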
# -*-encoding:utf-8-*-
'''
Created on
author: dstarer
copyright: dstarer
'''

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from plot import *  # the author's own plotting helpers (provides plt and plotScatter2D)


def readData(filename, ignoreFirstLine=True, separator='\t'):
    # Read a tab-separated file of integers, optionally skipping the header line.
    dataSet = []
    fp = open(filename, "r")
    if ignoreFirstLine:
        fp.readline()
    for line in fp.readlines():
        elements = map(int, line.strip().split(separator))
        dataSet.append(elements)
    fp.close()
    return dataSet


def Print(message, train_error, test_error, coef):
    print "%s--------------" % message
    print "train error: %.3f" % train_error
    print "test error: %.3f" % test_error
    print coef
    print "sum of coef: ", np.sum(coef)


def process(X, y, show=True):
    # Train plain, Ridge and Lasso regression on one random split and return the test errors.
    error = np.empty(3)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    est = LinearRegression()
    est.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, est.predict(X_train))
    test_error = mean_squared_error(y_test, est.predict(X_test))
    error[0] = test_error
    if show:
        Print(message="train without regularization", train_error=train_error, test_error=test_error, coef=est.coef_)

    ridge = Ridge()
    ridge.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, ridge.predict(X_train))
    test_error = mean_squared_error(y_test, ridge.predict(X_test))
    error[1] = test_error
    if show:
        Print(message="train using L2 regularization", train_error=train_error, test_error=test_error, coef=ridge.coef_)

    lasso = Lasso()
    lasso.fit(X_train, y_train)
    train_error = mean_squared_error(y_train, lasso.predict(X_train))
    test_error = mean_squared_error(y_test, lasso.predict(X_test))
    error[2] = test_error
    if show:
        Print(message="train using L1 regularization", train_error=train_error, test_error=test_error, coef=lasso.coef_)

    if show:
        print "Data ------------"
        print "[x1 x2 x3 ] [y] [\t est \t] [\t ridge \t] [\t lasso \t]"
        for X, y, est_v, ridge_v, lasso_v in zip(X_test, y_test, est.predict(X_test), ridge.predict(X_test), lasso.predict(X_test)):
            print X, y, est_v, ridge_v, lasso_v

    return error


def error_estimate(X, y):
    # Average the test errors of the three models over 20 random splits.
    error = np.zeros(3)
    Iters = 20
    for i in range(Iters):
        tmp = process(X, y, show=False)
        error = error + tmp
    error /= Iters
    print "normal error: %.3f" % error[0]
    print "L2 error: %.3f" % error[1]
    print "L1 error: %.3f" % error[2]


def extract_data(filename):
    dataset = np.mat(readData(filename))

    y = dataset[:, -1]
    X = dataset[:, :-1]

    process(X, y, show=True)

    # original data set
    print "original data set:"
    error_estimate(X, y)

    print "using the first two dimensions"
    X = dataset[:, :-2]
    error_estimate(X, y)

    print "use the first and third dimensions"
    X = dataset[:, ::2]
    error_estimate(X, y)

    print "only use the third dimension"
    X = dataset[:, 2]
    error_estimate(X, y)

    print "use the second and third dimensions"
    X = dataset[:, 1:-1]
    error_estimate(X, y)

    # plot each feature against the target
    ax = plt.gca()
    X1 = dataset[:, 0]
    X2 = dataset[:, 1]
    X3 = dataset[:, 2]
    plotScatter2D(ax=ax, X=X1, y=y, color="red")
    plotScatter2D(ax=ax, X=X2, y=y, color="blue")
    plotScatter2D(ax=ax, X=X3, y=y, color="green")
    plt.show()


if '__main__' == __name__:
    extract_data("E:\\dataset\\mldata\\test_score.csv")
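The script imports the author's own plot module, which is not shown in the post; a minimal sketch of a plotScatter2D helper consistent with how it is called might be:

import matplotlib.pyplot as plt
import numpy as np

def plotScatter2D(ax, X, y, color="red"):
    # Scatter one feature column against the target on the given axes.
    ax.scatter(np.asarray(X).ravel(), np.asarray(y).ravel(), color=color, s=20)

Running extract_data on the score dataset produced the following output: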
train without regularization--------------
train error: 2.457
test error: 16.153
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
train using L2 regularization--------------
train error: 2.457
test error: 16.159
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
train using L1 regularization--------------
train error: 2.466
test error: 16.038
[[ 0.39285405  0.5157764   1.16694498]]
sum of coef:  2.07557543476
Data ------------
[x1 x2 x3 ] [y] [	 est 	] [	 ridge 	] [	 lasso 	]
[70 73 78] [148] [ 150.36924252] [ 150.36061654] 150.447200166
[78 75 68] [147] [ 142.87417786] [ 142.89774266] 142.910175633
[93 89 96] [192] [ 188.66231778] [ 188.65674942] 188.514072928
[93 88 93] [185] [ 184.64570642] [ 184.64589888] 184.502840202
[47 56 60] [115] [ 111.56039086] [ 111.54876013] 111.860472713
[87 79 90] [175] [ 174.14575956] [ 174.14218324] 174.036875299
[78 83 85] [175] [ 166.83845382] [ 166.82925063] 166.853488692
original data set:
normal error: 9.255
L2 error: 11.200
L1 error: 12.574
using the first two dimensions
normal error: 63.057
L2 error: 64.947
L1 error: 66.151
use the first and third dimensions
normal error: 23.051
L2 error: 23.057
L1 error: 23.230
only use the third dimension
normal error: 39.893
L2 error: 39.890
L1 error: 39.899
use the second and third dimensions
normal error: 12.268
L2 error: 12.265
L1 error: 12.260
The above shows the results of a few runs. To get a more concrete picture of how well the linear regression performs, I repeated the test 20 times, randomly re-splitting the data each time, and plotted the test errors, as shown in the figure below:
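The plotting code for these curves is not included in the post; a minimal sketch, reusing process() from the script above and assuming X and y hold the full feature matrix and target as in extract_data (the colors for the regularized models are my choice), might be:

# Collect the test error of each model over 20 random splits and plot one curve per model.
runs = 20
errors = np.array([process(X, y, show=False) for _ in range(runs)])
plt.plot(errors[:, 0], color='red', label='no regularization')
plt.plot(errors[:, 1], color='green', label='L2 (Ridge)')
plt.plot(errors[:, 2], color='blue', label='L1 (Lasso)')
plt.xlabel('run')
plt.ylabel('test mean squared error')
plt.legend(loc='upper right')
plt.show()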
The red line is the result without regularization; the test error varies quite a bit depending on the split. Using all three features, the overall results are reasonable, but is there another choice of features that does better? To find out, I tested every pair of the three features. Those results are already listed above; to be a bit more rigorous, each case was again run 20 times with random splits, giving the error curves below:
The first plot uses (x1, x2), the second (x1, x3), and the third (x2, x3). It is easy to see that using only (x2, x3) performs almost the same as using (x1, x2, x3). We printed the learned coefficients earlier, and they show that a3 > a2 > a1 and a3 > a2 + a1; in other words, x3 is the most important feature. That is why the test error is large when only x1 and x2 are used, drops once x3 is included, and with x2 and x3 together the main structure of the data is already captured, so the result is essentially the same as with (x1, x2, x3). To check how important x3 is on its own, I used only x3, with the following result:
Comparing x3 alone against (x1, x2) together, x3 actually does better. My testing methodology here may be questionable, but on average it still supports the conclusion above. Finally, let's look at the distribution of each (xi, y):
Overall, x3 is distributed relatively compactly, while x1 and x2 are more scattered, with larger fluctuations. To be continued with the next dataset...