Simple Linear Regression
The idea is simple, yet it underlies many powerful nonlinear models, offers excellent interpretability, and is backed by solid mathematical theory.
Its essence: find a straight line that best fits the relationship between the sample feature and the sample output label.
It is called *simple* linear regression because there is only one feature. Suppose we have found the best-fitting line $y = ax + b$. For each sample point $x^{(i)}$, the line predicts $\hat{y}^{(i)} = ax^{(i)} + b$, while the true value is $y^{(i)}$. We want the prediction to be as close to the truth as possible, i.e. we want $y^{(i)} - \hat{y}^{(i)}$ to be small; since this difference can be negative, we square it. The squared function also has the advantage of being differentiable everywhere.
Goal: make $\sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$ as small as possible.
Substituting $\hat{y}^{(i)} = ax^{(i)} + b$, the goal becomes: find $a$ and $b$ such that $\sum_{i=1}^{m} \left(y^{(i)} - ax^{(i)} - b\right)^2$ is as small as possible.
In machine learning, an objective of this kind is called a loss function (or, when it is something to maximize, a utility function). We analyze the problem to determine a loss or utility function, then optimize it to obtain the machine learning model.
The objective above is in fact the most classic least-squares problem: minimizing the sum of squared errors.
1. Least Squares
Goal: find $a$ and $b$ such that $\sum_{i=1}^{m} \left(y^{(i)} - ax^{(i)} - b\right)^2$ is as small as possible.
Take the partial derivatives with respect to $a$ and $b$ and set each to zero. Start with $b$, whose derivation is simpler.
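Writing the loss as $L(a, b) = \sum_{i=1}^{m} \left(y^{(i)} - ax^{(i)} - b\right)^2$, setting $\partial L / \partial b = 0$ gives

$$b = \bar{y} - a\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the sample means.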
Next, take the partial derivative with respect to $a$ and substitute the expression for $b$.
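Setting $\partial L / \partial a = 0$ and substituting $b = \bar{y} - a\bar{x}$ yields

$$a = \frac{\sum_{i=1}^{m} x^{(i)} y^{(i)} - m\,\bar{x}\bar{y}}{\sum_{i=1}^{m} \left(x^{(i)}\right)^2 - m\,\bar{x}^2}$$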
At this point the derivation is actually complete: $a$ and $b$ each have a closed-form formula, and we could go straight to code. In practice, though, we usually simplify the expression for $a$ one step further.
The simplified form of $a$ lends itself to vectorized operations, which shorten the program's running time considerably (roughly an order of magnitude in the benchmark below).
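Since $\sum_{i=1}^{m} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y}) = \sum_{i=1}^{m} x^{(i)} y^{(i)} - m\,\bar{x}\bar{y}$ (and similarly for the denominator), $a$ can be rewritten as a ratio of dot products over centered vectors:

$$a = \frac{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^2} = \frac{(\mathbf{x} - \bar{x}) \cdot (\mathbf{y} - \bar{y})}{(\mathbf{x} - \bar{x}) \cdot (\mathbf{x} - \bar{x})}$$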
2. Implementing the Simple Linear Regression Algorithm
Prepare some simple data.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
x_mean = np.mean(x)
y_mean = np.mean(y)
# accumulate the numerator and denominator of a with a plain loop
up = 0
down = 0
for x_i, y_i in zip(x, y):
    up += (x_i - x_mean) * (y_i - y_mean)
    down += (x_i - x_mean) ** 2
a = up / down
b = y_mean - a * x_mean
This gives us $a$ and $b$, and therefore the line $y = ax + b$.
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, 'r')
plt.axis([0, 6, 0, 6])
plt.show()
# now test on a new data point
x_new = 6
y_predict = a * x_new + b
Output: 5.2
3. An Object-Oriented Implementation
import numpy as np
class SimpleLinearRegreesion1(object):
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # accumulate the numerator and denominator of a with a plain loop
        up = 0
        down = 0
        for x_i, y_i in zip(x_train, y_train):
            up += (x_i - x_mean) * (y_i - y_mean)
            down += (x_i - x_mean) ** 2
        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return [self._predict(x) for x in x_predict]

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"
if __name__ == '__main__':
    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])
    x_new = 6
    reg1 = SimpleLinearRegreesion1()
    reg1.fit(x, y)
    y_predict = reg1.predict(np.array([x_new]))
    print(y_predict)
    print(reg1.a_)
    print(reg1.b_)
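For this toy dataset the numbers can be checked by hand: $\bar{x} = 3$ and $\bar{y} = 2.8$, so $a = 8/10 = 0.8$ and $b = 2.8 - 0.8 \times 3 = 0.4$ (up to floating-point rounding), and the prediction for x_new = 6 is $0.8 \times 6 + 0.4 = 5.2$, matching the earlier result.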
4. Vectorized Computation
import numpy as np
class SimpleLinearRegreesion2(object):
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # replace the Python loop with dot products over centered vectors
        up = (x_train - x_mean).dot(y_train - y_mean)
        down = (x_train - x_mean).dot(x_train - x_mean)
        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return [self._predict(x) for x in x_predict]

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"
Next, benchmark the two implementations:
if __name__ == '__main__':
    import time

    # generate a large synthetic dataset: y = 2x + 3 plus Gaussian noise
    m = 100000
    x = np.random.random(size=m)
    y = x * 2.0 + 3.0 + np.random.normal(0, 1, size=m)

    start1 = time.time()
    reg1 = SimpleLinearRegreesion1()
    reg1.fit(x, y)
    stop1 = time.time()
    print("Elapsed time: %s" % (stop1 - start1))

    start2 = time.time()
    reg2 = SimpleLinearRegreesion2()
    reg2.fit(x, y)
    stop2 = time.time()
    print("Elapsed time: %s" % (stop2 - start2))
Output:
Elapsed time: 0.07222509384155273
Elapsed time: 0.008754968643188477
5. Evaluating the Linear Regression Algorithm
1. Mean Squared Error (MSE)
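Measured on a test set of size $m$, MSE is the average squared prediction error:

$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^2$$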
2. Root Mean Squared Error (RMSE)
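RMSE is simply the square root of MSE:

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^2}$$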
Taking the square root puts RMSE in the same units as $y$ (MSE is in the squared units of $y$).
3. Mean Absolute Error (MAE)
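MAE is the average absolute prediction error:

$$MAE = \frac{1}{m} \sum_{i=1}^{m} \left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|$$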
We did not choose this as the objective function at the outset because the absolute value is not differentiable everywhere, which makes it awkward to use as a loss function; as an evaluation metric, however, it is perfectly fine.
Next, let's run simple linear regression on the Boston housing data and evaluate the model with the three metrics above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
boston = datasets.load_boston()
print(boston.DESCR)
boston.feature_names
x = boston.data[:, 5]  # use only the number-of-rooms feature
y = boston.target  # Boston house prices
plt.scatter(x, y)
plt.show()
The plot reveals a row of points pinned at the very top, which we need to account for when handling the data: most likely the survey imposed a cap, recording every price above $50,000 as exactly $50,000. This is why data preprocessing is so important; here we simply drop the capped points.
x = x[y < np.max(y)]  # keep only points below the cap
y = y[y < np.max(y)]
plt.scatter(x, y)
plt.show()
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
x_train.shape
x_test.shape
%run D:/goodgoodstudy/linear_regression/simple_linear_regression.py
reg = SimpleLinearRegreesion2()
reg.fit(x_train, y_train)
reg.a_
reg.b_
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), 'r')
plt.show()
Now compute the three evaluation metrics:
y_predict = reg.predict(x_test)
mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
mse_test
Output: 28.215949368640796
from math import sqrt
rmse_test = sqrt(mse_test)
rmse_test
Output: 5.311868726600912
mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
mae_test
Output: 3.9489046062737843
Let's wrap these three metrics as reusable functions:
import numpy as np
from math import sqrt
def mean_square_error(y_true, y_predict):
    """MSE: the mean of the squared errors."""
    assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"
    return np.sum((y_predict - y_true) ** 2) / len(y_true)

def root_mean_square_error(y_true, y_predict):
    """RMSE: the square root of the MSE."""
    return sqrt(mean_square_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    """MAE: the mean of the absolute errors."""
    assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"
    return np.sum(np.absolute(y_predict - y_true)) / len(y_true)
In fact, sklearn already provides these metrics, so why write them ourselves? Because we should not only know how to call libraries; we should also understand how the underlying code is implemented. Let's check that sklearn's functions give the same output as our own.
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
mean_absolute_error(y_test, y_predict)
mean_squared_error(y_test, y_predict)
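As a quick sanity check, we can assert that the two agree. This is a minimal sketch, assuming our own implementations live in a separate module (hypothetically named metrics here) so their names do not collide with the sklearn imports above:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

import metrics  # hypothetical module holding our own implementations

# Both implementations should agree up to floating-point rounding.
assert np.isclose(metrics.mean_square_error(y_test, y_predict),
                  mean_squared_error(y_test, y_predict))
assert np.isclose(metrics.mean_absolute_error(y_test, y_predict),
                  mean_absolute_error(y_test, y_predict))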
6. The Best Metric for Linear Regression: R Squared
We already have three evaluation metrics, so why do we need a "best" one? Think about classification: accuracy lies between 0 and 1, with 1 meaning perfectly correct and 0 meaning completely wrong, so the number alone tells us roughly how well the classifier performs. With the three metrics above we only know there is some error, not how large it is relative to the problem, because their values depend on the scale of $y$. So we look for a way to normalize the error onto a comparable 0-to-1 scale.
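$R^2$ compares the model's squared error against that of a baseline which always predicts the mean $\bar{y}$:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2}{\sum_{i=1}^{m} \left(\bar{y} - y^{(i)}\right)^2} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}$$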
If the model's prediction error is zero or very small, the second term is close to 0 and $R^2$ is close to 1: a very good model. If the model's error is roughly as large as the error from always predicting the mean, the second term is close to 1 and $R^2$ is close to 0: a very poor model. If the model's error is even larger than the mean-prediction error, there is little point in using the model at all, and $R^2$ becomes negative; **an $R^2$ below 0 very likely means the data has no linear relationship to begin with**. Hence $R^2$ never exceeds 1, and we call the always-predict-the-mean model the Baseline Model: the bottom line, obtainable from simple statistics with essentially no learning.
The second equality above shows that $R^2$ also has real statistical meaning: it is one minus the ratio of the model's MSE to the variance of $y$. Let's implement it in code.
def r2_score(y_true, y_predict):
    """R^2: one minus the ratio of MSE to the variance of y_true."""
    return 1 - mean_square_error(y_true, y_predict) / np.var(y_true)
In sklearn:
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
Finally, let's add this score to our encapsulated simple linear regression classes.
import numpy as np
from .metrics import r2_score

class SimpleLinearRegreesion1(object):
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # accumulate the numerator and denominator of a with a plain loop
        up = 0
        down = 0
        for x_i, y_i in zip(x_train, y_train):
            up += (x_i - x_mean) * (y_i - y_mean)
            down += (x_i - x_mean) ** 2
        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return [self._predict(x) for x in x_predict]

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"

class SimpleLinearRegreesion2(object):
    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        # vectorized computation of a via dot products over centered vectors
        up = (x_train - x_mean).dot(y_train - y_mean)
        down = (x_train - x_mean).dot(x_train - x_mean)
        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return [self._predict(x) for x in x_predict]

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def score(self, x_test, y_test):
        """R^2 of the model on a test set."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegression"
And that is it for simple linear regression! The idea is simple, but the mathematics and statistics behind it are well worth studying and thinking through.
Keep it up! Today is another energetic day!