Simple Linear Regression



The idea is simple; it is the foundation of many powerful nonlinear models; it is highly interpretable; and it is backed by solid mathematical theory.

Its essence: find the straight line that best fits the relationship between the sample feature and the sample output label.


It is called *simple* linear regression because there is only one sample feature. Suppose we have found the best-fit line $y = ax + b$. For each sample point $x^{(i)}$, the line gives a prediction $\hat{y}^{(i)} = a x^{(i)} + b$, while the true value is $y^{(i)}$. We want the prediction to be as close to the truth as possible, i.e. the difference $y^{(i)} - \hat{y}^{(i)}$ to be as small as possible. But this difference can be negative, so we square it; the squared difference also has the advantage of being differentiable everywhere.

Goal: make $\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}$ as small as possible.

Goal: find $a$ and $b$ such that $\sum_{i=1}^{m}\left(y^{(i)} - a x^{(i)} - b\right)^{2}$ is as small as possible.

In machine learning, an objective like this is called a loss function (or, when we maximize instead, a utility function). We analyze the problem, determine a loss or utility function, optimize it, and thereby obtain a machine learning model.

In fact, the objective above is a textbook least squares problem: minimize the sum of squared errors.


I. The Least Squares Method

Goal: find $a$ and $b$ that make the following loss as small as possible:

$$J(a, b) = \sum_{i=1}^{m}\left(y^{(i)} - a x^{(i)} - b\right)^{2}$$

Take the partial derivatives of $J(a, b)$ with respect to $a$ and $b$ and set them to zero. Start with $b$, whose derivation is simpler.

$$\frac{\partial J(a, b)}{\partial b} = -2\sum_{i=1}^{m}\left(y^{(i)} - a x^{(i)} - b\right) = 0 \quad\Longrightarrow\quad b = \bar{y} - a\bar{x}$$

Next, take the partial derivative with respect to $a$.

$$\frac{\partial J(a, b)}{\partial a} = -2\sum_{i=1}^{m}\left(y^{(i)} - a x^{(i)} - b\right)x^{(i)} = 0$$

Substituting $b = \bar{y} - a\bar{x}$ and solving for $a$:

$$a = \frac{\sum_{i=1}^{m} x^{(i)} y^{(i)} - m\,\bar{x}\,\bar{y}}{\sum_{i=1}^{m} \left(x^{(i)}\right)^{2} - m\,\bar{x}^{2}}$$

At this point the derivation is actually complete: $a$ and $b$ each have a closed-form formula, and we could start coding. In practice, though, the formula for $a$ is usually simplified one step further.

$$a = \frac{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)^{2}}$$

Writing $a$ in this form makes it easy to compute with vectorized operations, which shortens the running time substantially (the benchmark in section IV shows roughly an 8x speedup).

II. Implementing the Simple Linear Regression Algorithm

First, prepare some simple data.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()

x_mean = np.mean(x)
y_mean = np.mean(y)

# Accumulate the numerator and denominator of the closed-form formula for a
up = 0.0
down = 0.0
for x_i, y_i in zip(x, y):
    up += (x_i - x_mean) * (y_i - y_mean)
    down += (x_i - x_mean) ** 2

a = up / down
b = y_mean - a * x_mean

This gives us $a$ and $b$, and hence the line $y = ax + b$.

y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, 'r')
plt.axis([0, 6, 0, 6])
plt.show()

# Predict on a new input
x_new = 6
y_predict = a * x_new + b

Output: 5.2
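
As a quick cross-check (my addition, not part of the original post), numpy's `polyfit` solves the same least-squares problem and should recover the same coefficients:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])

# np.polyfit of degree 1 fits the same least-squares line;
# it returns the coefficients highest degree first, i.e. [a, b]
a, b = np.polyfit(x, y, 1)
print(a, b)  # approximately 0.8 and 0.4, matching the hand-rolled formulas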

III. An Object-Oriented Implementation

import numpy as np

class SimpleLinearRegreesion1(object):

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # Numerator and denominator of the closed-form formula for a
        up = 0.0
        down = 0.0
        for x_i, y_i in zip(x_train, y_train):
            up += (x_i - x_mean) * (y_i - y_mean)
            down += (x_i - x_mean) ** 2

        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"


if __name__ == '__main__':

    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])
    x_new = 6

    reg1 = SimpleLinearRegreesion1()
    reg1.fit(x, y)
    y_predict = reg1.predict(np.array([x_new]))
    print(y_predict)
    print(reg1.a_)
    print(reg1.b_)

IV. Vectorized Computation

import numpy as np

class SimpleLinearRegreesion2(object):

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # Vectorized form: the numerator and denominator become dot products
        up = (x_train - x_mean).dot(y_train - y_mean)
        down = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"

Now benchmark the two implementations:

if __name__ == '__main__':
    import time

    m = 100000
    x = np.random.random(size=m)
    y = x * 2.0 + 3.0 + np.random.normal(0, 1, size=m)

    # Loop-based implementation
    start1 = time.time()
    reg1 = SimpleLinearRegreesion1()
    reg1.fit(x, y)
    stop1 = time.time()
    print("Elapsed time: %s" % (stop1 - start1))

    # Vectorized implementation
    start2 = time.time()
    reg2 = SimpleLinearRegreesion2()
    reg2.fit(x, y)
    stop2 = time.time()
    print("Elapsed time: %s" % (stop2 - start2))

Output:

Elapsed time: 0.07222509384155273
Elapsed time: 0.008754968643188477
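
A single `time.time()` measurement is noisy. As an aside (my addition, not in the original post), the standard `timeit` module averages over repeated runs and gives more stable numbers:

import timeit

# reg1, reg2, x, y come from the benchmark above; average each fit over 10 runs
t1 = timeit.timeit(lambda: reg1.fit(x, y), number=10) / 10
t2 = timeit.timeit(lambda: reg2.fit(x, y), number=10) / 10
print("loop: %.5fs  vectorized: %.5fs  speedup: %.1fx" % (t1, t2, t1 / t2))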

V. Evaluating a Linear Regression Model

1. Mean Squared Error (MSE)

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}$$

2. Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}$$

This way, the RMSE has the same units as $y$.

3. Mean Absolute Error (MAE)

$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$

We did not choose this as the objective function at the outset because the absolute value function is not differentiable everywhere, which makes it unsuitable as a loss function. As an evaluation metric, however, it works quite well.

Next, let's run simple linear regression on the Boston housing data and evaluate the model with the three metrics above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

boston = datasets.load_boston()
print(boston.DESCR)
boston.feature_names
x = boston.data[:, 5]   # use only the average number of rooms (RM) feature
y = boston.target       # Boston house prices
plt.scatter(x, y)
plt.show()

(Figure: scatter plot of RM versus price; note the horizontal row of points along the top edge.)

The plot reveals a row of outliers along the top edge, something we have to account for during data processing. It is likely an artifact of data collection: the survey imposed an upper limit, so every price above $50,000 was recorded as $50,000. This is exactly why data preprocessing is so important.

# Drop the points capped at the maximum price
x = x[y < np.max(y)]
y = y[y < np.max(y)]
plt.scatter(x, y)
plt.show()

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
x_train.shape
x_test.shape

%run D:/goodgoodstudy/linear_regression/simple_linear_regression.py
reg = SimpleLinearRegreesion2()
reg.fit(x_train, y_train)
reg.a_
reg.b_
plt.scatter(x_train, y_train)
plt.plot(x_train, reg.predict(x_train), 'r')
plt.show()

Now let's compute the three metrics on the test set:

y_predict = reg.predict(x_test)
mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
mse_test

Output: 28.215949368640796

from math import sqrt

rmse_test = sqrt(mse_test)
rmse_test

Output: 5.311868726600912

mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
mae_test

Output: 3.9489046062737843

Now wrap the three metrics in reusable functions:

import numpy as np
from math import sqrt

def mean_square_error(y_true, y_predict):
    assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"

    return np.sum((y_predict - y_true)**2) / len(y_true)

def root_mean_square_error(y_true, y_predict):
    return sqrt(mean_square_error(y_true, y_predict))

def mean_absolute_error(y_true, y_predict):
    assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"

    return np.sum(np.absolute(y_predict - y_true)) / len(y_true)

In fact, sklearn already provides these functions, so why write them ourselves? Because just calling libraries is not enough; we should understand how the underlying code is implemented. Let's check that sklearn's functions and our own produce the same output.

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

mean_absolute_error(y_test, y_predict)
mean_squared_error(y_test, y_predict)
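
One caveat (my addition): the `from sklearn.metrics import mean_absolute_error` line shadows our own function of the same name. A sketch that imports sklearn's versions under aliases instead, assuming a session where our three functions are defined, so the two implementations can be compared side by side:

from sklearn.metrics import mean_squared_error as sk_mse
from sklearn.metrics import mean_absolute_error as sk_mae

# Each pair should print identical values
print(sk_mse(y_test, y_predict), mean_square_error(y_test, y_predict))
print(sk_mae(y_test, y_predict), mean_absolute_error(y_test, y_predict))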

VI. The Best Metric for Linear Regression

R squared

We have already covered three metrics for evaluating linear regression, so why look for a "best" one? Think about classification: accuracy lies between 0 (all wrong) and 1 (all correct), so a single number immediately tells us roughly how good the classifier is. The three metrics above, by contrast, carry the units of $y$: they tell us there is error, but not whether that error counts as large or small. So we want a metric whose range is normalized to roughly 0 to 1.

$$R^{2} = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i=1}^{m}\left(\bar{y} - y^{(i)}\right)^{2}}$$

If the model's prediction error is zero or very small, the second term is close to 0 and $R^2 \approx 1$: a very good model. If the prediction error is about as large as the error from always predicting the mean, the second term is close to 1 and $R^2 \approx 0$: a very poor model. If the model does even worse than predicting the mean, then $R^2 < 0$ and the model is pointless; we might as well predict the mean directly. **If $R^2$ is less than 0, it is very likely that the data has no linear relationship at all.** $R^2$ can never exceed 1, and we call the predict-the-mean model the Baseline Model: the bottom line that requires essentially no learning, just simple statistics.

Dividing the numerator and denominator by $m$:

$$R^{2} = 1 - \frac{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left(\bar{y} - y^{(i)}\right)^{2}} = 1 - \frac{MSE(\hat{y}, y)}{\mathrm{Var}(y)}$$

This form reveals the statistical meaning behind $R^2$: one minus the ratio of the model's MSE to the variance of $y$. Now let's implement it.

def r2_score(y_true, y_predict):
    return 1 - mean_square_error(y_true, y_predict) / np.var(y_true)

In sklearn:

from sklearn.metrics import r2_score

r2_score(y_test, y_predict)
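
A quick sanity check of the Baseline Model claim above (my addition): predicting the mean for every sample makes the MSE exactly equal to the variance of $y$, so $R^2$ comes out exactly 0.

import numpy as np

y = np.array([1., 3., 2., 3., 5.])
baseline = np.full_like(y, y.mean())         # always predict the mean
mse_baseline = np.mean((baseline - y) ** 2)  # equals np.var(y) by definition
print(1 - mse_baseline / np.var(y))          # 0.0: the Baseline Model scores R^2 = 0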

Finally, let's add this as a `score` method to our simple linear regression class.

import numpy as np
from .metrics import r2_score

class SimpleLinearRegreesion1(object):

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # Numerator and denominator of the closed-form formula for a
        up = 0.0
        down = 0.0
        for x_i, y_i in zip(x_train, y_train):
            up += (x_i - x_mean) * (y_i - y_mean)
            down += (x_i - x_mean) ** 2

        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def __repr__(self):
        return "SimpleLinearRegression"


class SimpleLinearRegreesion2(object):

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"

        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)

        # Vectorized form: dot products instead of a Python loop
        up = (x_train - x_mean).dot(y_train - y_mean)
        down = (x_train - x_mean).dot(x_train - x_mean)

        self.a_ = up / down
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Predict a whole vector of inputs at once."""
        assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
        assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
        return np.array([self._predict(x) for x in x_predict])

    def _predict(self, x_single):
        """Predict a single input."""
        return self.a_ * x_single + self.b_

    def score(self, x_test, y_test):
        """Return the R^2 of the model on the test data."""
        y_predict = self.predict(x_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "SimpleLinearRegression"

And that's it for simple linear regression! The idea itself is simple, but the mathematics and statistics behind it are well worth studying and thinking over.

Postscript


This post's recommendation: a site that gathers thousands of tools in one place.

http://tool.uixsj.cn/


Keep it up! Another day full of energy!
