
Machine Learning Prep --- 1. Simple Linear Regression (a Least Squares Example)


 

I. Summary

One-sentence summary:

1. In this example, least squares is used to measure the loss: because w and b have closed-form formulas here, we first solve for them and then evaluate the loss that this particular pair of w and b produces.
2. In the TensorFlow 2 example, by contrast, w and b are found by repeated trial steps, and each step tries to make the same least-squares loss function smaller.
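To illustrate the second point, here is a minimal sketch of what such an iterative search looks like: plain gradient descent on the same mean-squared-error loss, written in NumPy. The toy data, learning rate, and iteration count are hypothetical, not from the original post.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.02  # start from zero and step toward lower loss
for _ in range(5000):
    err = y - (w * x + b)       # residuals at the current w, b
    w += lr * np.mean(err * x)  # move w along the negative gradient
    b += lr * np.mean(err)      # move b along the negative gradient

print(w, b)  # converges close to w=2, b=1
```

Unlike the closed-form solution below, this only approaches the optimum gradually, which is why a short training run can look worse than the formula-based fit.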

 

 

1. Is the loss function just least squares?

Core line: total_cost += (y-w*x-b)**2
# The loss is a function of the coefficients w and b; the data's x and y are also passed in
def compute_cost(w, b, points):
    total_cost = 0
    M = len(points)
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        # prediction: y1 = w*x + b
        # least squares accumulates (y - y1)^2
        total_cost += (y - w*x - b)**2
        print("i={}, x={}, y={}, y-w*x-b={}, total_cost={}".format(i, x, y, y - w*x - b, total_cost))
    return total_cost/M  # a single / gives a float; // is floor division (integer result), e.g. 3 // 4 == 0
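For reference, the loop above can be collapsed into one vectorized NumPy expression. This is a sketch assuming `points` has the same column layout as `data` (column 1 = x, column 2 = y); the tiny DataFrame used to check it is hypothetical.

```python
import numpy as np
import pandas as pd

def compute_cost_vec(w, b, points):
    # Same MSE as the loop version, computed with array operations
    x = points.iloc[:, 1].to_numpy()
    y = points.iloc[:, 2].to_numpy()
    return np.mean((y - w * x - b) ** 2)

# Points lying exactly on y = 3x + 2 should give zero cost for w=3, b=2
df = pd.DataFrame({"idx": [1, 2, 3], "x": [0.0, 1.0, 2.0], "y": [2.0, 5.0, 8.0]})
print(compute_cost_vec(3, 2, df))  # → 0.0
```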

 

 

2. Fitting just means solving for w and b, and both have closed-form formulas?

Core line: w = sum_yx/(sum_x2-M*(x_bar**2))
# Define the core fitting function
# i.e., solve for w and b
def fit1(points):
    M = len(points)
    x_bar = np.mean(points.iloc[:, 1])
    y_bar = np.mean(points.iloc[:, 2])
    # print("x_bar={}".format(x_bar))
    sum_yx = 0
    sum_x2 = 0
    sum_delta = 0
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_yx += y*x - x_bar*y_bar
        sum_x2 += x**2
    # compute w from the closed-form formula
    w = sum_yx/(sum_x2 - M*(x_bar**2))

    # b is the total residual divided by the sample count

    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_delta += (y - w*x)
    b = sum_delta / M
    return w, b
# 4. Test
points = data
w, b = fit1(points)
print("w is :", w)
print("b is :", b)
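The closed-form result can be cross-checked against `np.polyfit`, which fits the same degree-1 least-squares line. The sketch below re-states `fit1` in self-contained form and runs it on hypothetical toy data (the `idx`/`x`/`y` column names are made up for the check):

```python
import numpy as np
import pandas as pd

def fit1(points):
    # Closed-form least squares: w = (Σxy − M·x̄·ȳ) / (Σx² − M·x̄²)
    M = len(points)
    x_bar = np.mean(points.iloc[:, 1])
    y_bar = np.mean(points.iloc[:, 2])
    sum_yx = sum_x2 = 0
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_yx += y*x - x_bar*y_bar
        sum_x2 += x**2
    w = sum_yx / (sum_x2 - M*(x_bar**2))
    b = np.mean(points.iloc[:, 2] - w*points.iloc[:, 1])  # b = Σ(y − w·x) / M
    return w, b

df = pd.DataFrame({"idx": range(4), "x": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 7.8]})
w, b = fit1(df)
w_ref, b_ref = np.polyfit(df["x"], df["y"], 1)  # reference slope and intercept
print(w, b)  # agrees with np.polyfit up to floating-point rounding
```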

 

 

 

II. Simple Linear Regression (a Least Squares Example)

Video location of the course this blog post accompanies:

 

# 0. Import dependencies
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
# 1. Load the data
data = pd.read_csv('../dataset/income.csv')
data
Out[2]:
   Unnamed: 0   Education     Income
0 1 10.000000 26.658839
1 2 10.401338 27.306435
2 3 10.842809 22.132410
3 4 11.244147 21.169841
4 5 11.645449 15.192634
5 6 12.086957 26.398951
6 7 12.048829 17.435307
7 8 12.889632 25.507885
8 9 13.290970 36.884595
9 10 13.732441 39.666109
10 11 14.133779 34.396281
11 12 14.635117 41.497994
12 13 14.978589 44.981575
13 14 15.377926 47.039595
14 15 15.779264 48.252578
15 16 16.220736 57.034251
16 17 16.622074 51.490919
17 18 17.023411 51.336621
18 19 17.464883 57.681998
19 20 17.866221 68.553714
20 21 18.267559 64.310925
21 22 18.709030 68.959009
22 23 19.110368 74.614639
23 24 19.511706 71.867195
24 25 19.913043 76.098135
25 26 20.354515 75.775216
26 27 20.755853 72.486055
27 28 21.167191 77.355021
28 29 21.598662 72.118790
29 30 22.000000 80.260571
In [3]:
print(data.iloc[0,0])
print(data.iloc[1,1])
print(data.iloc[2,2])
print(len(data))
1
10.401338
22.13241
30
In [4]:
%matplotlib inline
plt.scatter(data.Education,data.Income)
Out[4]:
<matplotlib.collections.PathCollection at 0x15282d68508>
In [5]:
# 2. Define the loss function
# The loss is a function of the coefficients w and b; the data's x and y are also passed in
def compute_cost(w, b, points):
    total_cost = 0
    M = len(points)
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        # prediction: y1 = w*x + b
        # least squares accumulates (y - y1)^2
        total_cost += (y - w*x - b)**2
        print("i={}, x={}, y={}, y-w*x-b={}, total_cost={}".format(i, x, y, y - w*x - b, total_cost))
    return total_cost/M  # a single / gives a float; // is floor division (integer result), e.g. 3 // 4 == 0
In [23]:
# Test the loss function
compute_cost(1,1,data)
i=0, x=10.0, y=26.658839, y-w*x-b=15.658839, total_cost=245.19923882792102
i=1, x=10.401338, y=27.306434999999997, y-w*x-b=15.905096999999998, total_cost=498.17134940732996
i=2, x=10.842808999999999, y=22.13241, y-w*x-b=10.289601000000001, total_cost=604.047238146531
i=3, x=11.244147, y=21.169841, y-w*x-b=8.925694000000002, total_cost=683.715251528167
i=4, x=11.645449000000001, y=15.192634, y-w*x-b=2.547184999999999, total_cost=690.2034029523919
i=5, x=12.086957, y=26.398951, y-w*x-b=13.311994, total_cost=867.4125872084279
i=6, x=12.048829, y=17.435307, y-w*x-b=4.386478000000002, total_cost=886.6537764529119
i=7, x=12.889632, y=25.507885, y-w*x-b=11.618253000000001, total_cost=1021.637579224921
i=8, x=13.290970000000002, y=36.884595000000004, y-w*x-b=22.593625000000003, total_cost=1532.109469865546
i=9, x=13.732441, y=39.666109000000006, y-w*x-b=24.933668000000004, total_cost=2153.79726979977
i=10, x=14.133779, y=34.396281, y-w*x-b=19.262502, total_cost=2524.8412530997743
i=11, x=14.635117000000001, y=41.497994, y-w*x-b=25.862876999999997, total_cost=3193.729659816903
i=12, x=14.978589000000001, y=44.981575, y-w*x-b=29.002986, total_cost=4034.9028567330993
i=13, x=15.377926, y=47.039595, y-w*x-b=30.661668999999996, total_cost=4975.04080259866
i=14, x=15.779264000000001, y=48.252578, y-w*x-b=31.473314000000002, total_cost=5965.610296741256
i=15, x=16.220736, y=57.034251, y-w*x-b=39.813514999999995, total_cost=7550.726273396481
i=16, x=16.622073999999998, y=51.490919, y-w*x-b=33.868845, total_cost=8697.824935030505
i=17, x=17.023411, y=51.336621, y-w*x-b=33.31321, total_cost=9807.594895534605
i=18, x=17.464883, y=57.681998, y-w*x-b=39.217115, total_cost=11345.57700445783
i=19, x=17.866221, y=68.553714, y-w*x-b=49.687493, total_cost=13814.423965082879
i=20, x=18.267559, y=64.310925, y-w*x-b=45.043366, total_cost=15843.328785692835
i=21, x=18.70903, y=68.959009, y-w*x-b=49.249978999999996, total_cost=18268.889217193275
i=22, x=19.110367999999998, y=74.614639, y-w*x-b=54.504271, total_cost=21239.604774434716
i=23, x=19.511706, y=71.867195, y-w*x-b=51.35548899999999, total_cost=23876.991024863837
i=24, x=19.913043, y=76.098135, y-w*x-b=55.185092, total_cost=26922.3854039123
i=25, x=20.354515, y=75.775216, y-w*x-b=54.420701, total_cost=29883.998101243702
i=26, x=20.755853, y=72.48605500000001, y-w*x-b=50.730202000000006, total_cost=32457.551496204505
i=27, x=21.167191, y=77.355021, y-w*x-b=55.18782999999999, total_cost=35503.24807631341
i=28, x=21.598662, y=72.11879, y-w*x-b=49.520128, total_cost=37955.491153449795
i=29, x=22.0, y=80.260571, y-w*x-b=57.260571, total_cost=41234.26414469584
Out[23]:
1374.4754714898613

Use the second formula to compute b̂ (b-hat)
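The "second formula" b̂ = ȳ − ŵ·x̄ and the averaged-residual form used in the code, b = Σ(y − w·x)/M, are algebraically the same thing, since Σ(y − w·x)/M = ȳ − w·x̄. A quick numeric check on hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
w = 2.0  # any slope works; the identity holds for every w

b_residual = np.mean(y - w * x)           # b = Σ(y − w·x) / M
b_formula = np.mean(y) - w * np.mean(x)   # b̂ = ȳ − w·x̄
print(b_residual, b_formula)  # equal up to floating-point rounding
```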

In [6]:
# 3. Define the core fitting algorithm
# First, a mean function. Question: can the mean be computed directly with np.mean(data)?
# def average(data):
#     sum = 0
#     num = len(data)
#     for i in range(num):
#         sum += data[i]
#     return sum/num
# print(average(x))
# print(np.mean(x))
# The printed results are identical, so the two are interchangeable.

# Define the core fitting function
# i.e., solve for w and b
def fit(points):
    M = len(points)
    x_bar = np.mean(points.iloc[:, 1])
    # print("x_bar={}".format(x_bar))
    sum_yx = 0
    sum_x2 = 0
    sum_delta = 0
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_yx += y*(x - x_bar)
        # equivalent to sum_yx += y*x - x_bar*y_bar,
        # because Σᵢ x_bar*yᵢ = x_bar * Σᵢ yᵢ = M * x_bar * y_bar
        sum_x2 += x**2
    # compute w from the closed-form formula
    w = sum_yx/(sum_x2 - M*(x_bar**2))

    # b is the total residual divided by the sample count

    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_delta += (y - w*x)
    b = sum_delta / M
    return w, b
In [7]:
# Define the core fitting function
# i.e., solve for w and b
def fit1(points):
    M = len(points)
    x_bar = np.mean(points.iloc[:, 1])
    y_bar = np.mean(points.iloc[:, 2])
    # print("x_bar={}".format(x_bar))
    sum_yx = 0
    sum_x2 = 0
    sum_delta = 0
    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_yx += y*x - x_bar*y_bar
        sum_x2 += x**2
    # compute w from the closed-form formula
    w = sum_yx/(sum_x2 - M*(x_bar**2))

    # b is the total residual divided by the sample count

    for i in range(M):
        x = points.iloc[i, 1]
        y = points.iloc[i, 2]
        sum_delta += (y - w*x)
    b = sum_delta / M
    return w, b
# 4. Test
points = data
w, b = fit1(points)
print("w is :", w)
print("b is :", b)
w is : 5.564068237721681
b is : -39.14888093981615
In [8]:
# 4.测试
points = data
w,b =fit(points)
print ("w is :",w)
print ("b is :",b)
cost = compute_cost(w,b,points)
print("cost is :" ,cost)
w is : 5.564068237721692
b is : -39.14888093981633
i=0, x=10.0, y=26.658839, y-w*x-b=10.167037562599411, total_cost=103.36865279930737
i=1, x=10.401338, y=27.306434999999997, y-w*x-b=8.581561544208657, total_cost=177.01185133634823
i=2, x=10.842808999999999, y=22.13241, y-w*x-b=0.9511617752334374, total_cost=177.91656005901345
i=3, x=11.244147, y=21.169841, y-w*x-b=-2.244479243157322, total_cost=182.95424713197752
i=4, x=11.645449000000001, y=15.192634, y-w*x-b=-10.454557955091524, total_cost=292.25202916834496
i=5, x=12.086957, y=26.398951, y-w*x-b=-1.704821594591543, total_cost=295.1584458377306
i=6, x=12.048829, y=17.435307, y-w*x-b=-10.45631880082368, total_cost=404.49304870218936
i=7, x=12.889632, y=25.507885, y-w*x-b=-7.062026067304792, total_cost=454.3652608774818
i=8, x=13.290970000000002, y=36.884595000000004, y-w*x-b=2.081611914304453, total_cost=458.698369039256
i=9, x=13.732441, y=39.666109000000006, y-w*x-b=2.406751145329231, total_cost=464.4908201147996
i=10, x=14.133779, y=34.396281, y-w*x-b=-5.09614887306153, total_cost=490.4615534512059
i=11, x=14.635117000000001, y=41.497994, y-w*x-b=-0.783914715224455, total_cost=491.07607573195133
i=12, x=14.978589000000001, y=44.981575, y-w*x-b=0.7885646390288059, total_cost=491.69790992187797
i=13, x=15.377926, y=47.039595, y-w*x-b=0.6246463211817428, total_cost=492.08809294844383
i=14, x=15.779264000000001, y=48.252578, y-w*x-b=-0.39544269720901326, total_cost=492.24446787521975
i=15, x=16.220736, y=57.034251, y-w*x-b=5.929849969747529, total_cost=527.4075885389345
i=16, x=16.622073999999998, y=51.490919, y-w*x-b=-1.8465540486432133, total_cost=530.8173503934952
i=17, x=17.023411, y=51.336621, y-w*x-b=-4.233918502965736, total_cost=548.7434162832508
i=18, x=17.464883, y=57.681998, y-w*x-b=-0.3449218360092061, total_cost=548.8623873562068
i=19, x=17.866221, y=68.553714, y-w*x-b=8.29372214560005, total_cost=617.6482143846235
i=20, x=18.267559, y=64.310925, y-w*x-b=1.817861127209305, total_cost=620.9528334624422
i=21, x=18.70903, y=68.959009, y-w*x-b=4.009570358234065, total_cost=637.0294879200715
i=22, x=19.110367999999998, y=74.614639, y-w*x-b=7.432128339843324, total_cost=692.2660195799738
i=23, x=19.511706, y=71.867195, y-w*x-b=2.451612321452565, total_cost=698.2764225546719
i=24, x=19.913043, y=76.098135, y-w*x-b=4.449485867130072, total_cost=718.0743470364621
i=25, x=20.354515, y=75.775216, y-w*x-b=1.6701865340865893, total_cost=720.8638700951062
i=26, x=20.755853, y=72.48605500000001, y-w*x-b=-3.8520464843041466, total_cost=735.7021322123462
i=27, x=21.167191, y=77.355021, y-w*x-b=-1.2717931850721342, total_cost=737.3195901179421
i=28, x=21.598662, y=72.11879, y-w*x-b=-8.908758271670145, total_cost=816.6855640609933
i=29, x=22.0, y=80.260571, y-w*x-b=-3.0000492900608933, total_cost=825.6858598037882
cost is : 27.522861993459607
In [26]:
x = data.Education
y = data.Income
plt.scatter(x,y)
pred_y= w*x+b
plt.plot(x,pred_y,c='r')
Out[26]:
[<matplotlib.lines.Line2D at 0x1e2c64637c8>]

Why does linear regression done with TensorFlow 2 not fit as well as this example?

Because this example computes w and b directly from the closed-form formulas, while TensorFlow 2 searches for the values step by step.

posted @ 范仁义  Views(241)  Comments(0)