【Machine Learning】Logistic 回归

 

logistic

 

 

 
Logistic 回归

 

1. 梯度上升算法

In [1]:
import numpy as np
from matplotlib import pyplot as plt
 

1.1 sigmoid 函数: 一个能向两边收敛的函数, 用作Logistic回归的分类器函数

Sigmoid函数的输入记为z,由下面公式得出: image.png 它表示将这两个数值向量对应元素相乘然后 全部加起来即得到z值。其中的向量x是分类器的输入数据,向量w也就是我们要找到的最佳参数 (系数),从而使得分类器尽可能地精确。

In [2]:
def sigmoid(intX):
    return 1 / (1 + np.exp(-intX))

def plotSigmoid():
    x = np.arange(-20.0, 20.0, 0.1)
    y = sigmoid(x)
    plt.plot(x, y)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
    
plotSigmoid()
 
 

1.2 读取数据

In [3]:
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat
dataMat, labelMat = loadDataSet()
 

1.3 梯度上升算法

梯度上升算法的迭代公式如下: image.png 该公式将一直被迭代执行,直至达到某个停止条件为止,比如迭代次数达到某个指定值或算 法达到某个可以允许的误差范围。

In [4]:
def gradAscent(dataMatIn, classLabels):
    # 将数组转为矩阵
    dataMatrix = np.mat(dataMatIn)
    labelMatrix = np.mat(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    alpha = 0.001    # 步长
    maxCycles = 500    # 迭代次数
    weights = np.ones((n, 1))    # 将系数都初始化为 1
    # 进行迭代
    for k in range(500):
        h = sigmoid(dataMatrix * weights)
        error = labelMatrix - h
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
 

1.4 测试梯度上升算法

In [5]:
data, labels = loadDataSet()
res = gradAscent(data, labels)
print(res)
print(type(res))
 
[[ 4.12414349]
 [ 0.48007329]
 [-0.6168482 ]]
<class 'numpy.matrix'>
 

1.5 绘制分类边界

In [6]:
def plotBestFit(dataMat, labelMat, wei):
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    xcord1 = []; ycord1 = []    # 存储分类为 1 的点
    xcord2 = []; ycord2 = []    # 存储分类为 0 的点
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(xcord1, ycord1, c='red', marker='s')    # 绘制分类为 1 的点 红色方块
    ax.scatter(xcord2, ycord2, c='green')    # 绘制分类为 0 的点 绿色圆点
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-wei[0] - wei[1] * x)/ wei[2]
    ax.plot(x, y)    # 绘制分类边界
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

plotBestFit(data, labels, res.getA())
 
 

2 随机梯度上升算法

 

2.1 算法实现和测试

梯度上升算法每次更新回归系数时都要遍历整个数据集,当数据量太大时,计算复杂度过高。

In [7]:
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

res = stocGradAscent0(np.array(data), labels)
print(res)
plotBestFit(np.array(data), labels, res)
 
[ 1.01702007  0.85914348 -0.36579921]
 
 

2.2 随机梯度上升算法优化

随机梯度上升算法将 1.3 中的算法从矩阵计算简化为数值计算, 但是迭代时仍需要遍历每个样本,而且步长是固定值,在此基础上,我们可以在迭代过程中改变步长,并且随机抽取部分样本进行训练。

 
2.2.1 优化算法
In [8]:
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in dataIndex:
            alpha = 4 / (1.0 + j + i) + 0.01    # 改变步长
            randIndex = int(np.random.uniform(0, len(dataIndex)))    # 随机抽取样本
            h = sigmoid(np.sum(dataMatrix[randIndex] * weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del(dataIndex[randIndex])
    return weights
 
2.2.2 算法测试
In [9]:
# 使用默认的迭代次数 150
res = stocGradAscent1(np.array(data), labels)
print(res)
plotBestFit(np.array(data), labels, res)
 
[13.93026891  0.85427134 -1.73867894]
 
In [10]:
# 更改迭代次数为 200
res = stocGradAscent1(np.array(data), labels, 200)
print(res)
plotBestFit(np.array(data), labels, res)
 
[13.23760085  1.10206704 -1.74231449]
 
 

3. 从疝气病症预测病马的死亡率

 

3.1 处理数据中的缺失值

常用方法:
     使用可用特征的均值来填补缺失值;
     使用特殊值来填补缺失值,如 1;
     忽略有缺失值的样本;
     使用相似样本的均值添补缺失值;
     使用另外的机器学习算法预测缺失值。
 

3.2 读取数据

In [11]:
def loadDataSet(path):
    dataMat = []; labelMat = []
    fr = open(path)
    for line in fr.readlines():
        currentLine = line.strip().split('\t')
        lineArr = []
        length = len(currentLine)
        for i in range(length - 1):
            lineArr.append(float(currentLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(currentLine[length - 1]))
    return dataMat, labelMat

trainData, trainLabel = loadDataSet('horseColicTraining.txt')
testData, testLabel = loadDataSet('horseColicTest.txt')
 

3.2 分类函数

In [12]:
def classifyVector(intX, weights):
    prob = sigmoid(np.sum(intX * weights))
    return 1.0 if prob > 0.5 else 0.0

def stocGradAscentClassify(trainDataMat, trainLabelMat, testDataMat, numIter=200):
    weights = stocGradAscent1(np.array(trainDataMat), trainLabelMat, numIter)
    predicts = []
    for i in range(len(testDataMat)):
        predicts.append(classifyVector(testDataMat[i], weights))
    return predicts

res = stocGradAscentClassify(trainData, trainLabel, testData, 200)
print(res)
 
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: overflow encountered in exp
  
 
[1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
 

3.3 对比结果,计算错误率

In [13]:
def errorRates(testLabelMat, predictMat):
    errorNum = 0
    total = len(predictMat)
    for i in range(total):
        if testLabel[i] != predictMat[i]:
            errorNum += 1
    return float(errorNum) / total
print(errorRates(testLabel, res))
 
0.43283582089552236
 

3.4 更改迭代次数,测试算法

In [14]:
def multiTest():
    trainData, trainLabel = loadDataSet('horseColicTraining.txt')
    testData, testLabel = loadDataSet('horseColicTest.txt')
    for i in range(1,10):
        numIter = i * 100
        predicts = stocGradAscentClassify(trainData, trainLabel, testData, numIter)
        print('Iter number: %d, Error rate: %f' % (numIter, errorRates(testLabel, predicts)))

multiTest()
    
 
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: overflow encountered in exp
  
 
Iter number: 100, Error rate: 0.507463
Iter number: 200, Error rate: 0.388060
Iter number: 300, Error rate: 0.522388
Iter number: 400, Error rate: 0.328358
Iter number: 500, Error rate: 0.328358
Iter number: 600, Error rate: 0.298507
Iter number: 700, Error rate: 0.283582
Iter number: 800, Error rate: 0.283582
Iter number: 900, Error rate: 0.298507
In [ ]:
 
posted @ 2020-04-14 15:45  早起的虫儿去吃鸟  阅读(203)  评论(0编辑  收藏  举报