【Machine Learning】Logistic Regression
1. Gradient Ascent¶
In [1]:
import numpy as np
from matplotlib import pyplot as plt
1.1 The sigmoid function: a function that converges toward 0 and 1 at either end, used as the classifier function for logistic regression¶
The input to the sigmoid function, denoted $z$, is given by the formula

$z = w^T x = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n$

that is, the two vectors are multiplied element by element and the products are summed. The vector $x$ is the classifier's input data, and $w$ is the vector of best-fit parameters (coefficients) we want to find, so that the classifier is as accurate as possible. The sigmoid itself is

$\sigma(z) = \frac{1}{1 + e^{-z}}$
In [2]:
def sigmoid(intX):
    return 1 / (1 + np.exp(-intX))

def plotSigmoid():
    x = np.arange(-20.0, 20.0, 0.1)
    y = sigmoid(x)
    plt.plot(x, y)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

plotSigmoid()
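The sigmoid above is fine for this notebook's data range, but np.exp(-intX) overflows for large negative inputs and NumPy emits a RuntimeWarning. A minimal numerically stable sketch is shown below; the helper name stable_sigmoid is ours, not part of the original notebook:

def stable_sigmoid(z):
    # Evaluate the logistic function without exponentiating a large positive
    # number: both branches only ever call np.exp on non-positive values.
    z = np.asarray(z, dtype=float)
    pos = 1.0 / (1.0 + np.exp(-np.maximum(z, 0)))  # used where z >= 0
    ez = np.exp(np.minimum(z, 0))                  # safe: exponent <= 0
    return np.where(z >= 0, pos, ez / (1.0 + ez))  # e^z/(1+e^z) where z < 0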
1.2 Loading the Data¶
In [3]:
def loadDataSet():
    dataMat = []; labelMat = []
    with open('testSet.txt') as fr:
        for line in fr.readlines():
            lineArr = line.strip().split()
            # prepend a constant 1.0 so that weights[0] acts as the intercept
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

dataMat, labelMat = loadDataSet()
1.3 Gradient Ascent¶
The gradient ascent update rule is

$w := w + \alpha \nabla_w f(w)$

where $\alpha$ is the step size. The update is applied repeatedly until some stopping condition is reached, for example the iteration count hits a preset limit or the error falls within an acceptable tolerance.
In [4]:
def gradAscent(dataMatIn, classLabels):
    # convert the inputs to NumPy matrices
    dataMatrix = np.mat(dataMatIn)
    labelMatrix = np.mat(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    alpha = 0.001    # step size
    maxCycles = 500  # number of iterations
    weights = np.ones((n, 1))  # initialize all coefficients to 1
    # iterate
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # column vector of predictions
        error = labelMatrix - h            # prediction error
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
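Why does error = labelMatrix - h implement a gradient? For the log-likelihood of logistic regression (a standard derivation, which the original notebook does not spell out):

$\ell(w) = \sum_{i=1}^{m} \big[\, y_i \log \sigma(x_i^T w) + (1 - y_i) \log\big(1 - \sigma(x_i^T w)\big) \,\big]$

$\nabla_w \ell(w) = X^T \big( y - \sigma(Xw) \big)$

so dataMatrix.transpose() * error is exactly $\nabla_w \ell(w)$, and each pass of the loop performs the update $w := w + \alpha \nabla_w \ell(w)$ from 1.3.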
1.4 Testing Gradient Ascent¶
In [5]:
data, labels = loadDataSet()
res = gradAscent(data, labels)
print(res)
print(type(res))
1.5 Plotting the Decision Boundary¶
In [6]:
def plotBestFit(dataMat, labelMat, wei):
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    xcord1 = []; ycord1 = []  # points labelled 1
    xcord2 = []; ycord2 = []  # points labelled 0
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(xcord1, ycord1, c='red', marker='s')  # class-1 points: red squares
    ax.scatter(xcord2, ycord2, c='green')            # class-0 points: green dots
    x = np.arange(-3.0, 3.0, 0.1)
    # on the boundary w0 + w1*x1 + w2*x2 = 0, so x2 = (-w0 - w1*x1) / w2
    y = (-wei[0] - wei[1] * x) / wei[2]
    ax.plot(x, y)  # draw the decision boundary
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

plotBestFit(data, labels, res.getA())  # getA() turns the weights matrix into an ndarray
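The line drawn above is the decision boundary. Thresholding at probability 0.5 corresponds to $\sigma(z) \ge 0.5 \iff z \ge 0$, so the boundary is the set of points where the sigmoid's argument vanishes:

$w_0 + w_1 x_1 + w_2 x_2 = 0 \;\Longrightarrow\; x_2 = \frac{-w_0 - w_1 x_1}{w_2}$

which is exactly the expression assigned to y inside plotBestFit.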
2. Stochastic Gradient Ascent¶
2.1 Implementation and Testing¶
Gradient ascent must sweep the entire data set every time it updates the regression coefficients, so the computational cost becomes prohibitive on large data sets. Stochastic gradient ascent instead updates the weights using one sample at a time.
In [7]:
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        # one sample at a time: h and error are scalars, not vectors
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights

res = stocGradAscent0(np.array(data), labels)
print(res)
plotBestFit(np.array(data), labels, res)
2.2 Improving Stochastic Gradient Ascent¶
Stochastic gradient ascent reduces the matrix computation of 1.3 to scalar arithmetic, but each pass still visits every sample, and the step size is a fixed value. We can improve on this by shrinking the step size as the iterations progress and by drawing the training samples at random.
2.2.1 The Improved Algorithm¶
In [8]:
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            # shrink the step size over the iterations; the constant 0.01
            # keeps alpha from ever reaching 0
            alpha = 4 / (1.0 + j + i) + 0.01
            # draw a sample at random from those not yet used in this pass
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[sample] * weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del(dataIndex[randIndex])
    return weights
2.2.2 Testing the Algorithm¶
In [9]:
# use the default number of iterations, 150 (results vary from run to run
# because samples are drawn at random)
res = stocGradAscent1(np.array(data), labels)
print(res)
plotBestFit(np.array(data), labels, res)
In [10]:
# increase the number of iterations to 200
res = stocGradAscent1(np.array(data), labels, 200)
print(res)
plotBestFit(np.array(data), labels, res)
3. Predicting the Mortality of Horses with Colic¶
3.1 Handling Missing Values in the Data¶
Common approaches (a minimal imputation sketch follows this list):
Fill missing values with the mean of the feature's available values;
Fill missing values with a special value, such as -1;
Discard samples that have missing values;
Fill missing values with the mean of similar samples;
Predict the missing values with another machine learning algorithm.
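As a minimal sketch of the first strategy (mean imputation), assuming the raw data sits in a NumPy array with missing entries stored as np.nan; the fillWithMean helper is ours and purely illustrative, since the horse-colic files below come already preprocessed:

def fillWithMean(dataArr):
    # Replace every NaN with the mean of the observed values in its column.
    dataArr = np.array(dataArr, dtype=float)
    colMeans = np.nanmean(dataArr, axis=0)  # per-feature mean, ignoring NaNs
    nanPos = np.isnan(dataArr)              # boolean mask of missing entries
    dataArr[nanPos] = np.take(colMeans, np.where(nanPos)[1])
    return dataArr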
3.2 Loading the Data¶
In [11]:
def loadDataSet(path):
    dataMat = []; labelMat = []
    with open(path) as fr:
        for line in fr.readlines():
            currentLine = line.strip().split('\t')
            lineArr = []
            length = len(currentLine)
            # all columns except the last are features; the last is the label
            for i in range(length - 1):
                lineArr.append(float(currentLine[i]))
            dataMat.append(lineArr)
            labelMat.append(float(currentLine[length - 1]))
    return dataMat, labelMat

trainData, trainLabel = loadDataSet('horseColicTraining.txt')
testData, testLabel = loadDataSet('horseColicTest.txt')
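A quick shape check can catch column-count surprises early. The comments assume the Machine Learning in Action horse-colic files (21 feature columns plus a label per row); verify against your own copies:

# sanity check on the loaded data (expected sizes are assumptions, see above)
print(len(trainData), len(trainData[0]))  # training samples, features per sample
print(len(testData), len(testData[0]))    # test samples, features per sample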
3.3 The Classification Function¶
In [12]:
def classifyVector(intX, weights):
    # classify as 1 if the predicted probability exceeds 0.5
    prob = sigmoid(np.sum(intX * weights))
    return 1.0 if prob > 0.5 else 0.0

def stocGradAscentClassify(trainDataMat, trainLabelMat, testDataMat, numIter=200):
    # train with the improved stochastic gradient ascent, then classify the test set
    weights = stocGradAscent1(np.array(trainDataMat), trainLabelMat, numIter)
    predicts = []
    for i in range(len(testDataMat)):
        predicts.append(classifyVector(testDataMat[i], weights))
    return predicts
res = stocGradAscentClassify(trainData, trainLabel, testData, 200)
print(res)
3.4 Comparing Predictions and Computing the Error Rate¶
In [13]:
def errorRates(testLabelMat, predictMat):
    errorNum = 0
    total = len(predictMat)
    for i in range(total):
        if testLabelMat[i] != predictMat[i]:
            errorNum += 1
    return float(errorNum) / total
print(errorRates(testLabel, res))
3.5 Varying the Number of Iterations¶
In [14]:
def multiTest():
    trainData, trainLabel = loadDataSet('horseColicTraining.txt')
    testData, testLabel = loadDataSet('horseColicTest.txt')
    # try 100, 200, ..., 900 iterations and report the error rate for each
    for i in range(1, 10):
        numIter = i * 100
        predicts = stocGradAscentClassify(trainData, trainLabel, testData, numIter)
        print('Iter number: %d, Error rate: %f' % (numIter, errorRates(testLabel, predicts)))

multiTest()
You are welcome to visit my personal blog:
https://yeyeck.com