Logistic Regression
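One day someone asked me what logistic regression is. I have used it often over years of data analysis, but only in the sense of `import some *`, without thinking about it more deeply. Unfortunately, the mathematical derivations of logistic regression that I found online were all wrong, including those in several classic machine learning textbooks. I spent a few days working through the derivation myself and found that the mathematics behind it is fairly involved: it relies on matrix multiplication (dot products) and matrix calculus.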
Logistic regression is simply linear regression on the log-odds ln(p/(1-p)).
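To connect this statement to the code that follows, it helps to solve the log-odds relation for p. The derivation below is a short sketch that uses the notation xw for the linear predictor, as at the end of this post, and reads the regression target as the natural log of the odds. The result is the function that the listing implements as `sigmoid`.

```latex
\ln\frac{p}{1-p} = xw
\;\Longrightarrow\;
\frac{p}{1-p} = e^{xw}
\;\Longrightarrow\;
p = \frac{e^{xw}}{1+e^{xw}} = \frac{1}{1+e^{-xw}}
```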
from numpy import *

def loadDataSet():
    # Each line of testSet.txt is read as: x1 x2 label (whitespace-separated)
    dataMat = []; labelMat = []
    with open('testSet.txt') as fr:
        for line in fr:
            lineArr = line.strip().split()
            # Prepend 1.0 so that weights[0] acts as the intercept
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def gradAscent(dataMatIn, classLabels):
    # asmatrix is used instead of mat: mat was an alias for it and was removed in NumPy 2.0
    dataMatrix = asmatrix(dataMatIn)                 # m x n NumPy matrix
    labelMat = asmatrix(classLabels).transpose()     # m x 1 column vector
    m,n = shape(dataMatrix)
    alpha = 0.001                                    # step size
    maxCycles = 5000                                 # number of iterations
    weights = ones((n,1))
    for k in range(maxCycles):                       # heavy on matrix operations
        h = sigmoid(dataMatrix*weights)              # matrix mult: predicted probabilities
        error = (labelMat - h)                       # vector subtraction: y - h
        weights = weights + alpha * dataMatrix.transpose()* error  # matrix mult: ascent step
    return weights

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat,labelMat=loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []    # points with label 1
    xcord2 = []; ycord2 = []    # points with any other label
    for i in range(n):
        if int(labelMat[i])== 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # Decision boundary: weights[0] + weights[1]*x + weights[2]*y = 0, solved for y
    y = (-weights[0]-weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2');
    plt.show()

# Load the data, fit the weights, and plot the decision boundary
dataArr,labelMat=loadDataSet()
weights=gradAscent(dataArr,labelMat)
plotBestFit(weights.getA())
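The update inside `gradAscent` adds `alpha * X^T (y - h)` to the weights. One way to see where that expression comes from is to treat it as the gradient of the Bernoulli log-likelihood l(w) = sum_i [y_i ln(h_i) + (1 - y_i) ln(1 - h_i)] with h = sigmoid(Xw). The post does not spell out the objective, so that choice is an assumption here. The sketch below compares the matrix expression with a finite-difference gradient on small random data. The function names and the synthetic data are illustrative and are not part of the original listing or of testSet.txt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # Assumed objective: Bernoulli log-likelihood with h = sigmoid(X w)
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w, X, y):
    # The expression used in the gradAscent update: X^T (y - h)
    return X.T @ (y - sigmoid(X @ w))

def numeric_gradient(w, X, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (log_likelihood(w + step, X, y)
                   - log_likelihood(w - step, X, y)) / (2 * eps)
    return grad

# Synthetic data (hypothetical): 20 samples, intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_gradient(w, X, y) - numeric_gradient(w, X, y)))
print(max_diff)  # should be very small if the two gradients agree
```

If the two gradients agree, each pass of the loop in `gradAscent` is a plain gradient ascent step on that log-likelihood, with `alpha` as the step size.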
The resulting weights are:
matrix([[ 9.35184677],
[ 0.87401362],
[-1.28891422]])
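xw = 9.35 + 0.87x - 1.29y
Setting 9.35 + 0.87x - 1.29y = 0 gives the decision boundary, which is a straight line in the (x, y) plane. Why set it to zero? Logistic regression classifies with a probability threshold of 0.5. Since ln(p/(1-p)) = xw, substituting p = 0.5 gives ln(1) = 0, so xw = 0.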
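The boundary argument can be checked numerically with the weights reported above. The sketch below is only an illustration: `predict_proba` and `boundary_y` are hypothetical helper names, and the test points are arbitrary. It evaluates the sigmoid at points on the line xw = 0 and at points one unit above and below it.

```python
import numpy as np

# Weights reported in the text: [intercept, coefficient of x, coefficient of y]
w = np.array([9.35184677, 0.87401362, -1.28891422])

def predict_proba(x, y):
    # p = sigmoid(xw) with xw = w0 + w1*x + w2*y
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x + w[2] * y)))

def boundary_y(x):
    # Solve w0 + w1*x + w2*y = 0 for y (same formula as in plotBestFit)
    return (-w[0] - w[1] * x) / w[2]

xs = np.array([-3.0, 0.0, 3.0])
p_on_line = predict_proba(xs, boundary_y(xs))        # points on the boundary
p_above = predict_proba(0.0, boundary_y(0.0) + 1.0)  # one unit above the line
p_below = predict_proba(0.0, boundary_y(0.0) - 1.0)  # one unit below the line

print(p_on_line, p_above, p_below)
```

Because the coefficient of y is negative, moving above the line makes xw negative and pushes p below 0.5, while moving below the line pushes p above 0.5. With these weights, points under the line are therefore the ones assigned label 1.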