Naive Bayes and Logistic Regression Classification
Naive Bayes
Consider an example:
----------------------------------------------------------------------------------------------------------------------------------------------------
Let p(c1|x, y) denote the probability that the point (x, y) belongs to class 1, and p(c2|x, y) the probability that it belongs to class 2. Then:
if p(c1|x, y) > p(c2|x, y), predict class 1;
if p(c1|x, y) < p(c2|x, y), predict class 2.
By Bayes' theorem:
p(c|x, y) = p(x, y|c) * p(c) / p(x, y)
where (x, y) is the feature vector to be classified and c is the class label.
Since p(x, y) is the same for every class, we only need to compare p(x, y|c) * p(c).
p(c) is easy to estimate from the class frequencies in the training data.
For p(x, y|c), we first estimate, within each class, the probability of each feature (word) appearing in that class's training samples.
To classify a test sample, build its feature vector and take its dot product with the trained per-class feature probabilities.
From the training data, we estimate the probability of each word of the overall vocabulary appearing in each class.
1 'hello word' 0
2 'this is your problem' 0
3 'dont do is that' 1
There are 3 documents and 2 classes.
The vocabulary is ['hello', 'word', 'this', 'is', 'your', 'problem', 'dont', 'do', 'that']; note that duplicate words are removed.
Document 1's feature vector is [1, 1, 0, 0, 0, 0, 0, 0, 0].
Document 2's feature vector is [0, 0, 1, 1, 1, 1, 0, 0, 0].
Document 3's feature vector is [0, 0, 0, 1, 0, 0, 1, 1, 1].
Class 0:
document 1 + document 2 = [1, 1, 1, 1, 1, 1, 0, 0, 0]
sum(document 1) + sum(document 2) = 6
p(x, y|c0) = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6, 0, 0, 0]
p(c0) = 2/3
Class 1:
p(x, y|c1) = [0, 0, 0, 1/4, 0, 0, 1/4, 1/4, 1/4]
p(c1) = 1/3
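The hand computation above can be reproduced in a few lines of NumPy. This is a minimal sketch of the training step only (no smoothing yet); the variable names are my own, not from the book's code:

```python
import numpy as np

# The three toy documents and their class labels from the example above.
docs = [['hello', 'word'],
        ['this', 'is', 'your', 'problem'],
        ['dont', 'do', 'is', 'that']]
labels = np.array([0, 0, 1])

# Build the deduplicated vocabulary, keeping first-seen order.
vocab = []
for doc in docs:
    for w in doc:
        if w not in vocab:
            vocab.append(w)

# Binary (set-of-words) feature vectors, one row per document.
vectors = np.array([[1 if w in doc else 0 for w in vocab] for doc in docs])

def class_stats(c):
    """Word-frequency vector p(x|c) and prior p(c) for class c."""
    rows = vectors[labels == c]
    word_counts = rows.sum(axis=0)             # per-word counts within class c
    return word_counts / word_counts.sum(), len(rows) / len(docs)

p_x_c0, p_c0 = class_stats(0)   # [1/6]*6 + [0]*3, prior 2/3
p_x_c1, p_c1 = class_stats(1)   # 1/4 for 'is', 'dont', 'do', 'that', prior 1/3
print(p_x_c0, p_c0)
print(p_x_c1, p_c1)
```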
Note that in the actual computation we work in log space to avoid floating-point underflow. Up to the constant log p(x, y), the score being compared is
log p(c|x, y) ∝ sum(testVec * log p(x, y|c)) + log(p(c))
where the sum is the dot product of the test sample's feature vector with the per-word log probabilities.
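The toy counts above can be pushed through this log-space comparison. A minimal sketch, using the same add-one (Laplace) smoothing that trainNB0 applies below so that log() never sees a zero; the variable names and test words are my own:

```python
import numpy as np

# Per-word counts for each class, taken from the worked example above.
# Vocabulary order: hello, word, this, is, your, problem, dont, do, that
counts_c0 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])
counts_c1 = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])

def log_cond(counts):
    # Add-one smoothing (counts start at 1, denominator at 2),
    # so zero-count words get a small but nonzero probability.
    return np.log((counts + 1) / (counts.sum() + 2.0))

log_p0, log_p1 = log_cond(counts_c0), log_cond(counts_c1)
prior_c1 = 1 / 3

# A test document containing 'is', 'dont', 'that'.
test_vec = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1])

score0 = test_vec @ log_p0 + np.log(1 - prior_c1)
score1 = test_vec @ log_p1 + np.log(prior_c1)
print(1 if score1 > score0 else 0)   # → 1: these words favor class 1
```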
Worked example: spam filtering
Look at the code:
```python
# -*- coding: utf-8 -*-
"""
Created on Sat May 02 21:52:08 2015

@author: silingxiao
"""
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])                           # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)      # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)   # Laplace smoothing: counts start at 1
    p0Denom = 2.0; p1Denom = 2.0                     # and denominators at 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)    # log probabilities avoid underflow
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)        # element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

postingList, classVec = loadDataSet()
vocabSet = createVocabList(postingList)
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(vocabSet, postinDoc))

p0v, p1v, pAb = trainNB0(trainMat, classVec)

testingNB()
```
The output is as follows:
Logistic Regression Classification
Logistic regression is a special case of the generalized linear model, used mainly for classification.
Its link function is the sigmoid, which behaves much like a step function but, as required here, is first-order differentiable.
The model parameters are computed by gradient ascent (or gradient descent on the negated objective).
The update rule is w := w + alpha * grad(f(w)) (or, for descent, w := w - alpha * grad(f(w))), where the gradient direction is
grad(f(w)) = X^T * (y - sigmoid(X * w))
which is exactly the dataMatrix.transpose() * error step in gradAscent below.
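That update can be sketched in a few lines. This is a minimal batch gradient-ascent example on a tiny made-up separable data set (the data and names are my own, not the book's testSet.txt):

```python
import numpy as np

# Four points, column 0 is the bias feature x0 = 1.
X = np.array([[1.0, -1.0, -1.5],
              [1.0, -0.5, -1.0],
              [1.0,  1.0,  1.5],
              [1.0,  0.5,  2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.ones(3)      # same all-ones initialization as gradAscent
alpha = 0.1
for _ in range(1000):
    error = y - sigmoid(X @ w)        # y - h, the prediction error
    w = w + alpha * (X.T @ error)     # step along the gradient X^T (y - h)

print((sigmoid(X @ w) > 0.5).astype(int))   # → [0 0 1 1], the labels recovered
```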
Classifying two classes of points with Python
Goal: find a straight line that separates the two classes of points, i.e. solve for the model coefficients, using either the batch gradient-ascent method gradAscent or the improved stochastic gradient-ascent method stocGradAscent1.
Note: the boundary is y = w0*x0 + w1*x1 + w2*x2; setting y = 0 and solving gives x2 as a function of x1. The details are worked out below.
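Solving for x2 explicitly gives x2 = (-w0 - w1*x1) / w2, which is the line that plotBestFit draws. A tiny sketch with made-up illustrative weights (not fitted values):

```python
# Hypothetical fitted weights, with x0 = 1 as the bias feature.
w0, w1, w2 = 4.0, 0.5, -0.7

def boundary_x2(x1):
    # On the decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = (-w0 - w1*x1) / w2
    return (-w0 - w1 * x1) / w2

# Any point (x1, boundary_x2(x1)) satisfies the boundary equation exactly.
for x1 in (-2.0, 0.0, 2.0):
    print(x1, boundary_x2(x1), w0 + w1 * x1 + w2 * boundary_x2(x1))
```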
```python
# -*- coding: utf-8 -*-
"""
Created on Sun May 03 10:22:21 2015

@author: silingxiao
"""
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inx):
    return 1.0 / (1 + exp(-inx))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)                  # convert to NumPy matrix
    labelMat = mat(classLabels).transpose()      # convert to NumPy matrix
    m, n = shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = ones((n, 1))
    for k in range(maxCycles):                   # heavy on matrix operations
        h = sigmoid(dataMatrix * weights)        # matrix mult
        error = (labelMat - h)                   # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error   # matrix mult
    return weights

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]   # boundary: set w0*x0+w1*x1+w2*x2 = 0
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)                            # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))               # list() so del works under Python 3
        for i in range(m):
            # alpha decreases with iteration but never reaches 0 because of the constant
            alpha = 4 / (1.0 + j + i) + 0.0001
            # pick an as-yet-unused sample; index into dataIndex so every
            # sample is visited exactly once per pass
            randIndex = dataIndex[int(random.uniform(0, len(dataIndex)))]
            h = sigmoid(sum(dataMatrix[randIndex] * weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            dataIndex.remove(randIndex)
    return weights

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))

if __name__ == '__main__':
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)
    plotBestFit(weights.getA())
    #multiTest()
```
The output is as follows: