Machine Learning in Action Reading Notes (4). Classifying with Probability Theory: Naive Bayes
4.1 Classifying with Bayesian Decision Theory
Naive Bayes
Pros: works even with small amounts of data; handles multiple classes.
Cons: sensitive to how the input data is prepared.
Works with: nominal values.
The core idea of Bayesian decision theory: choose the decision with the highest probability.
4.2 Conditional Probability
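This part of the book builds up Bayes' rule, p(c|x) = p(x|c)p(c)/p(x), which lets us compute the conditional probability we want from the one we can measure. A minimal numeric sketch (all numbers here are made up purely for illustration):

# Hypothetical numbers, just to exercise Bayes' rule: suppose 60% of
# documents belong to class c, the word 'stupid' appears in 20% of
# class-c documents, and in 5% of the others.
pC = 0.6                     # prior p(c)
pWordGivenC = 0.2            # p(word|c)
pWordGivenNotC = 0.05        # p(word|not c)
# total probability gives the evidence term p(word)
pWord = pWordGivenC * pC + pWordGivenNotC * (1 - pC)
pCGivenWord = pWordGivenC * pC / pWord   # Bayes' rule
print(pCGivenWord)           # about 0.857: the word strongly suggests class c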
4.3 Classifying with Conditional Probability
4.4 Document Classification with Naive Bayes
The general approach to naive Bayes:
1. Collect the data
2. Prepare the data
3. Analyze the data
4. Train the algorithm
5. Test the algorithm
6. Use the algorithm
The other assumption in the naive Bayes classifier, besides the conditional independence among features used below, is that every feature is equally important.
4.5 Text Classification with Python
4.5.1 Prepare the data: making word vectors from text
Create the file bayes.py:
from numpy import *

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 is not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])               # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
import bayes
listOPosts, listClasses = bayes.loadDataSet()
myVocabList = bayes.createVocabList(listOPosts)
bayes.setOfWords2Vec(myVocabList, listOPosts[0])
bayes.setOfWords2Vec(myVocabList, listOPosts[3])
4.5.2 Train the algorithm: calculating probabilities from word vectors
Rewrite Bayes' rule for classification in terms of the word vector w:

p(ci|w) = p(w|ci) p(ci) / p(w)

Here w is a vector, so p(w|ci) expands to p(w0, w1, ..., wN | ci). If we assume all words are independent of one another (the so-called conditional independence assumption), this probability can be computed as p(w0|ci) p(w1|ci) ... p(wN|ci).
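A toy illustration of the factorization, with hypothetical per-word probabilities for a three-word vocabulary:

pW_given_c1 = [0.1, 0.3, 0.05]   # assumed values of p(w0|c1), p(w1|c1), p(w2|c1)
pDoc_given_c1 = 1.0
for p in pW_given_c1:
    pDoc_given_c1 *= p           # independence: multiply the per-word terms
print(pDoc_given_c1)             # about 0.0015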
Pseudocode for this function:

count the number of documents in each class
for every training document:
    for each class:
        if a token appears in the document: increment the count for that token
        increment the count of total tokens
for each class:
    for each token:
        divide the token count by the total token count to get the conditional probability
return the conditional probability for each class
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = zeros(numWords); p1Num = zeros(numWords)   # change to ones()
    p0Denom = 0.0; p1Denom = 0.0                       # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom                           # change to log()
    p0Vect = p0Num / p0Denom                           # change to log()
    return p0Vect, p1Vect, pAbusive
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)
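With this training set, pAb comes out to 0.5, since three of the six posts are labeled abusive, and each entry of p1V is the estimated probability of the corresponding vocabulary word given the abusive class.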
4.5.3 Test the algorithm: modifying the classifier for real-world conditions
When using the Bayesian classifier on a document, we multiply many probabilities together to get the probability that the document belongs to a given class, i.e. we compute p(w0|1)p(w1|1)... If any one of these probabilities is 0, the whole product becomes 0. To lessen this effect, initialize all word-occurrence counts to 1 and the denominators to 2.
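For example, if 'stupid' never occurs in a class-0 training document, the estimate p(stupid|0) is 0, and every test document containing 'stupid' then gets p(w|0) = 0 no matter what its other words are; with the counts started at 1 and the denominators at 2, an unseen word instead contributes a small nonzero factor of 1/(2 + total token count for that class).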
Modify trainNB0():
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)     # counts start at 1
    p0Denom = 2.0; p1Denom = 2.0                       # denominators start at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom                           # change to log()
    p0Vect = p0Num / p0Denom                           # change to log()
    return p0Vect, p1Vect, pAbusive
Another problem we run into is underflow, caused by multiplying many very small numbers. When computing p(w0|1)p(w1|1)..., most of the factors are tiny, so the program underflows or returns an incorrect answer. One solution is to take the natural logarithm of the product: since ln(a*b) = ln(a) + ln(b), working with logs avoids both underflow and floating-point round-off error. Nothing is lost in the process, because ln is monotonically increasing: the class whose product is largest is also the class whose log is largest. So modify trainNB0() once more:
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)     # counts start at 1
    p0Denom = 2.0; p1Denom = 2.0                       # denominators start at 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)                      # take logs to prevent underflow
    p0Vect = log(p0Num / p0Denom)                      # take logs to prevent underflow
    return p0Vect, p1Vect, pAbusive
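A quick self-contained check of the underflow problem and the log fix (the numbers are chosen only for illustration):

import math

probs = [1e-5] * 80
product = 1.0
for p in probs:
    product *= p              # 1e-5 ** 80 = 1e-400, below the smallest positive double
print(product)                # prints 0.0: the product has underflowed

logSum = sum(math.log(p) for p in probs)
print(logSum)                 # about -921.03, perfectly representable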
Write the classification function:
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)       # element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
bayes.testingNB()
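If everything is wired up correctly, the first entry should be classified as 0 and the second as 1: 'stupid' and 'garbage' occur only in abusive posts in the training data, while 'love', 'my', and 'dalmation' appear only in non-abusive ones.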
4.5.4 Prepare the data: the bag-of-words document model
Set-of-words model: each word is recorded at most once (presence or absence).
Bag-of-words model: a word is counted as many times as it appears in the document.
Change setOfWords2Vec() into bagOfWords2VecMN():
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
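A quick contrast of the two encodings on a made-up document (assuming both functions are in scope):

vocab = ['stupid', 'dog', 'my']        # made-up three-word vocabulary
doc = ['stupid', 'stupid', 'dog']      # made-up document with a repeated word
print(setOfWords2Vec(vocab, doc))      # [1, 1, 0]: presence only
print(bagOfWords2VecMN(vocab, doc))    # [2, 1, 0]: occurrence counts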
4.6 Filtering Spam Email with Naive Bayes
1. Collect: text files are provided
2. Prepare: parse the text files into token vectors
3. Analyze: inspect the tokens to make sure they were parsed correctly
4. Train: use the trainNB0() function we built earlier
5. Test: use classifyNB(), and build a new testing function that measures the error rate over a set of documents
6. Use: build a complete program that classifies a group of documents and prints misclassified ones to the screen
4.6.1 Prepare the data: tokenizing text
4.6.2 Test the algorithm: cross-validation with naive Bayes
def textParse(bigString):    # input is a big string, output is a word list
    import re
    # split on runs of non-word characters; r'\W*' would also split on
    # empty matches, so use r'\W+' instead
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)            # create the vocabulary
    trainingSet = list(range(50)); testSet = []     # randomly build the test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del trainingSet[randIndex]
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:                    # train the classifier with trainNB0 (get the probabilities)
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                        # classify the held-out items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
    #return vocabList, fullText
The program above randomly holds out 10 of the 50 emails as a test set. If every one is classified correctly it prints an error rate of 0.0; otherwise it prints the token list of each misclassified document along with the error rate.
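Because the train/test split is random, the error rate varies from run to run; calling spamTest() repeatedly (the email/spam and email/ham text files from the book's source code must be present) gives a better sense of the average performance:

import bayes
for _ in range(10):    # each call draws a fresh random 10-email test set
    bayes.spamTest()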
4.7 (unfinished)