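Classifying with Probability Theory: Naive Bayes (Document Classification with Naive Bayes)
Preface
The k-nearest neighbors algorithm and decision trees discussed earlier both give a definite answer about which class an instance belongs to. The classifier discussed today cannot always say for certain which class a data instance belongs to; instead, it gives the probability that the instance belongs to a given class.
Yingying's note: naive Bayes handles questions like the probability that it will rain today; you then use that probability to decide whether to take an umbrella.
Note: starting with this chapter, the complete code is no longer provided, only the code blocks for each algorithm.
Requirements
Take social media platforms as an example: they often block certain key words. We want to build a quick filter that flags a post as inappropriate if it uses negative or abusive language.
Steps
1. Prepare the data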
def loadDataSet():
    # Six tokenized sample posts and their labels
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 is not
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set()  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # one slot per vocabulary word
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark the word as present
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
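The function loadDataSet() creates some sample data. postingList is a collection of tokenized posts, and classVec is the list of class labels.
The function createVocabList(dataSet) builds the vocabulary: a list of the unique words that appear in the documents.
The function setOfWords2Vec(vocabList, inputSet) first creates a vector as long as the vocabulary, with every element set to 0.
It then walks through all the words in the document; whenever a word is in the vocabulary, the corresponding entry of the output document vector is set to 1.
Open the IDE and get more familiar with these three functions: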
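The two steps above (build a vocabulary, then mark which vocabulary words a document contains) can be sketched in a self-contained form. This is a variation rather than the code from the bayes module: the names create_sorted_vocab and words_to_vec are hypothetical, the vocabulary is sorted so that the output is reproducible, and a dict replaces the list.index() lookup.

```python
def create_sorted_vocab(docs):
    """Collect the unique words of all documents, in sorted order."""
    vocab = set()
    for doc in docs:
        vocab |= set(doc)
    return sorted(vocab)

def words_to_vec(vocab, words):
    """Return a 0/1 vector marking which vocabulary words occur in `words`."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> position
    vec = [0] * len(vocab)
    for word in words:
        if word in index:  # words outside the vocabulary are ignored
            vec[index[word]] = 1
    return vec

docs = [['my', 'dog'], ['stupid', 'dog']]
vocab = create_sorted_vocab(docs)          # ['dog', 'my', 'stupid']
vec = words_to_vec(vocab, ['my', 'cat'])   # [0, 1, 0]
```

Sorting is only for readability here; the classifier does not depend on the order of the vocabulary, as long as the same order is used for training and for classification.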
>>> import bayes
>>> listOPosts, listClasses = bayes.loadDataSet()
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocabList = bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
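Notice that no word appears more than once. (The order may differ on your machine, because the list is built from a Python set.)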
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
listOPosts[0]
['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
2. Train the algorithm
from numpy import zeros

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)   # 6 documents in this example
    numWords = len(trainMatrix[0])    # 32 words in the vocabulary
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # 3/6.0
    p0Num = zeros(numWords); p1Num = zeros(numWords)  # change to ones()
    p0Denom = 0.0; p1Denom = 0.0                      # change to 2.0
    for i in range(numTrainDocs):  # i = 0, 1, ..., 5
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]          # per-word counts for class 1
            p1Denom += sum(trainMatrix[i])   # total words seen in class 1
        else:
            p0Num += trainMatrix[i]          # per-word counts for class 0
            p0Denom += sum(trainMatrix[i])   # total words seen in class 0
    p1Vect = p1Num / p1Denom  # change to log()
    p0Vect = p0Num / p0Denom  # change to log()
    return p0Vect, p1Vect, pAbusive
trainCategory
[0, 1, 0, 1, 0, 1]
trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
...
>>> p0v, p1v, pab = bayes.trainNB0(trainMat, listClasses)
>>> p0v
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1v
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])
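pab = 0.5 means that the probability of a document belonging to the abusive class is 0.5. Six posts were supplied and three of them are abusive, so the probability of an abusive post is 3/6 = 0.5.
Yingying's note: the data processing above can be seen as taking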
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
and splitting its rows into two classes according to the given labels [0, 1, 0, 1, 0, 1].
The first class contains the rows labeled 0:
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
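['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]
For each vocabulary word, count how many times it appears in these rows, then divide by the total number of words in this class, 24.
(How to read "the number of times it appears": look each word up in the vocabulary, and every time it is found, add one to that word's tally, so the tally grows with each match.)
The counts are: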
([ 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
0., 0., 2., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1.,
0., 1., 0., 1., 1., 3.])
Likewise, for the abusive rows labeled 1:
[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
looking the words up in the vocabulary gives these counts:
([ 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
1., 1., 1., 1., 1., 0., 2., 0., 1., 1., 0., 2., 0.,
3., 0., 1., 0., 0., 0.])
Each count is then divided by the total number of words in this class, 19.
Seen this way, the idea becomes much clearer.
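The counting described in the note can be reproduced with plain Python, without NumPy. This is a minimal sketch, assuming the six posts and labels from loadDataSet(); the helper name class_word_counts is hypothetical and is not part of the bayes module.

```python
from collections import Counter

posts = [
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'],
]
labels = [0, 1, 0, 1, 0, 1]

def class_word_counts(posts, labels, target):
    """Tally words over all posts whose label equals `target`."""
    counts = Counter()
    for post, label in zip(posts, labels):
        if label == target:
            # set-of-words model: a word counts at most once per post
            counts.update(set(post))
    return counts

counts0 = class_word_counts(posts, labels, 0)
counts1 = class_word_counts(posts, labels, 1)
total0 = sum(counts0.values())  # 24 words in class 0
total1 = sum(counts1.values())  # 19 words in class 1

p_my_given_0 = counts0['my'] / float(total0)          # 3/24 = 0.125
p_stupid_given_1 = counts1['stupid'] / float(total1)  # 3/19, about 0.158
```

These two ratios match the last entry of p0v (for 'my') and the largest entry of p1v (for 'stupid') in the output above.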
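If any word has probability 0 for a class, the product of the word probabilities for that class is also 0. To avoid this, we initialize every word count to 1 and each denominator to 2. Multiplying many small probabilities can also underflow, so we work with log(p) instead of p: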
p0Num = ones(numWords); p1Num = ones(numWords)  # changed to ones()
p0Denom = 2.0; p1Denom = 2.0                    # changed to 2.0
p1Vect = log(p1Num / p1Denom)  # changed to log()
p0Vect = log(p0Num / p0Denom)  # changed to log()
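A small numeric illustration of why these two changes help. It is a sketch with made-up probabilities (the lists probs_with_zero and tiny are illustrative only); the last line uses the class-0 totals from this example.

```python
import math

# A single zero factor wipes out the whole product,
# no matter what the other words say.
probs_with_zero = [0.1, 0.0, 0.2]
product = 1.0
for p in probs_with_zero:
    product *= p

# Many small factors underflow to 0.0 in floating point...
tiny = [1e-5] * 100
underflow = 1.0
for p in tiny:
    underflow *= p

# ...while the sum of their logarithms is still an ordinary number.
log_sum = sum(math.log(p) for p in tiny)

# With counts starting at 1 and the denominator at 2, a word never seen
# in class 0 (24 words in total) gets a small non-zero probability.
smoothed_unseen = (0 + 1) / (24 + 2.0)
```

Because the logarithm is monotonic, comparing sums of logs picks the same class as comparing the original products would.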
The naive Bayes classification function
from numpy import array, log

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # p0Vec and p1Vec hold log probabilities, so the products become sums
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # element-wise multiplication
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print("%s classified as: %d" % (testEntry, classifyNB(thisDoc, p0V, p1V, pAb)))
>>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
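The second result can be checked by hand. The sketch below recomputes the two scores for ['stupid', 'garbage'] with the standard math module, assuming the smoothed version of trainNB0 (counts start at 1, denominators at 2) and the class totals of 19 and 24 words from the training data; the variable names are illustrative only.

```python
import math

# Smoothed estimates: (count in class + 1) / (words in class + 2)
p_stupid_1 = (3 + 1) / (19 + 2.0)   # 'stupid' appears in 3 abusive posts
p_garbage_1 = (1 + 1) / (19 + 2.0)  # 'garbage' appears in 1 abusive post
p_stupid_0 = (0 + 1) / (24 + 2.0)   # neither word appears in class 0
p_garbage_0 = (0 + 1) / (24 + 2.0)
prior_1 = 0.5                       # pAbusive from the training step

# Same comparison as classifyNB: sum of log likelihoods plus log prior
score_1 = math.log(p_stupid_1) + math.log(p_garbage_1) + math.log(prior_1)
score_0 = math.log(p_stupid_0) + math.log(p_garbage_0) + math.log(1.0 - prior_1)

predicted = 1 if score_1 > score_0 else 0
```

score_1 is the larger of the two, so the entry is assigned to class 1, in agreement with the testingNB() output.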
The bag-of-words document model
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # count every occurrence
    return returnVec
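This is almost identical to setOfWords2Vec(). The only difference is that each time a word is encountered, the corresponding entry in the vector is incremented rather than just set to 1.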
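The difference only shows up when a word is repeated within one document. A minimal side-by-side sketch, using simplified stand-ins for the two functions (set_of_words and bag_of_words are hypothetical names) and a three-word toy vocabulary:

```python
def set_of_words(vocab, words):
    """Set-of-words model: 1 if the word occurs at all, else 0."""
    vec = [0] * len(vocab)
    for word in words:
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec

def bag_of_words(vocab, words):
    """Bag-of-words model: number of times each word occurs."""
    vec = [0] * len(vocab)
    for word in words:
        if word in vocab:
            vec[vocab.index(word)] += 1
    return vec

vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']   # 'stupid' appears twice

set_vec = set_of_words(vocab, doc)  # [1, 1, 0]
bag_vec = bag_of_words(vocab, doc)  # [1, 2, 0]
```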