Background: consider the message board of an online community. To keep the community healthy, we want to block insulting posts, so we build a fast filter: if a message uses negative or insulting language, it gets flagged as inappropriate. Filtering this kind of content is a very common requirement. For this problem we set up two classes, insulting and non-insulting, labeled 1 and 0 respectively.
We first show how to convert text into numeric vectors, then how to compute conditional probabilities from those vectors, and finally how to build a classifier on top of them. Create a new file named bayes.py.
1. Prepare the data: build word vectors from text
To extract features from text, we first need to split the text. The features here are tokens of the text; a token is any combination of characters. You can think of tokens as words, but non-word tokens such as URLs, IP addresses, or any other strings work too. Each piece of text is then represented as a token vector, where 1 means the token appears in the document and 0 means it does not.
We will treat each text as a word vector or token vector, i.e. convert sentences into vectors. We look at all the words that appear across all documents, decide which of them to include in the vocabulary (the word set we want), and then convert each document into a vector over that vocabulary.
#!/usr/bin/python
# -*- coding: utf-8 -*-
#word-list-to-vector conversion functions
from numpy import *

def loadDataSet():
    postingList = [['my','dog','has','flea','problem','help','please'],
                   ['maybe','not','take','him','to','dog','park','stupid'],
                   ['my','dalmation','is','so','cute','I','love','him'],
                   ['stop','posting','stupid','worthless','garbage'],
                   ['mr','licks','ate','my','steak','how','to','stop','him'],
                   ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1]    #1 = abusive wording, 0 = normal speech
    return postingList,classVec

#build a list of the unique words that appear in any of the documents
def createVocabList(dataSet):
    vocabSet = set([])                      #start from an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets; a set holds no duplicate words
    return list(vocabSet)

#input: the vocabulary list and a document; output: the document vector
def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)      #a zero vector as long as the vocabulary
    for word in inputSet:               #for every word of the document that appears in the
        if word in vocabList:           #vocabulary, set the matching slot of the output to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word:%s is not in my Vocabulary!" % word
    return returnVec
Save bayes.py, then enter the following at the Python prompt:
>>> import bayes
>>> listOPosts,listClasses=bayes.loadDataSet()
>>> myVocabList=bayes.createVocabList(listOPosts)
>>> myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'stop', 'is', 'park', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'problem', 'steak', 'my']
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[1])
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[2])
[1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[3])
[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[4])
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1]
>>> bayes.setOfWords2Vec(myVocabList,listOPosts[5])
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
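As a quick sanity check (one possible way to continue the same session), the number of 1s in a returned vector should equal the number of distinct words in the post, since the set-of-words model only records presence or absence:

>>> vec=bayes.setOfWords2Vec(myVocabList,listOPosts[0])
>>> sum(vec)==len(set(listOPosts[0]))
True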
2. Train the algorithm: compute probabilities from the word vectors
The previous section showed how to turn a set of words into a set of numbers; now let's see how to compute probabilities from those numbers. At this point we know whether each word appears in a document, and we know which class each document belongs to.
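For reference, here is the math the training function estimates (standard Bayes' rule, stated in the notation used below): for a word vector w and class Ci,

    p(Ci|w) = p(w|Ci) * p(Ci) / p(w)

The "naive" assumption is that the words are conditionally independent given the class, so p(w|Ci) = p(w0|Ci)p(w1|Ci)...p(wN|Ci). Since p(w) is the same for both classes, comparing the two classes only requires the per-word conditionals p(w|C0) and p(w|C1) plus the prior p(C1); those are exactly the three values (p0Vect, p1Vect, pAbusive) the next function computes from counts.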
#naive Bayes classifier training function
def trainNBO(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)   #prior p(C1): fraction of abusive documents
    p0Num = zeros(numWords); p1Num = zeros(numWords)    #per-word counts for each class
    p0Demo = 0.0; p1Demo = 0.0                          #total token counts for each class
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Demo += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Demo += sum(trainMatrix[i])
    p1Vect = p1Num/p1Demo   #vector of p(w|C1)
    p0Vect = p0Num/p0Demo   #vector of p(w|C0)
    return p0Vect,p1Vect,pAbusive
Add the code above to bayes.py, then enter the following at the Python prompt:
>>> reload(bayes)
<module 'bayes' from 'bayes.py'>
>>> listOPosts,listClasses=bayes.loadDataSet()
>>> listOPosts
[['my', 'dog', 'has', 'flea', 'problem', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
>>> listClasses
[0, 1, 0, 1, 0, 1]
>>> myVocabList=bayes.createVocabList(listOPosts)
>>> trainMat=[]
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...
>>> trainMat
[[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]]
>>> p0V,p1V,pAb=bayes.trainNBO(trainMat,listClasses)
>>> pAb
0.5
>>> p0V    #the entries are p(w|C0); e.g. 0.04166667 = (total count of 'cute' in the non-abusive docs 0, 2, 4)/(total token count of the non-abusive docs) = 1/24
array([ 0.04166667, 0.04166667, 0.04166667, 0. , 0. ,
0.04166667, 0.04166667, 0.04166667, 0. , 0.04166667,
0.04166667, 0.04166667, 0. , 0. , 0.08333333,
0. , 0. , 0.04166667, 0. , 0.04166667,
0.04166667, 0. , 0.04166667, 0.04166667, 0.04166667,
0. , 0.04166667, 0. , 0.04166667, 0.04166667,
0.04166667, 0.125 ])
>>> p1V    #the entries are p(w|C1); e.g. 0.15789474 = (total count of 'stupid' in the abusive docs 1, 3, 5)/(total token count of the abusive docs) = 3/19
array([ 0. , 0. , 0. , 0.05263158, 0.05263158,
0. , 0.05263158, 0. , 0.05263158, 0. ,
0. , 0. , 0.05263158, 0.05263158, 0.05263158,
0.05263158, 0.05263158, 0. , 0.10526316, 0. ,
0.05263158, 0.05263158, 0. , 0.10526316, 0. ,
0.15789474, 0. , 0.05263158, 0. , 0. ,
0. , 0. ])
Explanation:
>>> numTrainDocs=len(trainMat)
>>> numTrainDocs
6
>>> numWords=len(trainMat[0])
>>> numWords
32
>>> from numpy import *
>>> p0Num=zeros(numWords);p1Num=zeros(numWords);p0Demo=0.0;p1Demo=0.0
>>> p0Num
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.])
>>> for i in range(numTrainDocs):
...     if listClasses[i]==1:
...         p1Num+=trainMat[i]
...         p1Demo+=sum(trainMat[i])
...     else:
...         p0Num+=trainMat[i]
...         p0Demo+=sum(trainMat[i])
...
>>> p0Num    #sum of the word vectors of the non-abusive documents (rows 1, 3, 5 of trainMat)
array([ 1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,
        0.,  2.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,
        1.,  0.,  1.,  1.,  1.,  3.])
>>> p0Demo
24.0
>>> p1Num    #sum of the word vectors of the abusive documents (rows 2, 4, 6 of trainMat)
array([ 0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
        1.,  1.,  1.,  1.,  0.,  2.,  0.,  1.,  1.,  0.,  2.,  0.,  3.,
        0.,  1.,  0.,  0.,  0.,  0.])
>>> p1Demo
19.0
3. Test the algorithm: modify the classifier for real-world conditions
When the Bayes classifier classifies a document, it multiplies many probabilities together to obtain the probability that the document belongs to a class, i.e. it computes P(w0|1)P(w1|1)P(w2|1)... If any one of those probabilities is 0, the whole product is 0. To lessen this effect, we can initialize every word's occurrence count to 1 and the denominators to 2 (a form of additive, Laplace-style smoothing). Change the corresponding lines of trainNBO():
p0Num = ones(numWords); p1Num = ones(numWords)  #start every word count at 1 so no p(wj|ci) is 0 and the product p(w0|1)p(w1|1)... cannot collapse to 0
p0Demo = 2.0; p1Demo = 2.0                      #initialize the denominators to 2
and change the two probability lines to:

p1Vect = log(p1Num/p1Demo)  #most factors in p(w0|1)p(w1|1)... are tiny, so multiplying them underflows
p0Vect = log(p0Num/p0Demo)  #(the result rounds to 0); taking logs keeps the numbers in a safe range
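To see why the log transform matters, here is a minimal standalone sketch (illustrative only, not part of bayes.py; the values 0.05 and 500 are arbitrary stand-ins for per-word conditional probabilities and vocabulary size):

#!/usr/bin/python
# Illustrative only: a product of many small probabilities underflows
# 64-bit floats to 0.0, while the sum of their logs stays finite.
# Since log(a*b) = log(a) + log(b) and log is monotonic, comparing
# sums of logs picks the same class as comparing the raw products would.
from math import log

probs = [0.05]*500      # stand-in per-word conditional probabilities
product = 1.0
for p in probs:
    product *= p        # underflows to exactly 0.0 long before the loop ends
logSum = sum(log(p) for p in probs)
print product           # 0.0
print logSum            # about -1497.9, still perfectly usable for comparison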
After rerunning (rebuild trainMat and retrain exactly as before):
>>> reload(bayes)
<module 'bayes' from 'bayes.py'>
>>> listOPosts,listClasses=bayes.loadDataSet()
>>> myVocabList=bayes.createVocabList(listOPosts)
>>> trainMat=[]
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...
>>> p0V,p1V,pAb=bayes.trainNBO(trainMat,listClasses)
>>> pAb
0.5
>>> p0V    #the entries are p(w|C0); e.g. -2.56494936 = log((total count of 'cute' in the non-abusive docs 0, 2, 4, plus 1)/(total token count of the non-abusive docs, plus 2)) = log(2/26)
array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
-2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
-2.56494936, -2.56494936, -3.25809654, -3.25809654, -2.15948425,
-3.25809654, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
-2.56494936, -3.25809654, -2.56494936, -2.56494936, -2.56494936,
-3.25809654, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
       -2.56494936, -1.87180218])
>>> p1V    #the entries are p(w|C1); e.g. -1.65822808 = log((total count of 'stupid' in the abusive docs 1, 3, 5, plus 1)/(total token count of the abusive docs, plus 2)) = log(4/21)
array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
-3.04452244, -2.35137526, -3.04452244, -2.35137526, -3.04452244,
-3.04452244, -3.04452244, -2.35137526, -2.35137526, -2.35137526,
-2.35137526, -2.35137526, -3.04452244, -1.94591015, -3.04452244,
-2.35137526, -2.35137526, -3.04452244, -1.94591015, -3.04452244,
-1.65822808, -3.04452244, -2.35137526, -3.04452244, -3.04452244,
-3.04452244, -3.04452244])
Explanation:
>>> numTrainDocs=len(trainMat)
>>> numTrainDocs
6
>>> numWords=len(trainMat[0])
>>> numWords
32
>>> from numpy import *
>>> p0Num=ones(numWords);p1Num=ones(numWords);p0Demo=2.0;p1Demo=2.0
>>> p0Num
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.])
>>> for i in range(numTrainDocs):
...     if listClasses[i]==1:
...         p1Num+=trainMat[i]
...         p1Demo+=sum(trainMat[i])
...     else:
...         p0Num+=trainMat[i]
...         p0Demo+=sum(trainMat[i])
...
>>> p0Num
array([ 2.,  2.,  2.,  1.,  1.,  2.,  2.,  2.,  1.,  2.,  2.,  2.,  1.,
        1.,  3.,  1.,  1.,  2.,  1.,  2.,  2.,  1.,  2.,  2.,  2.,  1.,
        2.,  1.,  2.,  2.,  2.,  4.])
>>> p0Demo
26.0
>>> sum(p0Num)
56.0
>>> p1Num
array([ 1.,  1.,  1.,  2.,  2.,  1.,  2.,  1.,  2.,  1.,  1.,  1.,  2.,
        2.,  2.,  2.,  2.,  1.,  3.,  1.,  2.,  2.,  1.,  3.,  1.,  4.,
        1.,  2.,  1.,  1.,  1.,  1.])
>>> p1Demo
21.0
>>> sum(p1Num)
51.0
#naive Bayes classification functions
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):   #vec2Classify is the vector to classify
    #element-wise multiplication: multiply the corresponding entries of the two
    #vectors, sum the results over every word in the vocabulary, then add the
    #log of the class prior; since the p-vectors hold logs, this sum stands
    #for the product of the conditional probabilities
    p1 = sum(vec2Classify*p1Vec)+log(pClass1)
    p0 = sum(vec2Classify*p0Vec)+log(1.0-pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
    p0V,p1V,pAb = trainNBO(array(trainMat),array(listClasses))
    testEntry = ['love','my','dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList,testEntry))
    print testEntry,'classified as:',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid','garbage']
    thisDoc = array(setOfWords2Vec(myVocabList,testEntry))
    print testEntry,'classified as:',classifyNB(thisDoc,p0V,p1V,pAb)
Add the code above to bayes.py, then enter at the Python prompt:
>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> bayes.testingNB()
['love', 'my', 'dalmation'] classified as: 0
['stupid', 'garbage'] classified as: 1
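To classify raw text rather than a hand-built token list, one possible convenience wrapper could be added to bayes.py. This is a hypothetical helper, not part of the book's listing; the name classifyText and the whitespace tokenization are assumptions:

#hypothetical helper, not in the original listing
def classifyText(text,vocabList,p0V,p1V,pClass1):
    tokens = text.lower().split()   #naive whitespace tokenization (an assumption)
    thisDoc = array(setOfWords2Vec(vocabList,tokens))
    return classifyNB(thisDoc,p0V,p1V,pClass1)

Words absent from the vocabulary simply trigger the "not in my Vocabulary" warning inside setOfWords2Vec and are ignored by the classifier.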