
Naive Bayes

This post uses a naive Bayes classifier to classify documents.

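To get features from text, the text first has to be split into tokens. The code below skips that step and directly creates already-tokenized documents as training data. The function returns two values: the training documents and a list holding the class label of each document: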

def loadDataSet():
    # postingList is the collection of documents, already split into tokens
    postingList = [
        ['my','dog','has','flea','problems','help','please'],
        ['maybe','not','take','him','to','dog','park','stupid'],
        ['my','dalmation','is','so','cute','I','love','him'],
        ['stop','posting','stupid','worthless','garbage'],
        ['mr','licks','ate','my','steak','how','to','stop','him'],
        ['quit','buying','worthless','dog','food','stupid']
    ]
    classVec = [0,1,0,1,0,1]  # class label of each document
    return postingList,classVec

Next, build a vocabulary containing every distinct word that appears in any of the documents:

# Return a list of the distinct words that appear across all documents
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with the words of this document
    return list(vocabSet)   # return as a list
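
As a quick illustration of the vocabulary step, here is a minimal sketch on two toy documents. The documents and the name `build_vocab` are made up for this example. It also sorts the result, which the function above does not do, only so that the word order is reproducible from run to run.

```python
def build_vocab(documents):
    """Return the sorted list of distinct words across all documents."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)  # set union drops duplicate words
    return sorted(vocab)   # sorting is an addition here, for a stable order


toy_docs = [["my", "dog", "my"], ["dog", "park"]]
vocab = build_vocab(toy_docs)
print(vocab)  # ['dog', 'my', 'park']
```

Because a plain `list(set(...))` has no guaranteed order, the position of each word in the document vectors can differ between runs unless the vocabulary is sorted or otherwise fixed.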

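The next function preprocesses the training data. It takes the vocabulary and a document and returns a document vector whose elements are 0 or 1, indicating whether each vocabulary word appears in the input document: 1 means present, 0 means absent.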

def setOfWords2Vec(vocabList,inputSet):  # vocabList: reference vocabulary  inputSet: document to check
    returnVec = [0] * len(vocabList)     # default 0 (no vocabulary word appears in inputSet)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark this word as present
    return returnVec
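
To make the 0/1 encoding described above concrete, the sketch below vectorizes one document against a small fixed vocabulary. The vocabulary and the document are hypothetical, and `set_of_words_vec` is a stand-in name used only for this illustration.

```python
def set_of_words_vec(vocab_list, document):
    """Return a 0/1 vector marking which vocabulary words occur in document."""
    index = {word: i for i, word in enumerate(vocab_list)}  # word -> position
    vec = [0] * len(vocab_list)
    for word in document:
        if word in index:
            vec[index[word]] = 1  # presence only; repeated words do not add up
    return vec


vocab = ["cute", "dog", "my", "stupid"]
print(set_of_words_vec(vocab, ["my", "dog", "my", "unknown"]))  # [0, 1, 1, 0]
```

Words that are not in the vocabulary are simply ignored. The dictionary lookup is a design choice of this sketch: it avoids scanning the list with `index` for every word, which matters once the vocabulary grows large.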

Next come the naive Bayes training function and the classification function:

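import numpy as np

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)  # total number of training documents
    numWords = len(trainMatrix[0])   # vocabulary size
    pAbusive = sum(trainCategory) / numTrainDocs   # prior probability of class 1
    # start counts at 1 and denominators at 2.0 so that no word gets probability 0
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0DeNom = 2.0
    p1DeNom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]  # class 1: number of occurrences of each word
            p1DeNom += sum(trainMatrix[i])  # class 1: total word count
        else:
            p0Num += trainMatrix[i]  # class 0: number of occurrences of each word
            p0DeNom += sum(trainMatrix[i])  # class 0: total word count
    p1Vect = np.log(p1Num/p1DeNom)  # log P(word|1); logs avoid underflow
    p0Vect = np.log(p0Num/p0DeNom)  # log P(word|0)
    return  p0Vect, p1Vect, pAbusive

# vec2Classify -> vector to classify  p0Vec -> log P(word|0)  p1Vec -> log P(word|1)  pClass1 -> prior probability of class 1
def classifyNB(vec2Classify,p0Vec,p1Vec,pClass1):
    # add the log-likelihoods of the words present to the log prior of each class
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0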
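
Two practical details of naive Bayes are easier to see in numbers: starting the counts at 1 with a denominator of 2.0, as `trainNB0` does, so that a word never seen in a class does not get probability zero, and comparing sums of log-probabilities rather than products of probabilities, which can underflow on long documents. The sketch below illustrates both. All counts and probabilities in it are made up for the illustration and do not come from the training data above.

```python
import numpy as np

# Hypothetical counts of three words in one class; the third was never seen.
counts = np.array([3.0, 1.0, 0.0])

unsmoothed = counts / counts.sum()               # third entry is exactly 0
smoothed = (counts + 1) / (counts.sum() + 2.0)   # same initialization as trainNB0

print(unsmoothed[2])  # 0.0, which would zero out any product it appears in
print(smoothed[2])    # small but positive

# A hypothetical document with 400 words, each with probability 0.01.
probs = np.full(400, 0.01)
product = np.prod(probs)         # 1e-800 is below the float64 range
log_sum = np.sum(np.log(probs))  # the same quantity in log space

print(product)  # 0.0 (underflow)
print(log_sum)  # finite, roughly -1842.07
```

Since the logarithm is monotonic, comparing log scores picks the same class as comparing the products would. Note also that with more than two words the smoothed values here no longer sum exactly to 1; adding the vocabulary size to the denominator instead of 2.0 would keep them normalized.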

The following function tests the classifier:

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)  # vocabulary
    trainMat = []
    for postInDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postInDoc))
    p0V, p1V, pAb = trainNB0(trainMat, listClasses)
    testEntry = ['love','my','dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList,testEntry))
    print(testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb))
    testEntry2 = ['stupid','garbage']
    thisDoc2 = np.array(setOfWords2Vec(myVocabList,testEntry2))
    print(testEntry2,'classified as: ', classifyNB(thisDoc2,p0V,p1V,pAb))

Calling the function prints:

['love', 'my', 'dalmation'] classified as: 0

['stupid', 'garbage'] classified as: 1 

 

posted @ 2017-09-09 21:08  韦木三