Machine Learning in Action: Naive Bayes Spam Email Classification

Naive Bayes

Concepts

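If the idea behind naive Bayes is unclear, work through conditional probability, the law of total probability, and Bayes' theorem, in that order.

Links that may help:

Link 1: https://blog.csdn.net/Hearthougan/article/details/75174210

Link 2: https://www.cnblogs.com/hellcat/p/7195843.html

Naive Bayes classification is a very simple classification algorithm. It is called "naive" because the idea behind it really is naive: for a given item to classify, compute the probability of each class conditioned on that item, and assign the item to the class with the highest probability.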
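The decision rule described above can be sketched in a few lines. This is only an illustration, not code from the project: the function name `classify` and all the numbers are hypothetical, and a single feature (one word) stands in for the whole email.

```python
def classify(priors, likelihoods):
    """Return (best_class, posteriors) for one observed item.

    priors:      {class: P(class)}
    likelihoods: {class: P(item | class)}
    """
    # Law of total probability: P(item) = sum of P(item | c) * P(c) over all classes
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    # Bayes' theorem: P(c | item) = P(item | c) * P(c) / P(item)
    posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
    # Pick the class with the largest posterior probability
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical numbers: 40% of all mail is spam; the observed word
# appears in 30% of spam emails and in 2% of normal emails.
best, post = classify({"spam": 0.4, "normal": 0.6},
                      {"spam": 0.30, "normal": 0.02})
print(best, post)
```

With these made-up inputs the spam posterior is 0.12 / 0.132, so the item is assigned to the spam class.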

Practice

This example implements spam email classification with naive Bayes.

Dataset samples:

Sample from the Data->normal folder:

200.txt:

Return-Path: <cai@tsinghua.edu.cn>

Received: from mail.tsinghua.edu.cn (mail.tsinghua.edu.cn [166.111.8.18])

by home.ccert.edu.cn (8.13.1/8.13.1) with SMTP id i9S1aCPt007420

for <jiang@ccert.edu.cn>; Thu, 28 Oct 2004 09:36:12 +0800

Received: (eyou send program); Thu, 28 Oct 2004 09:33:01 +0800

Message-ID: <298927181.07940@mail.tsinghua.edu.cn>

Received: from unknown (HELO mail.tsinghua.edu.cn) (unknown@127.0.0.1)

by 127.0.0.1 with SMTP; Thu, 28 Oct 2004 09:33:01 +0800

X-scanvirus: By Symantec Scan Engine

X-scanresult: CLEAN

Received: (eqmail ); 28 Oct 2004 01:32:52 -0000

Received: from unknown (HELO sony) (duanhx@202.112.50.6)

by localhost with SMTP; 28 Oct 2004 01:32:52 -0000

Message-ID: <009f01c4bc90$12b41200$c63270ca@sony>

From: "cai" <cai@tsinghua.edu.cn>

To: jiang@ccert.edu.cn

Subject: =?gb2312?B?ofEgW7T6t6Jd1dDGuMTayN2x4Lyt0rvD+6Osu7bTrc3GvPY=?=

Date: Thu, 28 Oct 2004 09:47:18 +0800

MIME-Version: 1.0

Content-Type: text/plain;

charset="gb2312"

X-Priority: 3

X-MSMail-Priority: Normal

X-Mailer: Microsoft Outlook Express 6.00.2900.2180

X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180

Content-Transfer-Encoding: 8bit

X-MIME-Autoconverted: from base64 to 8bit by home.ccert.edu.cn id i9S1aCPt007420

X-UIDL: 0~p"!(4g!!K0I!!][`!!

项目管理者联盟招聘内容编辑一名，欢迎大家帮忙推荐。

工作地点：北京市德胜门外

学历：大专以上

工作年限：一年以上

内容编辑的工作内容如下：

1. 负责项目管理者联盟网站[http://www.mypm.net]内容的收集、整理、编辑。

2. 配合内容总编，进行网站栏目的规划和更新、活动组织等工作。

职位的基本要求如下：

1. 具有较强的文字功底、良好的语言表达和沟通能力

2. 熟悉网络操作和office软件操作

3. 为人踏实肯干，男女不限

4. 有网站内容编辑工作经验者优先

5. 对项目管理知识有一定了解者优先

欢迎大家推荐人员应聘。

应聘者请将简历发送邮件至：yhua@mypm.net

并请在简历中注明待遇要求。

Sample from the Data->spam folder:

1.txt:

Return-Path: <qing@163.com>

Received: from 163.com ([219.133.253.235])

by spam-gw.ccert.edu.cn (MIMEDefang) with ESMTP id j68BPGTT015150

for <cheng@ccert.edu.cn>; Fri, 08 Jul 2005 19:25:25 +0800 (CST)

Message-ID: <200507081925.j68BPGTT015150@spam-gw.ccert.edu.cn>

From: qing@163.com" <jsj@163.com>

Subject: =?gb2312?B?u6W73bulwPuhormyzay3otW5?=

To: cheng@ccert.edu.cn

Content-Type: text/plain;charset="GB2312"

Reply-To: qing@163.com

Date: Fri, 8 Jul 2005 19:35:35 +0800

X-Priority: 2

X-Mailer: Foxmail 4.1 [cn]

1.txt:

您好！

尊敬的客户：本公司长期代理进出口报关业务.有些发票可以为广大客户优惠代开(税率1.5%左右)

以解广大客户财务票据得不足. 具体有（增值税专用发票、国税商品销售专用发票、地税运输专用

发票、建筑安装专用发票；广告专用发票；还有其他服务发票）等，希望有意者来电详谈，愿合作

愉快，成功！可验证后付款!!

联系人：王政

手机：13670068682。

电话：0755-21151603。

邮箱：haihongsz@126.com

地址：深圳市罗湖区深南路京鹏大厦。

深圳市海宏实业有限公司

The Data->test folder contains a mix of normal and spam emails.

Code:

main.py

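#encoding=utf-8
'''
Created on March 11, 2018

@author: icmll
'''

from spam.spamEmail import spamEmailBayes
import re
# Instance of the spam classifier class
spam=spamEmailBayes()
# Dictionaries holding word frequencies
spamDict={}
normDict={}
testDict={}
# Words that appear in the email currently being processed
wordsList=[]
wordsDict={}
# Prediction results: key is the file name, value is the predicted class (1 = spam, 0 = normal)
testResult={}
# File name lists for the normal, spam, and test emails
normFileList=spam.get_File_List("./../data/normal")
spamFileList=spam.get_File_List("./../data/spam")
testFileList=spam.get_File_List("./../data/test")
# Number of normal and spam emails in the training set
normFilelen=len(normFileList)
spamFilelen=len(spamFileList)
# Load the stop word list used to filter out stop words
stopList=spam.getStopWords()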
# Matches every non-Chinese character; compiled once and reused for all files
rule=re.compile(r"[^\u4e00-\u9fa5]")
# Count word frequencies over the normal emails
for fileName in normFileList:
    wordsList.clear()
    with open("./../data/normal/"+fileName) as f:
        for line in f:
            # Strip non-Chinese characters
            line=rule.sub("",line)
            # Collect the distinct words of this email in wordsList
            spam.get_word_list(line,wordsList,stopList)
    # Add 1 per word of this email, so wordsDict counts the emails that contain each word
    spam.addToDict(wordsList, wordsDict)
normDict=wordsDict.copy()

# Count word frequencies over the spam emails
wordsDict.clear()
for fileName in spamFileList:
    wordsList.clear()
    with open("./../data/spam/"+fileName) as f:
        for line in f:
            line=rule.sub("",line)
            spam.get_word_list(line,wordsList,stopList)
    spam.addToDict(wordsList, wordsDict)
spamDict=wordsDict.copy()

# Classify the test emails
for fileName in testFileList:
    testDict.clear()
    wordsDict.clear()
    wordsList.clear()
    with open("./../data/test/"+fileName) as f:
        for line in f:
            line=rule.sub("",line)
            spam.get_word_list(line,wordsList,stopList)
    spam.addToDict(wordsList, wordsDict)
    testDict=wordsDict.copy()
    # Compute p(s|w) for each word in the email to find the 15 words that most influence the classification
    wordProbList=spam.getTestWords(testDict, spamDict,normDict,normFilelen,spamFilelen)
    # Combine the probabilities of those 15 words into a Bayesian probability for the email
    p=spam.calBayes(wordProbList, spamDict, normDict)
    if p>0.9:
        testResult.setdefault(fileName,1)
    else:
        testResult.setdefault(fileName,0)
# Compute classification accuracy (test files whose name is a number below 1000 are normal emails)
testAccuracy=spam.calAccuracy(testResult)
for i,ic in testResult.items():
    print(i+"/"+str(ic))
print(testAccuracy)

spamEmail.py

#encoding=utf-8
'''
Created on March 11, 2018

@author: ICMLL
'''
import jieba
import os
class spamEmailBayes:
    # Load the stop word list
    def getStopWords(self):
        stopList=[]
        with open("../data/中文停用词表.txt") as f:
            for line in f:
                # Strip only the line ending, so a last line without a newline keeps its final character
                stopList.append(line.rstrip("\n"))
        return stopList
    # Segment one line of text and add its new words to wordsList
    def get_word_list(self,content,wordsList,stopList):
        # Segmentation result from jieba
        res_list = list(jieba.cut(content))
        for i in res_list:
            if i not in stopList and i.strip()!='':
                if i not in wordsList:
                    wordsList.append(i)

    # Increment the count of each word already in the dictionary; otherwise add it with count 1
    def addToDict(self,wordsList,wordsDict):
        for item in wordsList:
            wordsDict[item]=wordsDict.get(item,0)+1

    # Return the names of the files in a directory
    def get_File_List(self,filePath):
        return os.listdir(filePath)
    
    # Compute p(s|w) for each word in the email and keep the 15 words with the highest values
    def getTestWords(self,testDict,spamDict,normDict,normFilelen,spamFilelen):
        wordProbList={}
        for word,num in testDict.items():
            if word in spamDict and word in normDict:
                # Fraction of spam / normal training emails that contain the word
                pw_s=spamDict[word]/spamFilelen
                pw_n=normDict[word]/normFilelen
                ps_w=pw_s/(pw_s+pw_n)
            elif word in spamDict:
                # Word seen only in spam: use 0.01 in place of a zero probability
                pw_s=spamDict[word]/spamFilelen
                pw_n=0.01
                ps_w=pw_s/(pw_s+pw_n)
            elif word in normDict:
                # Word seen only in normal emails: use 0.01 in place of a zero probability
                pw_s=0.01
                pw_n=normDict[word]/normFilelen
                ps_w=pw_s/(pw_s+pw_n)
            else:
                # Word seen in neither training set: use a fixed probability of 0.47
                ps_w=0.47
            wordProbList[word]=ps_w
        # sorted() returns a new list; keep its first 15 entries and return them as a dict for calBayes
        top=sorted(wordProbList.items(),key=lambda d:d[1],reverse=True)[0:15]
        return dict(top)
    
    # Combine the per-word probabilities into the probability that the email is spam
    def calBayes(self,wordList,spamdict,normdict):
        ps_w=1
        ps_n=1

        for word,prob in wordList.items():
            print(word+"/"+str(prob))
            ps_w*=prob
            ps_n*=(1-prob)
        p=ps_w/(ps_w+ps_n)
        return p

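    # Compute the accuracy of the predictions
    def calAccuracy(self,testResult):
        rightCount=0
        errorCount=0
        for name,category in testResult.items():
            # Drop any file extension before parsing the numeric file name; names below 1000 are normal emails
            fileId=int(os.path.splitext(name)[0])
            if (fileId<1000 and category==0) or (fileId>=1000 and category==1):
                rightCount+=1
            else:
                errorCount+=1
        return rightCount/(rightCount+errorCount)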
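A practical caveat about calBayes: it multiplies probabilities directly, so if it is ever given a long word list, both products can underflow to 0.0 and the final division fails. The sketch below shows one common workaround, which is to sum logarithms instead. It is an alternative under stated assumptions, not part of the original project: the function name `combine_log_space` and the clamping constant `eps` are hypothetical.

```python
import math


def combine_log_space(probs, eps=1e-9):
    """Combine per-word P(spam | word) values like calBayes, but in log space."""
    log_spam = 0.0
    log_norm = 0.0
    for p in probs:
        # Clamp so that log() never receives 0
        p = min(max(p, eps), 1 - eps)
        log_spam += math.log(p)
        log_norm += math.log(1 - p)
    # a / (a + b) with a = exp(log_spam), b = exp(log_norm)
    # equals 1 / (1 + exp(log_norm - log_spam))
    diff = log_norm - log_spam
    if diff > 700:
        # exp() would overflow; the spam probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Same result as the direct product for a short list
short = combine_log_space([0.9, 0.8])
# Still finite for a list long enough to underflow the direct product
long_list = combine_log_space([0.47] * 5000)
```

For a short list the result matches the direct formula, for example 0.72 / 0.74 for the probabilities 0.9 and 0.8.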

Code notes:

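Scope of Python arguments: an argument of an immutable type, such as a string, number, or tuple, cannot be modified by the function; for an argument of a mutable type, such as a list or dictionary, the function cannot change which object the caller's variable points to, but it can modify that object's contents.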
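This is why main.py can pass wordsList and wordsDict into get_word_list and addToDict without using a return value. A small demonstration follows; the function names `rebind` and `mutate` are made up for this illustration.

```python
def rebind(n, words):
    n += 1                 # rebinds the local name only; the caller's int is untouched
    words = words + ["x"]  # builds a new list and rebinds the local name
    return n, words


def mutate(words, counts):
    words.append("x")                      # in-place change, visible to the caller
    counts["x"] = counts.get("x", 0) + 1   # the same holds for dictionaries


n, words, counts = 1, ["a"], {}

rebind(n, words)
after_rebind = (n, list(words))            # caller's objects are unchanged

mutate(words, counts)
after_mutate = (list(words), dict(counts)) # caller sees the appended word and the count
```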
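line = re.sub(r"[\u4e00-\u9fa5]","-",string): argument 1 is the pattern, that is, the text to be matched and replaced; argument 2 is the replacement text; argument 3 is the string to scan.

This is equivalent to:

rule = re.compile(r"[\u4e00-\u9fa5]")

line = rule.sub("-",string)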
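The equivalence can be checked on a short string. The sample text below is made up; the last statement uses the negated character class from main.py, which removes everything except Chinese characters.

```python
import re

text = "Subject: 发票 invoice 123 优惠!"

# One-off call: pattern, replacement, string to scan
replaced = re.sub(r"[\u4e00-\u9fa5]", "-", text)

# Precompiled pattern, same result
rule = re.compile(r"[\u4e00-\u9fa5]")
assert rule.sub("-", text) == replaced

# main.py uses the negated class [^...] to keep only the Chinese characters
chinese_only = re.sub(r"[^\u4e00-\u9fa5]", "", text)
```

Here `replaced` is "Subject: -- invoice 123 --!" and `chinese_only` is "发票优惠".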
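The words in an email are assumed to occur independently of one another, so given that word 1 and word 2 both appear, the probability that the email is spam is as follows, where $P_1 = P(\text{spam} \mid \text{word 1})$, $P_2 = P(\text{spam} \mid \text{word 2})$, and spam and normal email are taken to be equally likely a priori (this is the ratio of products computed in calBayes):

$$P(\text{spam} \mid \text{word 1}, \text{word 2}) = \frac{P_1 P_2}{P_1 P_2 + (1 - P_1)(1 - P_2)}$$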
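As a worked example for two words, the sketch below first computes each word's spam probability the way getTestWords does, then combines the two values with the same ratio of products that calBayes uses, $P_1 P_2 / (P_1 P_2 + (1 - P_1)(1 - P_2))$. The helper names and all counts are hypothetical.

```python
def word_spam_prob(spam_count, norm_count, spam_total, norm_total):
    """P(spam | word) with equal priors, following getTestWords."""
    pw_s = spam_count / spam_total   # fraction of spam emails containing the word
    pw_n = norm_count / norm_total   # fraction of normal emails containing the word
    return pw_s / (pw_s + pw_n)


def combine_two(p1, p2):
    """Probability of spam given both words, assuming independence."""
    return p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))


# Hypothetical counts over 100 spam and 100 normal training emails:
# word 1 appears in 40 spam and 5 normal emails, word 2 in 30 spam and 10 normal emails.
p1 = word_spam_prob(40, 5, 100, 100)    # 0.40 / 0.45 = 8/9
p2 = word_spam_prob(30, 10, 100, 100)   # 0.30 / 0.40 = 3/4
p = combine_two(p1, p2)                 # (2/3) / (2/3 + 1/36) = 24/25
```

With these counts the combined probability is 0.96, which is above the 0.9 threshold used in main.py, so the email would be labeled spam.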

Posted on 2019-05-10 23:50 by 懵懂的菜鸟
