TF-IDF in Practice
The plan is split into the following steps:
1. Write a Python crawler to fetch NetEase (163.com) news
2. Run the crawled text through a word segmenter to build a corpus
3. Use TF-IDF to automatically extract each article's keywords
4. Use TF-IDF to recommend similar articles
step 1a
Today went entirely into the Python crawler; it took quite some effort to get a barely usable version working.
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc

newsLink = set()      # all news links collected so far
processLink = set()   # links being processed in the current round
newLink = set()       # links discovered in the current round
viewedLink = set()    # links that have already been fetched

# Open the given link, find the news links in the page with a regular
# expression, and add them to the global sets.
def getNewsLink(link):
    if(link in viewedLink):
        return
    viewedLink.add(link)
    content = ""
    try:  # this step may throw an exception
        content = urllib.urlopen(link).read().decode('gbk').encode('utf-8')
    except:
        info = sys.exc_info()
        print info[0], ":", info[1]
        print "caused by link : ", link
    # NetEase news links look like http://news.163.com/14/0621/12/9V8V9AL60001124J.html
    m = re.findall(r"news\.163\.com/\d{2}/\d{4}/\d{2}/\w+\.html", content, re.M)
    for i in m:
        url = "http://" + i
        newLink.add(url)
        newsLink.add(url)
    print "crawled %d page, get %d link"%(len(viewedLink), len(newsLink))

# Save the collected news IDs into the database.
def saveNewsIDtoDB():
    newsID = dict()
    for link in newsLink:
        ID = link[31:47]  # the 16-character news ID inside the link
        newsID[ID] = link
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    for (ID, url) in newsID.items():
        sql = "INSERT INTO News(NewsID, Url) VALUES ('%s','%s')"%(ID, url)
        try:
            cursor.execute(sql)
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by sql : ", sql
    conn.commit()
    conn.close()
    print "total get %d news ID"%(len(newsID))

# Crawl until the requested number of news links has been collected.
def readNews(count):
    processLink = set()
    processLink.add("http://news.163.com/")
    while(len(newsLink) < count):
        for link in processLink:
            getNewsLink(link)
        processLink = newLink.copy()
        newLink.clear()

readNews(10000)
saveNewsIDtoDB()
It crawls the specified number of news articles and stores their IDs in the database.
NetEase news does not expose a public API, but its article URLs follow a fixed format.
For example, in http://news.163.com/14/0621/12/9V8V9AL60001124J.html, 14 is the year, 0621 is the month and day, the meaning of the 12 is unclear (but it is always two digits), and the trailing 16-character string is the news ID.
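To illustrate that format, here is a standalone sketch (not part of the crawler) that pulls the date and the 16-character ID out of such a URL with a regular expression; the crawler below gets the same pieces by string slicing (link[31:47] for the ID, Url[20:27] for the date) instead.

# -*- coding: utf-8 -*-
# Sketch: extract the publication date and the 16-character news ID from a NetEase news URL.
import re

url = "http://news.163.com/14/0621/12/9V8V9AL60001124J.html"
m = re.match(r"http://news\.163\.com/(\d{2})/(\d{2})(\d{2})/\d{2}/(\w{16})\.html", url)
if m:
    year, month, day, news_id = m.groups()
    print("20%s-%s-%s %s" % (year, month, day, news_id))  # 2014-06-21 9V8V9AL60001124J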
After running for a few dozen minutes it had collected 10,360 news links.
step 1b
Next, parse each link with BeautifulSoup to get the article's title, body, and publication date.
This took close to an hour and produced 9,714 news records. Nearly a thousand links were lost along the way: some articles had already been deleted, and others had unexpected body markup, so the scraper picked up a pile of JS code and the database insert failed (more on those errors in the note after the code).
That is plenty for this experiment, though.
The parsing code is as follows:
# encoding: utf-8
import re, urllib, sys
import pyodbc, json
import socket
from bs4 import BeautifulSoup
socket.setdefaulttimeout(10.0)

def readNews():
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    sql = "SELECT * FROM News"
    cursor.execute(sql)
    rows = cursor.fetchall()

    updateCount = 0

    for row in rows:  # read the links back from the database
        print row.NewsID, row.Url
        content = ""
        ptime = ""
        title = ""
        body = ""
        newsID = row.NewsID.strip()
        try:  # this step may throw an exception
            content = urllib.urlopen(row.Url).read()  # fetch the page
            ptime = "20" + row.Url[20:22] + "-" + row.Url[23:25] + "-" + row.Url[25:27]  # publication date encoded in the URL
            title, body = analyzeNews(content)        # parse the page into title and body
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by link : ", row.Url
            continue

        sql = "UPDATE News SET Title = '%s', Body = '%s', ptime = '%s' WHERE NewsID = '%s'"%(title, body, ptime, newsID)
        try:  # this step may throw an exception
            cursor.execute(sql)
        except:
            info = sys.exc_info()
            print info[0], ":", info[1]
            print "caused by sql : ", sql
            continue
        updateCount += 1
        if(updateCount % 100 == 0):
            conn.commit()
            print "updated %s rows so far"%(updateCount)
    conn.commit()
    conn.close()
    print "done, %s rows updated in total"%(updateCount)

def analyzeNews(content):
    soup = BeautifulSoup(content, from_encoding="gb18030")
    title = soup.title.get_text()[:-7]    # drop the site suffix from the <title>
    bodyHtml = soup.find(id = "endtext")  # the body div uses one of several ids
    if(bodyHtml == None):
        bodyHtml = soup.find(id = "text")
    if(bodyHtml == None):
        bodyHtml = soup.find(id = "endText")
    body = bodyHtml.get_text()
    body = re.sub("\n+", "\n", body)      # collapse consecutive newlines
    print title
    return title, body

readNews()
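A note on the save errors mentioned above: titles and bodies often contain single quotes, which break the string-formatted SQL. Parameterized queries, which pyodbc supports, would avoid that class of failure; a minimal sketch against the same table and columns:

# Sketch: let pyodbc handle quoting instead of building the SQL string by hand.
sql = "UPDATE News SET Title = ?, Body = ?, ptime = ? WHERE NewsID = ?"
cursor.execute(sql, title, body, ptime, newsID)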
step 2
Use jieba to segment each article and store the word counts in the database, with the title weighted five times as heavily as the body.
I did not expect the database to be this fast: it handles close to ten thousand INSERT statements per second.
The code is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import jieba

# load the stop word list
stop = [line.strip().decode('utf-8') for line in open('chinese_stopword.txt').readlines()]

def readNewsContent():
    conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
    cursor = conn.cursor()
    sql = "SELECT * FROM News"
    cursor.execute(sql)
    rows = cursor.fetchall()

    word_dict = dict()  # document frequency of every word across the corpus

    insert_count = 0
    for row in rows:  # read each article from the database
        content = row.Body
        title = row.Title
        newsID = row.NewsID.strip()
        seg_dict = sliceNews(title, content)  # segment into words

        newsWordCount = 0
        for (word, count) in seg_dict.items():
            newsWordCount += count
            # store this article's word counts
            sql = "INSERT INTO ContentWord(Word, Count, NewsID) VALUES ('%s',%d, '%s')"%(word, count, newsID)
            cursor.execute(sql)
            insert_count += 1
            if(insert_count % 10000 == 0):
                print "inserted %d per-article word count rows"%(insert_count)
            if(word in word_dict):  # maintain word_dict
                word_dict[word] += 1
            else:
                word_dict[word] = 1
        sql = "UPDATE News SET WordCount = '%d' WHERE NewsID = '%s'"%(newsWordCount, newsID)
        cursor.execute(sql)
    conn.commit()
    print "inserted %d per-article word count rows in total"%(insert_count)

    # store word_dict (the document frequencies) in the database
    for (word, count) in word_dict.items():
        sql = "INSERT INTO TotalWord(Word, Count) VALUES ('%s',%d)"%(word, count)
        cursor.execute(sql)
    print "inserted %d corpus-level word rows"%(len(word_dict.items()))
    conn.commit()
    conn.close()

# Segment the input text and return word counts with stop words removed.
def sliceNews(title, content):
    title_segs = list(jieba.cut(title))
    segs = list(jieba.cut(content))
    for i in range(5):  # the title is weighted five times as much as the body
        segs += title_segs

    seg_set = set(segs)
    seg_dict = dict()
    for seg in seg_set:  # drop stop words and count occurrences within this article
        if(seg not in stop and re.match(ur"[\u4e00-\u9fa5]+", seg)):  # keep Chinese words only
            seg_dict[seg] = segs.count(seg)

    return seg_dict

readNewsContent()
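The title weighting in sliceNews is just token repetition. The same idea in isolation looks like the sketch below, using collections.Counter instead of the manual counting above; the sample strings are made up and the stop-word / Chinese-only filtering is omitted.

# -*- coding: utf-8 -*-
# Sketch of the 5x title weighting: segment title and body, repeat the title tokens, count.
import jieba
from collections import Counter

def weighted_counts(title, body, title_weight=5):
    tokens = jieba.lcut(body) + jieba.lcut(title) * title_weight
    return Counter(tokens)

counts = weighted_counts(u"煤矿事故救援", u"救援人员已进入矿井")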
It finished in a few minutes, inserting 1,475,330 per-article word count rows and 135,961 corpus-level word rows.
step 3
Next, compute the TF-IDF value of every word in every article, take the 20 words with the highest TF-IDF values as that article's keywords, and save them to the database.
The code is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import math

conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
cursor = conn.cursor()
newsCount = 0
totalWordDict = dict()

def init():
    # total number of articles
    sql = "SELECT COUNT(*) FROM News"
    cursor.execute(sql)
    row = cursor.fetchone()
    global newsCount
    newsCount = int(row[0])
    # load the corpus-level word table into a dictionary
    sql = "SELECT * FROM TotalWord"
    cursor.execute(sql)
    rows = cursor.fetchall()
    for row in rows:
        totalWordDict[row.Word.strip()] = int(row.Count)

def clean():
    conn.commit()
    conn.close()

# Compute the TF-IDF keywords of every article.
def calcuTFIDF():
    sql = "SELECT * FROM NEWS"  # iterate over all articles
    cursor.execute(sql)
    rows = cursor.fetchall()
    insertCount = 0
    for row in rows:  # compute the TF-IDF keywords of each article
        newsID = row.NewsID.strip()
        keyWordList = calcuKeyWords(newsID)
        for keyWord in keyWordList:  # save the TF-IDF values
            word = keyWord[0]
            value = keyWord[1]
            sql = "INSERT INTO TFIDF(Word, Value, NewsID) VALUES ('%s',%f, '%s')"%(word, value, newsID)
            cursor.execute(sql)
            insertCount += 1
            if(insertCount % 10000 == 0):
                print "inserted %d TF-IDF rows"%(insertCount)
    conn.commit()
    print "inserted %d TF-IDF rows in total"%(insertCount)

# Compute the keywords of a single article.
def calcuKeyWords(newsID):
    newsID = newsID.strip()
    sql = "SELECT * FROM NEWS WHERE NewsID = '%s'"%(newsID)
    cursor.execute(sql)
    newsWordCount = cursor.fetchone().WordCount  # total number of words in this article

    sql = "SELECT * FROM ContentWord WHERE NewsID = '%s'"%(newsID)
    cursor.execute(sql)
    rows = cursor.fetchall()
    tfidf_dict = dict()
    global newsCount
    # build this article's TF-IDF dictionary
    for row in rows:
        word = row.Word.strip()
        count = row.Count
        tf = float(count) / newsWordCount
        idf = math.log(float(newsCount) / (totalWordDict[word] + 1))
        tfidf = tf * idf
        tfidf_dict[word] = tfidf
    # keep the 20 words with the highest TF-IDF
    keyWordList = sorted(tfidf_dict.items(), key=lambda d: d[1])[-20:]
    return keyWordList

init()
calcuTFIDF()
clean()
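To make the formula in calcuKeyWords explicit: for a word w in article d, the code computes tf = count(w, d) / len(d) and idf = log(N / (df(w) + 1)), where N is the number of articles and df(w) is the number of articles containing w. A stripped-down version with made-up numbers:

import math

def tfidf(count_in_doc, doc_length, num_docs, doc_freq):
    tf = float(count_in_doc) / doc_length             # how prominent the word is in this article
    idf = math.log(float(num_docs) / (doc_freq + 1))  # how rare it is across the corpus
    return tf * idf

# e.g. a word appearing 8 times in a 400-word article and in 20 of 9714 articles:
print(tfidf(8, 400, 9714, 20))  # roughly 0.123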
For example, for the article 重庆东胜煤矿5名遇难者遗体全部找到 (bodies of all five victims found at the Dongsheng coal mine in Chongqing),
the keywords the program extracts, ordered from lowest to highest weight, are:
窜\年产\采空区\工人\冒落\东翼\煤约\矸\重庆市\南川\名\顶板\工作面\采煤\找到\遇难者\重庆\遗体\煤矿\东胜
step 4
With the keywords in place, we can build the recommendation.
The procedure is as follows (quoted from Ruan Yifeng's blog):
(1) Use TF-IDF to find the keywords of the two articles.
(2) Take a number of keywords from each article (say 20), merge them into one set, and compute each article's term frequency over that set (relative frequencies can be used to compensate for differences in article length).
(3) This gives each article a term-frequency vector.
(4) Compute the cosine similarity of the two vectors; the larger it is, the more similar the articles.
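Before the database-backed version, here is a minimal, self-contained sketch of steps (2)-(4) with two in-memory word-count dicts (the sample data is made up, not from the crawl):

# -*- coding: utf-8 -*-
import math

def cosine_similarity(counts1, counts2):
    # build vectors over the union of the two keyword sets, then take the cosine
    words = set(counts1) | set(counts2)
    v1 = [counts1.get(w, 0) for w in words]
    v2 = [counts2.get(w, 0) for w in words]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

a = {u"煤矿": 5, u"遇难者": 3, u"遗体": 2}
b = {u"煤矿": 4, u"遇难者": 1, u"爆炸": 2}
print(cosine_similarity(a, b))  # about 0.81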
The full code, working against the database, is as follows:
# -*- coding: utf-8 -*-

import re, urllib, sys
import pyodbc
import math

conn = pyodbc.connect('DRIVER={SQL Server};SERVER=STEVEN-PC\\MSSQLSERVER_R2;DATABASE=TF-IDF;UID=sa;PWD=123456')
cursor = conn.cursor()

def clean():
    conn.commit()
    conn.close()

# Similarity of two articles: the cosine similarity of their keyword vectors.
def similar(newsID1, newsID2):
    newsID1 = newsID1.strip()
    newsID2 = newsID2.strip()
    # union of the two articles' keyword sets
    sql = "SELECT * FROM TFIDF WHERE NewsID = '%s' OR NewsID = '%s'"%(newsID1, newsID2)
    cursor.execute(sql)
    rows = cursor.fetchall()
    wordSet = set()
    for row in rows:
        wordSet.add(row.Word)
    # count how often each keyword occurs in each article, as two vectors
    vector1 = []
    vector2 = []
    for word in wordSet:
        sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'"%(newsID1, word)
        cursor.execute(sql)
        rows = cursor.fetchall()
        if len(rows) == 0:
            vector1.append(0)
        else:
            vector1.append(int(rows[0].Count))
        sql = "SELECT * FROM ContentWord WHERE NewsID = '%s' AND Word = '%s'"%(newsID2, word)
        cursor.execute(sql)
        rows = cursor.fetchall()
        if len(rows) == 0:
            vector2.append(0)
        else:
            vector2.append(int(rows[0].Count))
    return calcuCosDistance(vector1, vector2)

# Cosine similarity of two vectors.
def calcuCosDistance(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for a1, b1 in zip(a, b):
        part_up += a1 * b1
        a_sq += a1 ** 2
        b_sq += b1 ** 2
    part_down = math.sqrt(a_sq * b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down

# Given a news ID, print the most similar articles.
def recommand(newsID):
    limit = 5
    result = dict()
    sql = "SELECT * FROM NEWS"  # iterate over all articles
    cursor.execute(sql)
    rows = cursor.fetchall()

    newsID = newsID.strip()
    calcuCount = 0
    for row in rows:
        calcuCount += 1
        if calcuCount % 200 == 0:
            print "computed similarity for %d article pairs"%(calcuCount)
        if row.NewsID.strip() != newsID:  # skip the article itself
            distance = similar(newsID, row.NewsID)  # similarity of the two articles
            if len(result) < limit:
                result[distance] = row.NewsID
            else:
                minDis = min(result.keys())
                if(minDis < distance):
                    del result[minDis]
                    result[distance] = row.NewsID

    print "input news ID: %s"%(newsID)
    sql = "SELECT * FROM NEWS WHERE NewsID = '%s'"%(newsID)
    cursor.execute(sql)
    row = cursor.fetchone()
    print "input news url: %s"%(row.Url.encode('utf-8'))
    print "input news title: %s"%(row.Title.decode('gb2312').encode('utf-8'))
    print "--------------------------------------"
    for sim, newsID in result.items():
        sql = "SELECT * FROM NEWS WHERE NewsID = '%s'"%(newsID)
        cursor.execute(sql)
        row = cursor.fetchone()
        print "recommended news similarity: %f"%(sim)
        print "recommended news ID: %s"%(row.NewsID.encode('utf-8'))
        print "recommended news url: %s"%(row.Url.encode('utf-8'))
        print "recommended news title: %s"%(row.Title.decode('gb2312').encode('utf-8'))
        print ""

#print similar("2IK789GB0001121M", "2IKJ8KRJ0001121M")
recommand("A4AVPKLA00014JB5")
clean()
input news ID: A4AVPKLA00014JB5
input news url: http://news.163.com/14/0823/10/A4AVPKLA00014JB5.html
input news title: 重庆东胜煤矿5名遇难者遗体全部找到
--------------------------------------
recommended news similarity: 0.346214
recommended news ID: A4BHA5OO0001124J
recommended news url: http://news.163.com/14/0823/15/A4BHA5OO0001124J.html
recommended news title: 安徽淮南煤矿爆炸事故救援再次发现遇难者遗体

recommended news similarity: 0.356118
recommended news ID: 8H0Q439K00011229
recommended news url: http://news.163.com/12/1123/16/8H0Q439K00011229.html
recommended news title: 安徽淮北首富被曝用500万元买通矿难遇难者家属

recommended news similarity: 0.320387
recommended news ID: A3MBB7CF00014JB6
recommended news url: http://news.163.com/14/0815/10/A3MBB7CF00014JB6.html
recommended news title: 黑龙江鸡西煤矿透水事故9人升井 仍有16名矿工被困

recommended news similarity: 0.324280
recommended news ID: 5Q92I93D000120GU
recommended news url: http://news.163.com/09/1211/16/5Q92I93D000120GU.html
recommended news title: 土耳其煤矿发生瓦斯爆炸 19名矿工全部遇难

recommended news similarity: 0.361950
recommended news ID: 6D7J4VLR00014AED
recommended news url: http://news.163.com/10/0804/05/6D7J4VLR00014AED.html
recommended news title: 贵州一煤矿发生煤与瓦斯突出事故
The recommended articles are clearly related to the input.
However, the recommendation traverses the entire database, so the time complexity is very high: recommending for a single article takes roughly ten minutes, which would be unacceptable in real use.
Still, for such a rough experiment I am quite happy with the result.
My takeaway: the algorithm feels almost magical. The code above never needs to understand the actual news content, yet it makes fairly accurate judgments automatically, which is both convenient and fun.
There is still plenty of room for improvement:
When crawling, strip useless boilerplate such as the source and the reporter's name.
Adjust TF-IDF values by word position; for instance, words in the first paragraph or in the first sentence of each paragraph should score somewhat higher.
Roughly categorize the articles so that recommending for one article does not require scanning the whole corpus (see the sketch below).
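On that last point, one cheap way to avoid the full scan is an inverted index from keyword to articles: only articles that share at least one keyword with the input article need to be scored. A sketch of the idea, assuming the per-article keyword lists from step 3 have been loaded into a hypothetical dict keywords_by_news:

from collections import defaultdict

def build_inverted_index(keywords_by_news):
    # keywords_by_news: {newsID: set of keyword strings} loaded from the TFIDF table
    index = defaultdict(set)
    for news_id, words in keywords_by_news.items():
        for w in words:
            index[w].add(news_id)
    return index

def candidate_news(news_id, keywords_by_news, index):
    # only articles sharing at least one keyword are worth a cosine-similarity computation
    candidates = set()
    for w in keywords_by_news[news_id]:
        candidates |= index[w]
    candidates.discard(news_id)
    return candidates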
Reference: Ruan Yifeng's blog, http://www.ruanyifeng.com/blog/2013/03/tf-idf.html