Scraping Douban reviews of 《白夜追凶》 (Day and Night) with Python and segmenting them into words
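Recently, on the recommendation of many friends, I started binge-watching the web series 《白夜追凶》. I hadn't watched a Chinese-made drama since 《琅琊榜》, and this one really is a quality production! I have kept following it, and with nothing to do on the last two days of the National Day holiday, I scraped its Douban reviews to take a look.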
The code has been pushed to GitHub.
My Python projects on GitHub: https://github.com/bytename/learnPy
# -*- coding: utf-8 -*-
from __future__ import print_function

from collections import Counter

import jieba
import requests
from lxml import etree

# Browser-like request headers.
# "br" is left out of Accept-Encoding: requests can only decode Brotli
# responses when an optional brotli package is installed.
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    "Connection": "keep-alive",
    "Host": "movie.douban.com",
    "Referer": "https://movie.douban.com/subject/26883064/reviews?start=20",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}

BASE_URL = "https://movie.douban.com/subject/26883064/reviews?start=%d"


def getPageNum(url):
    """Return the number of review pages, read from the last paginator link."""
    req = requests.get(url, headers=header)
    html = etree.HTML(req.text)
    return int(html.xpath(u"//div[@class='paginator']/a[last()]/text()")[0])


def getContent(url):
    """Return the short review excerpts found on one listing page."""
    req = requests.get(url, headers=header)
    html = etree.HTML(req.text)
    return html.xpath(u"//div[@class='short-content']/text()")


def getUrl(pageNum):
    """Build the listing URLs; each page holds 20 reviews."""
    # Offsets 0, 20, 40, ... cover every page, including the last one.
    return [BASE_URL % (i * 20) for i in range(pageNum)]


if __name__ == '__main__':
    pageNum = getPageNum(BASE_URL % 0)
    counts = Counter()
    for u in getUrl(pageNum):
        for d in getContent(u):
            for word in jieba.cut(d):
                word = word.strip()
                # Skip single characters (mostly punctuation and particles).
                if len(word) > 1:
                    counts[word] += 1
    # Print only the words that occur more than once.
    for key, value in counts.items():
        if value > 1:
            print("%s===%d" % (key, value))
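The pagination step is easy to get wrong by one page, so it can help to check it on its own, without touching the network. The following is only a sketch: `build_page_urls` and its `page_size` parameter are hypothetical names introduced here, and the value of 20 reviews per page is an assumption taken from the `start` offsets used in the script.

```python
def build_page_urls(page_count, page_size=20):
    """Return one review-listing URL per page.

    Offsets run 0, page_size, 2 * page_size, ... so that all
    page_count pages are covered, including the first and the last.
    """
    base = "https://movie.douban.com/subject/26883064/reviews?start=%d"
    return [base % (page * page_size) for page in range(page_count)]


if __name__ == "__main__":
    # With three pages, the offsets are 0, 20 and 40.
    for url in build_page_urls(3):
        print(url)
```

Keeping the URL construction in a pure function like this means it can be tested with plain assertions before any request is sent.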
The script scrapes the reviews, segments them into words, and counts how often each word occurs:
C:\Anaconda2\python.exe D:/PycharmProjects/LearnPy/lesson01/SpriderDouBan.py
Building prefix dict from the default dictionary ...
Loading model from cache c:\users\rc\appdata\local\temp\jieba.cache
Loading model cost 0.379 seconds.
Prefix dict has been built succesfully.
结合体===2
星期一===2
出来===21
第二===2
还要===3
应该===28
刘副队===3
案件===33
发生===7
成分===3
诚然===2
惊喜===7
两天===5
正常===10
全剧===4
看似===2
关系===5
坐等===2
仿佛===2
有理有据===2
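The `word===count` lines above come out in no particular order. One possible way to do the counting step and get the most frequent words first is `collections.Counter` from the standard library. This is a minimal sketch, assuming Python 3 and a list of tokens that has already been produced by `jieba.cut`; `count_words`, `min_len` and `min_count` are hypothetical names, and the sample tokens are made up for illustration rather than taken from the scraped reviews.

```python
from collections import Counter


def count_words(tokens, min_len=2, min_count=2):
    """Count tokens and return (word, count) pairs, most frequent first.

    Tokens shorter than min_len after stripping whitespace are ignored,
    and only words seen at least min_count times are returned.
    """
    cleaned = (t.strip() for t in tokens)
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Made-up stand-in for the output of jieba.cut on a few reviews.
    sample = ["案件", "案件", "的", "惊喜", " 案件 ", "惊喜", "正常"]
    for word, n in count_words(sample):
        print("%s===%d" % (word, n))
```

A `Counter` makes a single pass over the tokens, whereas calling `list.count` once per token rescans the whole list each time, which gets slow as the number of reviews grows.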