Web Scraping Final Project

Python crawler: scraping reviews of 《极简宇宙史》 and building a word cloud

 

Analyze the page source and scrape the data

# -*- coding: utf-8 -*-
# Import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup
import time

# Start with an empty list to collect the review text
sm_list = []
# Request headers
headers = {
    'Host': 'book.douban.com',
    'Referer': 'https://book.douban.com/subject/26697350/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
}
# The book has 1354 reviews at 20 per page, close to 70 pages,
# so format the page number into the URL and fetch each page in turn.
for page in range(1, 70):
    url = 'https://book.douban.com/subject/26697350/comments/hot?p={}'.format(page)
    req = requests.get(url, headers=headers)
    content = req.content
    soup = BeautifulSoup(content, 'lxml')
    article = soup.find('div', class_='article')
    # Named comments rather than list to avoid shadowing the built-in
    comments = article.find_all('p', attrs={'class': 'comment-content'})
    for i in comments:
        shuji_list = i.get_text().replace('\n', '')
        sm_list.append(shuji_list)
    # Wait 1 second between requests
    time.sleep(1)
    print(page)
print(sm_list)
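
The loop bound comes from the review count: 1354 reviews at 20 per page. Here is a minimal sketch of that arithmetic, assuming the `p` parameter is 1-based as in the code above. The names `page_urls`, `BASE_URL` and `per_page` are hypothetical and introduced only for illustration.

```python
import math

BASE_URL = 'https://book.douban.com/subject/26697350/comments/hot?p={}'

def page_urls(total_comments, per_page=20):
    """Return one URL per page needed to cover total_comments."""
    # Round up so a partially filled last page is still requested
    pages = math.ceil(total_comments / per_page)
    return [BASE_URL.format(page) for page in range(1, pages + 1)]

urls = page_urls(1354)
print(len(urls))   # 68
print(urls[-1])    # URL of the last page, ending in p=68
```

With 1354 reviews this gives 68 pages, so `range(1, 70)` in the loop covers all of them with one extra request.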

 

Once the data has been scraped, remove the punctuation and copy all the reviews into a new text file on the desktop.
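
The cleanup step above was done by hand, but it could also be scripted. The sketch below is one possible approach, not the method used in this post: it treats every character in a Unicode punctuation or symbol category as punctuation, and it assumes UTF-8 for the output file. `strip_punctuation` and `save_reviews` are hypothetical helper names.

```python
import unicodedata

def strip_punctuation(text):
    """Replace punctuation and symbol characters with spaces, then collapse whitespace."""
    cleaned = ''.join(
        # Unicode categories starting with P (punctuation) or S (symbol)
        ' ' if unicodedata.category(ch)[0] in ('P', 'S') else ch
        for ch in text
    )
    return ' '.join(cleaned.split())

def save_reviews(reviews, path):
    """Write one cleaned review per line (the UTF-8 encoding is an assumption)."""
    with open(path, 'w', encoding='utf-8') as f:
        for review in reviews:
            f.write(strip_punctuation(review) + '\n')

# Made-up sample reviews, only to show the cleaning step
sample = ['很好,值得一读!', 'Great book... really.']
print([strip_punctuation(s) for s in sample])
```

Replacing punctuation with a space instead of deleting it keeps neighbouring words from being glued together before segmentation.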

 

Generate the word cloud

# Import the three libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

# Open the file and read its contents into the string text
with open('C:\\Users\\Administrator\\Desktop\\jijian.txt') as f:
    text = f.read()
# Segment the text with jieba; cut() returns a generator of words, kept in wordlist_jieba
wordlist_jieba = jieba.cut(text, cut_all=True)
# Join the words into a single space-separated str
wordstr = " ".join(wordlist_jieba)
# Generate the word cloud with wordcloud
# (for Chinese text, WordCloud typically needs font_path set to a font with Chinese glyphs)
my_wordcloud = WordCloud().generate(wordstr)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
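
A word cloud sizes each word by how often it appears, so it can help to look at the counts directly. The sketch below is an illustration, not part of the original program: `count_words` and `min_length` are hypothetical names, and the token list is made-up stand-in data for the output of `jieba.cut`.

```python
from collections import Counter

def count_words(tokens, min_length=2):
    """Count tokens, skipping whitespace-only and very short ones."""
    counts = Counter()
    for token in tokens:
        token = token.strip()
        if len(token) >= min_length:
            counts[token] += 1
    return counts

# Stand-in for the output of jieba.cut(text); these tokens are made up
tokens = ['宇宙', '的', '宇宙', '物理', ' ', '物理', '宇宙']
freqs = count_words(tokens)
print(freqs.most_common(2))  # [('宇宙', 3), ('物理', 2)]
```

A mapping like `freqs` can also be handed to `WordCloud().generate_from_frequencies(freqs)` in place of `generate(wordstr)`, which makes it possible to filter single-character words before drawing.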

After running the program, the generated word cloud looks like this:

 

Posted on 2018-04-24 18:17 by 105李梓维
