Web Scraping Final Project
A Python crawler that scrapes the Douban reviews of 《极简宇宙史》 (The Universe in Your Hand) and builds a word cloud from them.
Analyze the page source and scrape the data
# coding: utf-8
# Import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup
import time

# Start with an empty list to collect the reviews
sm_list = []

# Fill in the request headers
headers = {'Host': 'book.douban.com',
           'Referer': 'https://book.douban.com/subject/26697350/',
           'Upgrade-Insecure-Requests': '1',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

# Fetch each page with requests. There are 1354 reviews at 20 per page,
# just under 70 pages, so format the page number into the URL and loop.
for page in range(1, 70):
    url = 'https://book.douban.com/subject/26697350/comments/hot?p={}'.format(page)
    req = requests.get(url, headers=headers)
    content = req.content
    soup = BeautifulSoup(content, 'lxml')
    article = soup.find('div', class_='article')
    # Renamed from "list" so the builtin type is not shadowed
    comment_list = article.find_all('p', attrs={'class': 'comment-content'})
    for i in comment_list:
        shuji_list = i.get_text().replace('\n', '')
        sm_list.append(shuji_list)
    # Wait one second between requests
    time.sleep(1)
    print(page)

print(sm_list)
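As a side note, hard-coding the page count ties the script to today's 1354 reviews. A minimal defensive variant (a sketch reusing the headers dict, sm_list, and imports from the block above; the 200-page ceiling is an arbitrary assumption) stops on the first page that returns no comments:

for page in range(1, 200):  # arbitrary safety ceiling; the break below exits sooner
    url = 'https://book.douban.com/subject/26697350/comments/hot?p={}'.format(page)
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'lxml')
    comments = soup.find_all('p', attrs={'class': 'comment-content'})
    if not comments:
        break  # an empty page means we ran past the last page of reviews
    sm_list.extend(c.get_text().replace('\n', '') for c in comments)
    time.sleep(1)  # stay polite: one request per second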
Once the data is scraped, delete the punctuation and copy all of the reviews into a newly created text document on the desktop (saved here as jijian.txt, which the next script reads).
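This manual step can also be scripted. A minimal sketch, assuming sm_list still holds the scraped reviews and writing to the same jijian.txt path the next script reads; the regex simply drops every non-word character, which removes Chinese and Western punctuation alike:

import re

# Join the reviews, strip everything that is not a word character
# (punctuation in both Chinese and English), then save as UTF-8.
cleaned = re.sub(r'[^\w]', '', ''.join(sm_list))
with open('C:\\Users\\Administrator\\Desktop\\jijian.txt', 'w', encoding='utf-8') as f:
    f.write(cleaned)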
Generate the word cloud
# Import the three libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

# Open the document and read its contents into the string text
# (adjust the encoding if the file was saved differently)
text = open('C:\\Users\\Administrator\\Desktop\\jijian.txt', encoding='utf-8').read()

# Segment the text with jieba; cut() returns a generator of tokens
wordlist_jieba = jieba.cut(text, cut_all=True)

# Join the tokens into one space-separated string
wordstr = " ".join(wordlist_jieba)

# Generate and display the word cloud
my_wordcloud = WordCloud().generate(wordstr)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
After running the program, the generated word cloud is shown in the figure below:
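One caveat: the default font bundled with wordcloud has no CJK glyphs, so a cloud built from Chinese text can render as hollow boxes. Passing a Chinese font through WordCloud's font_path parameter fixes this; the font path below is an assumption (SimHei ships with Windows):

# font_path must point to a font that contains Chinese glyphs;
# simhei.ttf is assumed to be present in the Windows fonts folder.
my_wordcloud = WordCloud(font_path='C:\\Windows\\Fonts\\simhei.ttf',
                         width=800, height=600,
                         background_color='white').generate(wordstr)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()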