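A complete course project
1. Choose a topic that interests you.
2. Crawl related data from the web.
3. Analyze the text and generate a word cloud.
4. Explain and interpret the results of the text analysis.
5. Write a complete blog post that includes the source code, the crawled data, and the analysis results, so that the work can be presented as a finished piece.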
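1. Choose a topic that interests you
The topic I chose is the short reviews of "Secondhand Time" (《二手时间》) on Douban Books. The page I crawled is: https://book.douban.com/subject/26704403/comments/
2. Fetch the short reviews from the page and write them to the file subjects.txt. The code is as follows: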
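from os import path

import jieba.posseg as pseg           # part-of-speech tagging, used in extract_words()
import matplotlib.pyplot as plt       # displays the word cloud
import requests
from bs4 import BeautifulSoup
from matplotlib.pyplot import imread  # scipy.misc.imread was removed in SciPy 1.2
from wordcloud import WordCloud


def fetch_douban_comments():
    # Download the first page of short reviews.
    r = requests.get('https://book.douban.com/subject/26704403/comments/')
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    # Each short review sits in a <p class="comment-content"> element.
    comments = soup.find_all('p', 'comment-content')
    with open('subjects.txt', 'w', encoding='utf-8') as f:
        for p in comments:
            # get_text() also works when the <p> contains several child nodes,
            # where .string would be None. Write one review per line.
            f.write(p.get_text(strip=True) + '\n')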
The result is shown in the figure below:
3. Analyze the text and generate the word cloud. The code is as follows:
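The extraction step can be checked offline, without hitting Douban. The following is a minimal sketch that uses the standard library's `html.parser` instead of BeautifulSoup, so it is not the code used in this post. `CommentParser` and `SAMPLE_HTML` are hypothetical names, and the sample markup is only an assumption about what a review page roughly looks like.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collect the text of every <p class="comment-content"> element."""

    def __init__(self):
        super().__init__()
        self.comments = []        # finished review strings
        self._in_comment = False  # True while inside a review paragraph
        self._parts = []          # text fragments of the current review

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "comment-content" in classes:
            self._in_comment = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_comment:
            self._in_comment = False
            text = "".join(self._parts).strip()
            if text:
                self.comments.append(text)

    def handle_data(self, data):
        # Text inside nested tags (e.g. <span>) is collected as well.
        if self._in_comment:
            self._parts.append(data)


# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<li class="comment-item">
  <p class="comment-content"><span class="short">A haunting oral history.</span></p>
</li>
<li class="comment-item">
  <p class="comment-content">Hard to read, harder to forget.</p>
</li>
<p class="other">not a review</p>
"""

parser = CommentParser()
parser.feed(SAMPLE_HTML)
comments = parser.comments
print("\n".join(comments))
```

Each string in `comments` corresponds to one line that would be written to subjects.txt.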
def extract_words():
    with open('subjects.txt', 'r', encoding='utf-8') as f:
        comment_subjects = f.readlines()

    # Load the stop-word list, one word per line.
    with open('stopwords.txt', encoding='utf-8') as f:
        stop_words = set(line.strip() for line in f)

    commentlist = []
    for subject in comment_subjects:
        if subject.isspace():
            continue
        word_list = pseg.cut(subject)  # segment into (word, part-of-speech flag) pairs
        for word, flag in word_list:
            # Keep only nouns (flag 'n') that are not stop words.
            if word not in stop_words and flag == 'n':
                commentlist.append(word)
Generate the word cloud:
    # Still inside extract_words(): build the word cloud from the collected nouns.
    d = path.dirname(__file__)
    mask_image = imread(path.join(d, "apple.jpg"))  # image that defines the cloud's shape
    content = ' '.join(commentlist)
    wordcloud = WordCloud(font_path='simhei.ttf', background_color="white",
                          mask=mask_image, max_words=60).generate(content)

    # Display the generated image and save it to disk.
    plt.imshow(wordcloud)
    plt.axis("off")
    wordcloud.to_file('wordcloud.jpg')
    plt.show()


if __name__ == "__main__":
    fetch_douban_comments()
    extract_words()
The generated word cloud:
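Word size in the cloud follows word frequency, so when interpreting the result it can help to look at the raw counts as well. Below is a minimal sketch using `collections.Counter`. The helper `top_words` and the list `sample_nouns` are hypothetical: the list is made-up illustration data standing in for `commentlist`, not the real crawl result.

```python
from collections import Counter

# Made-up stand-in for the nouns collected in commentlist (not real data).
# Glosses: 时代 = era, 历史 = history, 苏联 = Soviet Union.
sample_nouns = ["时代", "历史", "苏联", "时代", "历史", "时代"]


def top_words(words, n=3):
    """Return the n most frequent words together with their counts."""
    return Counter(words).most_common(n)


for word, count in top_words(sample_nouns):
    print(word, count)
```

Calling `top_words(commentlist, 10)` at the end of `extract_words()` would list the ten most frequent nouns, which gives concrete numbers to cite when explaining the picture.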