袁家琳

一个完整的大作业

1.选一个自己感兴趣的主题。

2.网络上爬取相关的数据。

3.进行文本分析,生成词云。

4.对文本分析结果解释说明。

5.写一篇完整的博客,附上源代码、数据爬取及分析结果,形成一个可展示的成果。

 

1.一直都有关注搜狐新闻,这次选取的也是搜狐新闻:http://news.sohu.com

 

2.网络上爬取的相关数据

import requests
from bs4 import BeautifulSoup

url = 'http://news.sohu.com/'
res = requests.get(url)
res.encoding = 'UTF-8'

soup = BeautifulSoup(res.text, 'html.parser')

for news in soup.select('.list16'):
    li = news.select('li')  
    if len(li) > 0:      
        title = li[0].text       
        href = li[0].select('a')[0]['href']
        print(title, href)

输出后获取相关信息

 

3.进行文本分析,生成词云

from os import path  
from scipy.misc import imread    
import jieba  
import sys  
import matplotlib.pyplot as plt  
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator    
text = open('D:\\sohu.txt').read()  
wordlist = jieba.cut(text)  
wl_space_split = " ".join(wordlist)  
d = path.dirname(__file__)  
nana_coloring = imread(path.join(d, "D:\\0.jpg"))  
my_wordcloud = WordCloud( background_color = 'white',    
                            mask = nana_coloring,         
                            max_words = 3000,            
                            stopwords = STOPWORDS,  
                            max_font_size = 70,        
                            random_state = 50,            )  
   
text_dict = {   'you': 2993,   'and': 6625,   'in': 2767,   'was': 2525,   'the': 7845,}
my_wordcloud = WordCloud().generate_from_frequencies(text_dict)

image_colors = ImageColorGenerator(nana_coloring)  
my_wordcloud.recolor(color_func=image_colors)  
plt.imshow(my_wordcloud)    
plt.axis("off")             
plt.show()     
my_wordcloud.to_file(path.join(d, "cloudimg.png"))  

生成云图

由此可见,搜狐新闻最近的热点词语是万圣节,摇号,选房等

 

posted @ 2017-10-31 23:10  sister姐姐  阅读(414)  评论(0编辑  收藏  举报