爬虫大作业

1.主题：

爬取传智播客的技术论坛信息，这里我主要对标题信息进行了爬取，爬取信息之后通过jieba分词生成词云并且进行分析；

2.爬取过程：

第一步：

首先打开传智播客，进入Java技术信息交流版块

第一页：http://bbs.itheima.com/forum-231-1.html

因此爬取java技术信息交流版块标题的所有链接可写为：

 for i in range(2,10):
        pages = i;
        nexturl = 'http://bbs.itheima.com/forum-231-%s.html' % (pages)
        reslist = requests.get(nexturl)
        reslist.encoding = 'utf-8'
        soup_list = BeautifulSoup(reslist.text, 'html.parser')
        for news in soup_list.find_all('a',class_='s xst'):
            print(news.text)
            f = open('wuwencheng.txt', 'a', encoding='utf-8')
            f.write(news.text)
            f.close()

第二步：

获取java技术信息交流版块标题：f12打开开发者工具，通过审查，不难发现，我要找的内容在tbody的类下的a标签里

3.把数据保存成文本：

保存成文本代码：

 for news in soup_list.find_all('a',class_='s xst'):
            print(news.text)
            f = open('wuwencheng.txt', 'a', encoding='utf-8')
            f.write(news.text)
            f.close()

4.生成词云：

def changeTitleToDict():
    f = open("wuwencheng.txt", "r", encoding='utf-8')
    str = f.read()
    stringList = list(jieba.cut(str))
    delWord = {"+", "/", "（", "）", "【", "】", " ", "；", "！", "、"}
    stringSet = set(stringList) - delWord
    title_dict = {}
    for i in stringSet:
        title_dict[i] = stringList.count(i)
    print(title_dict)
    return title_dict



# 生成词云
from PIL import Image,ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator
# 获取上面保存的字典
title_dict = changeTitleToDict()
graph = np.array(title_dict)
font = r'C:\Windows\Fonts\simhei.ttf'
# backgroud_Image代表自定义显示图片，这里我使用默认的
# backgroud_Image = plt.imread("C:\\Users\\jie\\Desktop\\1.jpg")
# wc = WordCloud(background_color='white',max_words=500,font_path=font, mask=backgroud_Image)
wc = WordCloud(background_color='white',max_words=500,font_path=font)
wc.generate_from_frequencies(title_dict)
plt.imshow(wc)
plt.axis("off")
plt.show()

生成的词云图片：

5.遇到的问题：

本来我是这样生成词典的，但是不行

def getWord():
    lyric = ''
    f = open('wuwencheng.txt', 'r', encoding='utf-8')
    # 将文档里面的数据进行单个读取，便于生成词云
    for i in f:
        lyric += f.read()
        print(i)

    #     进行分析

    result = jieba.analyse.textrank(lyric, topK=2, withWeight=True)
    keywords = dict()
    for i in result:
        keywords[i[0]] = i[1]
    print(keywords)

后来，我改了另一种写法，就可以了：

def changeTitleToDict():
    f = open("wuwencheng.txt", "r", encoding='utf-8')
    str = f.read()
    stringList = list(jieba.cut(str))
    delWord = {"+", "/", "（", "）", "【", "】", " ", "；", "！", "、"}
    stringSet = set(stringList) - delWord
    title_dict = {}
    for i in stringSet:
        title_dict[i] = stringList.count(i)
    print(title_dict)
    return title_dict

安装词云出现的问题：

安装wordcloud库时候回发生报错

解决方法是：

安装提示报错去官网下载vc++的工具，但是安装的内存太大只是几个G
去https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud下载whl文件，选取对应python的版本号和系统位数

6.全部代码:

import requests

import jieba.analyse

import jieba
from bs4 import BeautifulSoup
url="http://bbs.itheima.com/forum-231-1.html"
def getcontent(url):
    reslist=requests.get(url)
    reslist.encoding = 'utf-8'
    soup_list = BeautifulSoup(reslist.text, 'html.parser')
    # print(soup_list.select('table .s xst'))
    for news in soup_list.find_all('a',class_='s xst'):
        # print(news.select('tbody')[0].text)wuwencheng
        # a=news.get('a').text
        print(news.text)
        f = open('wuwencheng.txt', 'a', encoding='utf-8')
        f.write(news.text+'  ')
        f.close()
    for i in range(2,10):
        pages = i;
        nexturl = 'http://bbs.itheima.com/forum-231-%s.html' % (pages)
        reslist = requests.get(nexturl)
        reslist.encoding = 'utf-8'
        soup_list = BeautifulSoup(reslist.text, 'html.parser')
        for news in soup_list.find_all('a',class_='s xst'):
            print(news.text)
            f = open('wuwencheng.txt', 'a', encoding='utf-8')
            f.write(news.text)
            f.close()


def changeTitleToDict():
    f = open("wuwencheng.txt", "r", encoding='utf-8')
    str = f.read()
    stringList = list(jieba.cut(str))
    delWord = {"+", "/", "（", "）", "【", "】", " ", "；", "！", "、"}
    stringSet = set(stringList) - delWord
    title_dict = {}
    for i in stringSet:
        title_dict[i] = stringList.count(i)
    print(title_dict)
    return title_dict


# 生成词云
from PIL import Image,ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator
# 获取上面保存的字典
title_dict = changeTitleToDict()
graph = np.array(title_dict)
font = r'C:\Windows\Fonts\simhei.ttf'
# backgroud_Image代表自定义显示图片，这里我使用默认的
# backgroud_Image = plt.imread("C:\\Users\\jie\\Desktop\\1.jpg")
# wc = WordCloud(background_color='white',max_words=500,font_path=font, mask=backgroud_Image)
wc = WordCloud(background_color='white',max_words=500,font_path=font)
wc.generate_from_frequencies(title_dict)
plt.imshow(wc)
plt.axis("off")
plt.show()

posted on 2018-04-24 23:51 217-吴文成阅读(227) 评论(0) 编辑收藏举报

刷新页面返回顶部

wwc--cww

爬虫大作业

导航

公告