Weekly Summary

I.
(1) Project name: classification, analysis, and explanation of hot words in the information field
(2) Functional design:
 Data collection: periodically and automatically crawl hot words related to the information field from the web;
 Data cleaning: clean the hot-word data and use automatic classification to generate a catalog of information-field hot words;
 Hot-word explanation: automatically attach a Chinese explanation to each hot word (based on Baidu Baike or Wikipedia);
 Hot-word citation: mark recent articles or news items that cite a hot word and generate a catalog of hyperlinks that users can click to visit;
 Data visualization:
① display the hot words with a word cloud or heat chart (a minimal sketch follows this list);
② use a relationship graph to show how closely hot words are related.
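The project has not reached the visualization step yet, so here is only a minimal sketch of item ①, assuming the word frequencies produced by the jieba step further below and the third-party wordcloud package; the font path and the sample frequencies are placeholders of mine, not project data. Item ② could be handled analogously with a graph library such as networkx once co-occurrence data exists.

Python code (sketch):

import wordcloud

# Hypothetical frequencies; in the real project these would come from the
# Counter built in the jieba word-frequency script below.
freqs = {'区块链': 42, '人工智能': 37, '5G': 29}
wc = wordcloud.WordCloud(font_path='simhei.ttf',  # a CJK font is needed to render Chinese
                         background_color='white', width=800, height=600)
wc.generate_from_frequencies(freqs)  # build the cloud from word -> count pairs
wc.to_file('hotword_cloud.png')      # write the rendered image to disk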
First, the address I crawl hot words from is the cnblogs news site: https://news.cnblogs.com/n/recommend

Python code:
import requests
import re
import xlwt

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

def get_page(url):
    """Fetch a page and return its HTML, or None on failure."""
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print('Fetched page successfully')
            return response.text
        print('Failed to fetch page')
    except Exception as e:
        print(e)
    return None

f = xlwt.Workbook(encoding='utf-8')
sheet01 = f.add_sheet(u'sheet1', cell_overwrite_ok=True)
sheet01.write(0, 0, '博客最热新闻')  # row 1, column 1: header
# The recommend list starts at page 1, so iterate over pages 1..100.
urls = ['https://news.cnblogs.com/n/recommend?page={}'.format(i) for i in range(1, 101)]
temp = 0  # running count of rows already written
num = 0   # pages processed
for url in urls:
    print(url)
    page = get_page(url)
    if page is None:  # skip pages that failed to download
        continue
    # Extract each news title from its <h2 class="news_entry"> block.
    items = re.findall('<h2 class="news_entry">.*?<a href=".*?" target="_blank">(.*?)</a>', page, re.S)
    print(len(items))
    print(items)
    for i in range(len(items)):
        sheet01.write(temp + i + 1, 0, items[i])
    temp += len(items)
    num += 1
    print("Finished page " + str(num))
print("All done!")
f.save('Hotword.xls')

Screenshot of the crawl results: (image not reproduced here)
Next, I filter the crawled results and pick out the 100 most frequent information-field hot words.

Python code:

import jieba
from collections import Counter

if __name__ == '__main__':
    # Read all of the crawled titles.
    with open("Hotword.txt", "r", encoding='utf-8') as filehandle:
        mystr = filehandle.read()
    seg_list = jieba.cut(mystr)  # default (accurate) segmentation mode
    # Load the stopword list: final.txt holds one stopword per line.
    with open(r'final.txt', encoding='UTF-8') as fstop:
        stopwords = {}.fromkeys(line.rstrip() for line in fstop)
    c = Counter()
    for x in seg_list:
        # Skip stopwords, single characters, and leftover line breaks.
        if x not in stopwords and len(x) > 1 and x != '\r\n':
            c[x] += 1

    print('\nWord-frequency results:')
    for (k, v) in c.most_common(100):  # the 100 most frequent words
        print("%s:%d" % (k, v))

The final.txt used there is a stopword list: words such as "我们" (we), "什么" (what), "中国" (China), and "没有" (don't have) appear frequently in ordinary text but have nothing to do with the information field, so we exclude them first.

final.txt: (screenshot not reproduced here)
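Since the screenshot is not reproduced, the following is a hypothetical excerpt of final.txt, inferred from the words named above and from the one-stopword-per-line format the code reads with line.rstrip():

我们
什么
中国
没有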
Run results: (screenshot not reproduced here)
Then we save them to a txt file and import it into MySQL.
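The import step itself is not shown in this post, so here is a minimal sketch of one way to do it with pymysql; the table name hotword, its columns, the connection credentials, and the input file name and "word:count" line format are all assumptions of mine, not from the original project.

Python code (sketch):

import pymysql

# Assumed schema: CREATE TABLE hotword (name VARCHAR(64), freq INT);
conn = pymysql.connect(host='localhost', user='root', password='root',
                       database='hotwords', charset='utf8mb4')  # hypothetical credentials
try:
    rows = []
    # final_hotword.txt is assumed to hold "word:count" lines,
    # matching the output of the frequency script above.
    with open('final_hotword.txt', encoding='utf-8') as f:
        for line in f:
            word, count = line.strip().split(':')
            rows.append((word, int(count)))
    with conn.cursor() as cursor:
        # Insert all rows in one batch.
        cursor.executemany('INSERT INTO hotword (name, freq) VALUES (%s, %s)', rows)
    conn.commit()
finally:
    conn.close()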

After that we continue crawling: for each hot word, we fetch its explanation from Baidu Baike.

Python source code:

import requests
import re
import xlwt

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

def get_page(url):
    """Fetch a page, force UTF-8 decoding, and return the HTML or None."""
    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            print('Fetched page successfully')
            return response.text
        print('Failed to fetch page')
    except Exception as e:
        print(e)
    return None

f = xlwt.Workbook(encoding='utf-8')
sheet01 = f.add_sheet(u'sheet1', cell_overwrite_ok=True)
sheet01.write(0, 0, '热词')      # row 1, column 1: hot word
sheet01.write(0, 1, '热词解释')  # row 1, column 2: explanation
sheet01.write(0, 2, '网址')      # row 1, column 3: URL
fopen = open('C:\\Users\\hp\\Desktop\\final_hotword2.txt', 'r', encoding='utf-8')
lines = fopen.readlines()
fopen.close()
i = 0  # number of rows written so far
for line in lines:
    word = line.strip()
    url = 'https://baike.baidu.com/item/{}'.format(word)
    print(url)
    page = get_page(url)
    if page is None:  # skip entries that failed to download
        continue
    # Baike puts a summary of the entry in its <meta name="description"> tag.
    items = re.findall('<meta name="description" content="(.*?)">', page, re.S)
    print(items)
    if len(items) > 0:
        sheet01.write(i + 1, 0, word)  # write the word itself so it stays aligned with its row
        sheet01.write(i + 1, 1, items[0])
        sheet01.write(i + 1, 2, url)
        i += 1
    print("Total entries crawled so far: " + str(i))
print("All done!")
f.save('hotword_explain.xls')

