Hot-Word Statistics for Top Conferences - 顺儿


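Requirements: structure the scraped data and analyze the top 10 hot fields or research directions. Support paper search: when the user enters basic information such as a paper ID, title, or keywords, return the related paper, source code, homepage, and similar information.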
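As a rough illustration of the search requirement, the sketch below filters scraped records by a user query. The record layout (keys `title`, `abstract`, `keywords`, `link`) and the function name `search_papers` are assumptions made for this example, and the two records are made-up placeholders.

```python
def search_papers(papers, query):
    """Return records whose title, abstract, or hot words contain the query (case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return []
    hits = []
    for paper in papers:
        haystack = " ".join([paper["title"], paper["abstract"], paper["keywords"]]).lower()
        if needle in haystack:
            hits.append(paper)
    return hits


# Made-up placeholder records, only to show the call
papers = [
    {"title": "Example Paper on Object Detection", "abstract": "A placeholder abstract.",
     "keywords": "object,detection", "link": "http://example.com/a.pdf"},
    {"title": "Example Paper on Image Segmentation", "abstract": "Another placeholder abstract.",
     "keywords": "image,segmentation", "link": "http://example.com/b.pdf"},
]
for hit in search_papers(papers, "detection"):
    print(hit["title"], hit["link"])
```

A plain substring match keeps the sketch short; a real front end would more likely push the filter into a database query.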

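Overall approach:

1. Load the article titles from the scraped data and count word frequencies (meaningless words have to be cleaned out first).

2. Sort the counted words by frequency in descending order and take the top 20%; these are defined as the hot words.

3. Compare the top 20% hot words with each article title and write the hot words a title contains into the corresponding database column, so the front end can fetch them easily.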
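The three steps above could be sketched roughly as follows. This is a minimal sketch, not the post's actual implementation: the stop-word list, the tokenizing regex, and the function names `count_words`, `top_hot_words`, and `hot_words_in_title` are all assumptions, and the sample titles are placeholders.

```python
import re
from collections import Counter

# Assumed stop-word list; the post does not say which words it filters out
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on",
              "with", "to", "via", "by", "from", "using"}
WORD_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def count_words(titles):
    """Step 1: count lowercase words across all titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        for word in WORD_RE.findall(title.lower()):
            if word not in STOP_WORDS and len(word) > 1:
                counts[word] += 1
    return counts


def top_hot_words(counts, fraction=0.2):
    """Step 2: keep the most frequent `fraction` of distinct words (at least one)."""
    keep = max(1, int(len(counts) * fraction)) if counts else 0
    return [word for word, _ in counts.most_common(keep)]


def hot_words_in_title(title, hot_words):
    """Step 3: list the hot words that occur in a single title."""
    words = set(WORD_RE.findall(title.lower()))
    return [w for w in hot_words if w in words]


# Placeholder titles, only to show the calls
titles = [
    "Deep Learning for Image Segmentation",
    "Image Matching with Deep Features",
    "Learning to Match Image Pairs",
]
hot = top_hot_words(count_words(titles))
for t in titles:
    print(t, "->", ",".join(hot_words_in_title(t, hot)))
```

Here the 20% cut is taken over distinct words rather than over total occurrences; the post does not specify which it uses, so treat that as one possible reading.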

Final result:

Each tuple (row) in the database contains the title, the abstract, the hot words found in the title, and the article link.
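The row layout can be pictured with a small table. In this sketch, `sqlite3` stands in for MySQL so that it runs without a server; the column names and the helper `set_keywords` are assumptions (the post only shows that the `cvpr` table has four columns), and the inserted row is a placeholder.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database so the sketch runs without a server
conn = sqlite3.connect(":memory:")
conn.execute(
    # Assumed column names; only the four-column shape comes from the post
    "CREATE TABLE cvpr (title TEXT, abstract TEXT, keywords TEXT, link TEXT)"
)
conn.execute(
    "INSERT INTO cvpr VALUES (?, ?, ?, ?)",
    ("Example Title", "Placeholder abstract.", "", "http://example.com/paper.pdf"),
)


def set_keywords(conn, title, hot_words):
    """Store the hot words found in a title as a comma-separated string."""
    conn.execute("UPDATE cvpr SET keywords = ? WHERE title = ?",
                 (",".join(hot_words), title))
    conn.commit()


set_keywords(conn, "Example Title", ["example"])
row = conn.execute("SELECT title, keywords, link FROM cvpr").fetchone()
print(row)
```

Storing the hot words as one comma-separated string keeps the four-column shape; a separate keyword table would be the more normalized alternative.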

Scraper source code:

import re
import requests
import pymysql

BASE_URL = "http://openaccess.thecvf.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36 Edg/81.0.416.53"}


def insertCvpr(value):
    """Insert one row (title, abstract, hot words, link) into the cvpr table."""
    # Keyword arguments are used because newer pymysql releases no longer accept positional ones
    db = pymysql.connect(host="localhost", user="root", password="root",
                         database="cvprlist", charset="utf8")  # connect to the database
    cursor = db.cursor()
    sql = "insert into cvpr values(%s,%s,%s,%s)"
    try:
        cursor.execute(sql, value)
        db.commit()
        print("Insert succeeded")
    except pymysql.MySQLError as e:
        db.rollback()
        print("Insert failed:", e)
    finally:
        db.close()


res = requests.get(BASE_URL + "ICCV2019.py", headers=HEADERS)
res.encoding = "utf-8"
# Relative links to each paper's detail page
paper_paths = re.findall(r'<dt class="ptitle"><br><a href="(.*?)">.*?</a></dt>', res.text, re.S)
for path in paper_paths:
    # The scraped links are relative paths, so the site root is prepended (same for the PDF link below)
    res = requests.get(BASE_URL + path, headers=HEADERS)
    res.encoding = "utf-8"
    title = re.findall(r'<div id="papertitle">(.*?)</div>', res.text, re.S)
    summary = re.findall(r'<div id="abstract" >(.*?)</div>', res.text, re.S)
    link = re.findall(r'\[<a href="\.\./\.\./(.*?)">pdf</a>\]', res.text, re.S)
    # Some pages may fail to match; indexing an empty list would crash the program
    if title and summary and link:
        clean_title = title[0].replace("\n", "")
        # The third column receives the title here, presumably as a placeholder for the hot words
        insertCvpr((clean_title, summary[0].replace("\n", ""), clean_title, BASE_URL + link[0]))
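
The scraper completes relative links by stripping `../../` in the regex and prepending the site root. As an alternative, `urllib.parse.urljoin` from the standard library can resolve a relative link against the page it was found on. The page and PDF paths below are hypothetical, shaped like the ones the scraper handles, and are not taken from the post.

```python
from urllib.parse import urljoin

# Hypothetical detail-page URL and relative PDF link (assumed layout)
page_url = "http://openaccess.thecvf.com/content_ICCV_2019/html/Example_Paper.html"
relative_pdf = "../../content_ICCV_2019/papers/Example_Paper.pdf"

# urljoin resolves the ../ segments against the directory of page_url
pdf_url = urljoin(page_url, relative_pdf)
print(pdf_url)
```

With this approach the regex could capture the `href` value unchanged instead of hard-coding the `../../` prefix.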

posted on 2020-04-15 13:52 顺儿