Scopus论文数据爬虫

Scopus是一家文献数据库。它囊括有全球5000多家在科学、技术、医学和社会科学等领域的出版商。

首先爬取Scopus论文数据需要注册一个 elsevier 开发者账号，因为所有API都需要key来访问。API的列表可以查看

https://dev.elsevier.com/api_docs.html

这里有一个需要注意的是

普通的api只能爬取5000条数据，当超过5000条数据的时候，可以通过加cursor=*来获取

Elsevier Developer Portaldev.elsevier.com

正常情况我们会使用python来爬取，这里推荐一个很好用的package “pybliometrics”

pybliometrics: Python-based API-Wrapper to access Scopuspybliometrics.readthedocs.io

作为工具来爬取信息

pip install pybliometrics

接下来就是代码实现了, 再通过publication doi 搜取文章信息的时候，可能会遇到搜索不到的情况

import pybliometrics
from pybliometrics.scopus import AuthorRetrieval
# pybliometrics.scopus.utils.create_config() 配置key
# retrieval
a = AuthorRetrieval('37055346800')
print(a.eid)
print(a.document_count)


#search
from pybliometrics.scopus import AuthorSearch
b = AuthorSearch('AUTHLAST(Selten) and AUTHFIRST(Reinhard)', refresh=True)
print(b)

#search article information
# 此处的文章搜索不到
from pybliometrics.scopus import ScopusSearch
try:
    # a = ScopusSearch('10.1016/S0001-8791(02)00059-3')
    # 更好的文章搜索方式
    a = ScopusSearch('DOI(10.1016/S0001-8791(02)00059-3)')
    print(a.results, sys.argv[2])
except:
    print('a' in locals().keys())

# pybliometrics.scopus.utils.create_config() 配置key

第一次去掉注释,配置APIKey， InstToken不需要设置

vi ~/.scopus/config.ini

[Authentication]
APIKey = 45c21b56a471de9ae547070ca94ab829
InstToken =

cat ~/.scopus/config.ini

当key过期或者超过5000次requests之后，需要更新key

错误码

pybliometrics.scopus.exception.Scopus429Error: QUOTA EXCEEDED

问题：针对某一检索式，scopus数据库导出csv的数据情况为：前2000条数据可以按照勾选的字段导出；前20000条数据只能给出引文信息，且通过邮箱发送。
输入检索式，笔者现在需要17万多文献数据，且所需的信息不只是引文信息，包含以下字段（涉及引文信息、题录信息、摘要和关键字），如下图。
在这里插入图片描述
采取的方法为：
第一步：按照年份进行精简，因为每年的数据都小于2万条，所以每次均可完整地通过邮箱的方式获取到只有引文信息的文献；
第二步，根据引文信息中的链接对每篇文章的摘要、索引关键字、作者关键字等字段进行爬虫，code如下。

# -*- coding: utf-8 -*-
#  爬取scopus的详细信息

import importlib,sys
importlib.reload(sys)
import requests
from lxml import etree
import time

#------将结果写入文件--------
import csv
res = open("content_2014.csv","a",encoding='utf-8',newline='')
writer = csv.writer(res)
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
headers = {"User-Agent": user_agent,"Connection": "close"}  # 请求头,headers是一个字典类型
#--------------------第一步：读取数据----------------------------
word = []
f = open("/Users/sunmengge/Desktop/scorpus/scopus_2014.csv", encoding='utf-8')
lines = f.readlines()
for i in range(1,len(lines)):
    print(i)
    content = []
    url = lines[i].split(",")[-7]
    print(url)
    content.append(url)
    # --------------------第二步：获取每一网页数据----------------------------
    while 1:
        try:
            print("start")
            html = requests.get(url, headers=headers).text
            break
        except:
            print("connection refused by the server..")
            print("let me sleep for 5seconds")
            time.sleep(5)
            print("it is a nice sleep,now let me continue")
            continue
    # --------------------第三步：解析每一网页数据----------------------------
    selector = etree.HTML(html)
    #----------------title-------------------
    name = selector.xpath('//*[@id="profileleftinside"]/div[2]/h2/text()')
    print(name)
    if name == []:
        i = i-2
        continue

    content.append(str(name[0]).strip("\n"))
    #-------------abstract----------------
    abstract = selector.xpath('//*[@id="abstractSection"]/p/text()')
    print(abstract)
    content.append(str(abstract).strip("\n"))
    #-------------authorwords----------------
    authorwords = selector.xpath('/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/div[3]/div[3]/div[1]/div[1]/div[2]/div[2]/section[8]/span/text()')
    print(authorwords)
    content.append(str(authorwords).strip("\n"))
    #-------------indexkeywords----------------
    if authorwords == []:
        section = 8
    else:
        section = 9
    indexkeywords = selector.xpath(
        '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/div[3]/div[3]/div[1]/div[1]/div[2]/div[2]/section[%d]/table/tr/th/text()'%section)
    if indexkeywords == []:
        section = 5
    indexkeywords = selector.xpath(
        '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/div[3]/div[3]/div[1]/div[1]/div[2]/div[2]/section[%d]/table/tr/th/text()' % section)
    print(indexkeywords)
    for j in range(len(indexkeywords)):
        k = selector.xpath(
        '/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/div[3]/div[3]/div[1]/div[1]/div[2]/div[2]/section[%(sec)d]/table/tr[%(i)d]/td/span/text()'%{"sec":section,"i":(j+1)})
        print(indexkeywords[j])
        print(k)
        content.append(str(indexkeywords[j]).strip("\n"))
        content.append(str(k).strip("\n"))
    writer.writerow(content)
res.close()

第三步：分年在并行跑以上代码。大约每条文献url的内容获取及解析需要6秒。
希望对大家有帮助！

posted on 2020-05-29 18:45 宏宇阅读(2049) 评论(0) 编辑收藏举报