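Scraping Real-Time COVID-19 Data with Python

1. Objectives

• Understand what fetching a web page means and the basic components of a URL;

• Master the design and implementation of classes, functions, and modules;

• Master the principles of web crawlers;

• Understand Unicode encoding.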

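2. Lab Content

In this lab you write a Python web crawler that scrapes the real-time COVID-19 big-data report provided by Baidu and extracts the current epidemic situation in China as well as in other countries and regions.

Fields scraped (domestic):

• Province

• Cumulative confirmed

• Deaths

• Recovered

• Currently confirmed

• Cumulative confirmed increment

• Death increment

• Recovered increment

• Currently confirmed increment

Fields scraped (foreign):

• Country

• Cumulative confirmed

• Deaths

• Recovered

• Currently confirmed

• Cumulative confirmed increment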

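3. Key Concepts

• Basic Python syntax;

• Basic principles of web crawlers;

• Parsing HTML pages and URLs;

• Fetching web pages;

• Using XPath to extract key information and filter content;

• Understanding Unicode encoding.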

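4. Duration

4 class hours in total:

• Fetching the page (1 class hour)

• Scraping the latest domestic epidemic data (1.5 class hours)

• Scraping the latest epidemic data for other countries and regions (1.5 class hours)

5. Lab Environment

• Dual-core CPU, 4 GB RAM, 20 GB disk

• Windows 10 operating system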

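6. Lab Analysis

(1) Fetch the latest information from https://voice.baidu.com/act/newpneumonia/newpneumonia

(2) Examine the data and extract the domestic epidemic data:

• The data is embedded in a script tag; use XPath to retrieve it

• Import the required module: from lxml import etree

• Build an HTML object and parse it

• The XPath query returns a list; its first item contains all of the data

• Next, get the content of component: use the json module to convert the string into a dictionary (a Python data structure)

• To obtain the domestic data, locate caseList within component

• Store the domestic data in an Excel spreadsheet

• The scraped domestic data looks like this:

(3) Examine the data and extract the foreign epidemic data:

• Use the openpyxl module: import openpyxl

• First create a workbook, then create a worksheet in that workbook

• Next, name the worksheet and set its attributes

• Store the foreign data in Excel

• The foreign data is found in globalList within component

• Then create the sheets in the Excel workbook, one per continent

• The scraped foreign data looks like this:

• The sheets in the Excel workbook, one per continent: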

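Page Analysis

The data scraped is the latest epidemic data provided by Baidu (updated daily). It is divided into domestic data and data for other countries.

Real-time epidemic big-data report: https://voice.baidu.com/act/newpneumonia/newpneumonia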
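
Since one objective of this lab is to understand the basic components of a URL, the report address above can be split into its parts with the standard library. This is only an illustrative sketch; the variable names are chosen here for the example.

```python
from urllib.parse import urlsplit

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
parts = urlsplit(url)

print(parts.scheme)  # protocol, here "https"
print(parts.netloc)  # host name, here "voice.baidu.com"
print(parts.path)    # resource path on that host
print(parts.query)   # empty string, because this URL has no "?key=value" part
```

The scheme tells the client which protocol to speak, the host identifies the server, and the path selects the page on that server.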

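Domestic data (latest situation in each province):

Foreign data (latest situation in each country):

The domestic data to scrape is the latest data for each province, so we parse the page source with Python and locate caseList within component to identify the domestic data.

Inspecting the page source shows that the area field holds the province, confirmed is the cumulative number of confirmed cases, died is the cumulative number of deaths, crued is the number of recovered cases (the key is spelled crued in the page data), and so on.

The province names that follow the area field are written as Unicode escape sequences and can be decoded into Chinese, for example:

[{"confirmed":"1","died":"0","crued":"1","relativeTime":"1624550400","confirmedRelative":"0","diedRelative":"0","curedRelative":"0","asymptomaticRelative":"0","asymptomatic":"0","nativeRelative":"0","curConfirm":"0","curConfirmRelative":"0","overseasInputRelative":"","icuDisable":"1","area":"\u897f\u85cf","subList":

\u897f\u85cf decodes to 西藏 (Tibet).

Likewise, the city field that appears under each province entry holds the city name as Unicode escapes, which can also be decoded into Chinese.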
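
The decoding described above does not need a separate conversion step: a JSON parser turns \uXXXX escapes into the corresponding characters automatically. A minimal sketch, using a shortened record made up from two of the keys in the sample above:

```python
import json

# Raw string, so the backslashes reach the JSON parser unchanged
raw = r'{"area": "\u897f\u85cf", "confirmed": "1"}'

record = json.loads(raw)
print(record["area"])  # the two escapes become the two characters of 西藏

# Going the other way, json.dumps escapes non-ASCII characters by default
escaped = json.dumps(record["area"])
readable = json.dumps(record["area"], ensure_ascii=False)
print(escaped)
print(readable)
```

This is why the later steps can use each['area'] directly once json.loads has run.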

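The foreign data to scrape is the latest data for each country, so we parse the page source with Python and locate globalList within component to identify the foreign data.

Here the area field denotes the continent and country denotes the country, for example:

{"area":"\u4e9a\u6d32","subList":

\u4e9a\u6d32 decodes to 亚洲 (Asia).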
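
To make the nesting concrete, the sketch below walks a tiny hand-written stand-in for globalList. The structure (area, subList, country, confirmed) follows the description above; the countries and numbers are invented placeholders, not real figures.

```python
import json

# Hypothetical, heavily shortened sample; all values are placeholders
sample = r'''
[
  {"area": "\u4e9a\u6d32", "subList": [
      {"country": "\u65e5\u672c", "confirmed": "10"},
      {"country": "\u97e9\u56fd", "confirmed": "20"}
  ]},
  {"area": "\u6b27\u6d32", "subList": [
      {"country": "\u6cd5\u56fd", "confirmed": "30"}
  ]}
]
'''

global_list = json.loads(sample)

# Map each continent name to the list of country names inside its subList
countries_by_continent = {
    block["area"]: [entry["country"] for entry in block["subList"]]
    for block in global_list
}

for continent, countries in countries_by_continent.items():
    print(continent, countries)
```

Each top-level item is one continent, and each item in its subList is one country, which is the shape the export step relies on when it creates one sheet per continent.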

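Scraping the Latest Epidemic Data from Baidu

Step 1: Fetch the page

Open PyCharm from the desktop, create a new project named NCP_spider, and then create NCP_spider.py in the project.

Fetch the Baidu real-time epidemic big-data report page:

import requests
# Address of the report page
url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
# Send a GET request; the page source is then available as response.text
response = requests.get(url)

Note: if the import fails, hover over the name underlined in red; when the error hint appears, click Install package requests to install the package.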
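
If the requests package cannot be installed in the lab environment, the page can in principle be fetched with the standard library instead. The following is only a sketch of that alternative, not the method used in the rest of this lab: the helper names, the User-Agent value, the 10-second timeout, and the assumption that the page is UTF-8 encoded are all choices made for this example.

```python
from urllib.request import Request, urlopen

URL = "https://voice.baidu.com/act/newpneumonia/newpneumonia"


def build_request(url):
    """Create a GET request carrying a browser-like User-Agent (assumed value)."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


def fetch_html(url, timeout=10):
    """Download the page and return its body as text (assumes UTF-8)."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Calling fetch_html performs the actual network request:
# html_text = fetch_html(URL)
```

The returned text plays the same role as response.text in the requests version.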

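Step 2: Parse the page and build an HTML object

Observe the characteristics of the data:

• The data is embedded in a script tag; use XPath to retrieve it.

• Import the required module: from lxml import etree

• Build an HTML object and parse it

• The XPath query returns a list; its first item contains all of the data

• Next, get the content of component: use the json module to convert the string into a dictionary (a Python data structure)

• To obtain the domestic data, locate caseList within component

The code is as follows:

from lxml import etree
import json
# Build an HTML object from the page source
html = etree.HTML(response.text)
# The data sits in a <script type="application/json"> element
result = html.xpath('//script[@type="application/json"]/text()')
# xpath() returns a list; the first item holds the whole JSON string
result = result[0]
# json.loads() converts a JSON string into Python data types
result = json.loads(result)
# caseList holds the domestic (per-province) records
result_in = result['component'][0]['caseList']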
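
To try the "JSON inside a script tag" idea offline, the same extraction can be sketched on a tiny hand-written page. Note that this sketch swaps in the standard library's html.parser in place of lxml and XPath; the class name and the mock page are invented for illustration, and the real page is far larger.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the text of every <script type="application/json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last block
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# Mock page with the same component -> caseList nesting as described above
page = (
    '<html><body><script type="application/json">'
    '{"component": [{"caseList": [{"area": "\\u897f\\u85cf", "confirmed": "1"}]}]}'
    '</script></body></html>'
)

parser = JsonScriptExtractor()
parser.feed(page)
parser.close()

result = json.loads(parser.blocks[0])
case_list = result["component"][0]["caseList"]
print(case_list[0]["area"], case_list[0]["confirmed"])
```

As in the lxml version, the first collected block is the JSON string, and json.loads turns it into nested dictionaries and lists.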

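Step 3: Get the latest domestic epidemic data

• Store the domestic data in an Excel spreadsheet:

• Use the openpyxl module: import openpyxl

• First create a workbook, then create a worksheet in that workbook

• Next, name the worksheet and set its attributes

The code is as follows:

import openpyxl
# Create a workbook
wb = openpyxl.Workbook()
# Use the workbook's default worksheet for the domestic data
ws = wb.active
ws.title = "国内疫情"  # "Domestic epidemic"
# Header row, in the same order as the field list below
ws.append(['省份', '累计确诊', '死亡', '治愈', '现有确诊', '累计确诊增量', '死亡增量', '治愈增量', '现有确诊增量'])
# Field meanings:
# area --> usually a province
# confirmed --> cumulative confirmed
# died --> deaths
# crued --> recovered
# curConfirm --> currently confirmed
# confirmedRelative --> cumulative confirmed increment
# diedRelative --> death increment
# curedRelative --> recovered increment
# curConfirmRelative --> currently confirmed increment
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'],
                 each['confirmedRelative'], each['diedRelative'], each['curedRelative'],
                 each['curConfirmRelative']]
    # Replace empty strings with '0'
    temp_list = ['0' if value == '' else value for value in temp_list]
    ws.append(temp_list)
wb.save('./data.xlsx')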
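
The loop above indexes every key directly, so a record that lacks one of the keys would raise a KeyError. A slightly more defensive way to build a row can be sketched as follows; build_row and DOMESTIC_FIELDS are hypothetical names introduced for this example, and the sample record is adapted from the one shown in the page analysis, with one value blanked and two keys removed on purpose.

```python
DOMESTIC_FIELDS = ["area", "confirmed", "died", "crued", "curConfirm",
                   "confirmedRelative", "diedRelative", "curedRelative",
                   "curConfirmRelative"]


def build_row(record, fields, default="0"):
    """Return the values of `fields` from `record`, using `default`
    when a key is missing or its value is empty."""
    return [record.get(name) or default for name in fields]


# Adapted sample: confirmedRelative is empty, the last two keys are absent
sample = {"area": "西藏", "confirmed": "1", "died": "0", "crued": "1",
          "curConfirm": "0", "confirmedRelative": "", "diedRelative": "0"}

row = build_row(sample, DOMESTIC_FIELDS)
print(row)
```

The expression record.get(name) or default treats both a missing key and an empty string as "no value", which matches the replacement of '' by '0' in the lab code. It assumes the values are strings, as they are in the sample shown earlier.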

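Step 4: Get the latest foreign epidemic data

• Store the foreign data in Excel:

• The foreign data is found in globalList within component

• Then create the sheets in the Excel workbook, one per continent

The code is as follows:

# globalList holds the foreign records, grouped by continent
data_out = result['component'][0]['globalList']
for each in data_out:
    # The continent name becomes the sheet title
    sheet_title = each['area']
    # Create a new worksheet for this continent
    ws_out = wb.create_sheet(sheet_title)
    # Header row: country, cumulative confirmed, deaths, recovered,
    # currently confirmed, cumulative confirmed increment
    ws_out.append(['国家', '累计确诊', '死亡', '治愈', '现有确诊', '累计确诊增量'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'], country['crued'],
                     country['curConfirm'], country['confirmedRelative']]
        # Replace empty strings with '0'
        list_temp = ['0' if value == '' else value for value in list_temp]
        ws_out.append(list_temp)
wb.save('./data.xlsx')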
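
The sheet titles above come straight from the area field. Excel limits worksheet names to 31 characters and does not allow the characters \ / ? * [ ] : in them. The continent names seen in this lab are short and unaffected, but if the source data ever contained such a value, a small guard like the following could be applied before calling create_sheet. The function name and the fallback title are assumptions made for this sketch.

```python
import re

# Characters Excel does not allow in worksheet names
_INVALID_CHARS = re.compile(r'[\\/*?:\[\]]')


def safe_sheet_title(name, fallback="Sheet"):
    """Replace disallowed characters, trim to 31 characters,
    and fall back to a default if nothing is left."""
    cleaned = _INVALID_CHARS.sub("_", name).strip()
    return cleaned[:31] or fallback


print(safe_sheet_title("亚洲"))
print(safe_sheet_title("A/B:C"))
```

With this helper, the call in the loop would become wb.create_sheet(safe_sheet_title(each['area'])).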

Step 5: Complete example code

import requests
from lxml import etree
import json
import openpyxl

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
# print(response.text)  # uncomment to inspect the raw page source
# Build an HTML object from the page source
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
# xpath() returns a list; the first item holds the whole JSON string
result = result[0]
# json.loads() converts a JSON string into Python data types
result = json.loads(result)
# Create a workbook
wb = openpyxl.Workbook()
# Use the workbook's default worksheet for the domestic data
ws = wb.active
ws.title = "国内疫情"  # "Domestic epidemic"
# Header row, in the same order as the field list below
ws.append(['省份', '累计确诊', '死亡', '治愈', '现有确诊', '累计确诊增量', '死亡增量', '治愈增量', '现有确诊增量'])
result_in = result['component'][0]['caseList']
data_out = result['component'][0]['globalList']
# Field meanings:
# area --> usually a province
# confirmed --> cumulative confirmed
# died --> deaths
# crued --> recovered
# curConfirm --> currently confirmed
# confirmedRelative --> cumulative confirmed increment
# diedRelative --> death increment
# curedRelative --> recovered increment
# curConfirmRelative --> currently confirmed increment
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'], each['curConfirm'],
                 each['confirmedRelative'], each['diedRelative'], each['curedRelative'],
                 each['curConfirmRelative']]
    # Replace empty strings with '0'
    temp_list = ['0' if value == '' else value for value in temp_list]
    ws.append(temp_list)
# Get the foreign epidemic data
for each in data_out:
    # The continent name becomes the sheet title
    sheet_title = each['area']
    # Create a new worksheet for this continent
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['国家', '累计确诊', '死亡', '治愈', '现有确诊', '累计确诊增量'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'], country['crued'],
                     country['curConfirm'], country['confirmedRelative']]
        # Replace empty strings with '0'
        list_temp = ['0' if value == '' else value for value in list_temp]
        ws_out.append(list_temp)
wb.save('./data.xlsx')

Run the program; a data.xlsx file is generated in the project directory.

The results look like this:

• Latest domestic epidemic data: