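Sogou Epidemic Data Scraping (Python)

Last week I shared how to scrape Sogou's epidemic data with R; this time I'm sharing the Python version.

No preamble, straight to the code. If you have any questions, feel free to discuss them in the comments.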

from urllib import request
from lxml import etree
import re
import pandas as pd
import json

url="http://sa.sogou.com/new-weball/page/sgs/epidemic?type_page=WEB"
response = request.urlopen(url)      # send the request

html = response.read()  # read the raw response bytes
html = html.decode("utf-8")  # decode the bytes as UTF-8

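xml = etree.HTML(html)
# The page state is embedded as JavaScript in the first <script> under <body>
datas = xml.xpath('//html/body/script[1]/text()')

# Strip the JavaScript assignment prefix so that only the JSON text remains
datas=re.sub('window.type_page = \"WEB\"\n      window.__INITIAL_STATE__ = ',"",datas[0])
# Parse the remaining JSON text into a Python dict
json_data = json.loads(datas)

area=json_data["data"]["area"]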

citytempdate = []
provincetempdate = []
for i in area:
    provinceShortName = i["provinceShortName"]
    confirmedCount = i["confirmedCount"]
    curedCount = i["curedCount"]
    deadCount = i["deadCount"]
    provincetempdate.append([provinceShortName,confirmedCount,curedCount,deadCount])
    for j in i["cities"]:
        cityName = j["cityName"]
        confirmedCount=j["confirmedCount"]
        curedCount=j["curedCount"]
        deadCount=j["deadCount"]
        citytempdate.append([provinceShortName,cityName,confirmedCount,curedCount,deadCount])
  
dt_city = pd.DataFrame(citytempdate,columns=["PROVINCESHORTNAME","CITYNAME","CONFIRMEDCOUNT","CUREDCOUNT","DEADCOUNT"])
dt_province = pd.DataFrame(provincetempdate,columns=["PROVINCESHORTNAME","CONFIRMEDCOUNT","CUREDCOUNT","DEADCOUNT"])
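
The `re.sub` step above depends on the exact whitespace in the page's script text, so it stops working if the layout shifts slightly. As a minimal sketch of a more tolerant alternative (not the method used in this post), the JSON can be located by its marker and parsed with `json.JSONDecoder.raw_decode`, which ignores anything after the JSON value. The helper name `extract_initial_state` and the sample string are made up for illustration; the real page content may differ.

```python
import json


def extract_initial_state(script_text, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker` inside a script's text.

    Hypothetical helper: assumes the script contains `marker = <JSON>`.
    """
    start = script_text.find(marker)
    if start == -1:
        raise ValueError("marker not found in script text")
    # Move to the first character after the '=' sign
    idx = script_text.index("=", start + len(marker)) + 1
    # raw_decode does not skip leading whitespace, so do it here
    while script_text[idx].isspace():
        idx += 1
    # raw_decode parses one JSON value and ignores whatever follows it
    state, _end = json.JSONDecoder().raw_decode(script_text, idx)
    return state


# Made-up sample that mimics the layout the regex above expects
sample_script = (
    'window.type_page = "WEB"\n'
    '      window.__INITIAL_STATE__ = {"data": {"area": []}};'
)
state = extract_initial_state(sample_script)
area_sample = state["data"]["area"]
```

With the real page, `extract_initial_state(datas[0])` would stand in for the `re.sub` and `json.loads` lines, assuming the script still assigns the state to `window.__INITIAL_STATE__`.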


City-level data (partial) is shown below:

 PROVINCESHORTNAME CITYNAME CONFIRMEDCOUNT CUREDCOUNT DEADCOUNT
0 湖北 武汉 41152 3507 1309
1 湖北 孝感 3279 449 70
2 湖北 黄冈 2831 839 78
3 湖北 荆州 1501 305 37
4 湖北 鄂州 1274 244 35
5 湖北 随州 1267 140 24
6 湖北 襄阳 1155 151 20
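
The city-level table can also be built without the nested loop. The following is a rough sketch using `pd.json_normalize`, which expands the nested `cities` list into rows and carries the province name along as a metadata column. The records here are made-up placeholders in the same shape as `json_data["data"]["area"]`, not real figures.

```python
import pandas as pd

# Made-up records in the same shape as json_data["data"]["area"]
area = [
    {
        "provinceShortName": "A",
        "confirmedCount": 10, "curedCount": 4, "deadCount": 1,
        "cities": [
            {"cityName": "A1", "confirmedCount": 7, "curedCount": 3, "deadCount": 1},
            {"cityName": "A2", "confirmedCount": 3, "curedCount": 1, "deadCount": 0},
        ],
    },
    {
        "provinceShortName": "B",
        "confirmedCount": 5, "curedCount": 2, "deadCount": 0,
        "cities": [
            {"cityName": "B1", "confirmedCount": 5, "curedCount": 2, "deadCount": 0},
        ],
    },
]

# One row per city; only the province name is taken from the parent record,
# so it cannot clash with the city's own count fields
dt_city = pd.json_normalize(area, record_path="cities", meta=["provinceShortName"])

# Reorder and upper-case the columns to match the table above
dt_city = dt_city[
    ["provinceShortName", "cityName", "confirmedCount", "curedCount", "deadCount"]
]
dt_city.columns = [name.upper() for name in dt_city.columns]
print(dt_city)
```

Only `provinceShortName` is passed as `meta` because the province-level count fields share their names with the city-level ones.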

Province-level totals (partial) are shown below:

 PROVINCESHORTNAME CONFIRMEDCOUNT CUREDCOUNT DEADCOUNT
0 湖北 58182 6693 1696
1 广东 1322 524 4
2 河南 1246 509 16
3 浙江 1171 507 0
4 湖南 1006 498 3
5 安徽 973 280 6
6 江西 930 275 1
7 江苏 626 258 0
8 重庆 552 211 5
9 山东 541 191 2

Please credit the source when reposting:

WeChat official account: 数据志

Jianshu: 数据志

cnblogs: https://www.cnblogs.com/wheng/

CSDN: https://blog.csdn.net/wzgl__wh
GitHub (data and code): https://github.com/hellowangheng/datazhi/tree/master/2019-nCoV

posted @ 2020-02-22 16:08  王亨