寒假大数据学习笔记十三

　　今天接着昨天，写出了一个crawlspider爬取山西省卫健委官网数据的小例子，当然依旧是json数据存储，并且也没有直接做成数据，只是字符串。

　　爬取的还算成功，但中间出了一点岔子：在最近两天的官网公布疫情感染人数上不再是写出来了，而是直接放图片！！！你说要是表格也就算了，山西省卫健委直接将一张图片上传上去了……这让我怎么分析数据？我上网找了找，倒是找见一个关于图像识别表格数据的博客，但明显对于我来说短时间之内是完不成的，这还真是够麻爪的……算了，一步步来，先把相关数据爬取下来，需要处理时再说，大不了换一个网站爬取嘛。代码如下：

spider.py:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SxgzbdUpgradeSpider(CrawlSpider):
    name = 'sxgzbd_upgrade'
    allowed_domains = ['wjw.shanxi.gov.cn']
    start_urls = ['http://wjw.shanxi.gov.cn/yqfbl05/index.hrh']

    rules = (
        Rule(
            LinkExtractor(
                allow=r'.+yqfbl05/index(_\d)?\.hrh'),
            follow=True),
        Rule(
            LinkExtractor(
                allow=r'.+wjywl02/(\d){5}\.hrh'),
            callback="parse_detail",
            follow=False),
    )

    def parse_detail(self, response):
        item = {}
        contexts = response.xpath(
            "//div[@class='ze-art']//text()").getall()
        item['date'] = response.xpath(
            "//div[@class='artxx']/text()").re(r"2020-\d\d-\d\d")[0]
        item['contexts'] = "".join(contexts).strip()
        yield item

Items.py:

import scrapy


class SpiderDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    contexts = scrapy.Field()

piplines.py:

from scrapy.exporters import JsonLinesItemExporter


class SpiderDemoPipeline(object):
    def __init__(self):
        self.fp = open("gzbddata.json", "wb")
        self.exporter = JsonLinesItemExporter(
            self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()

记录我踩过的坑：

1、Setting中一定要开启ITEM_PIPELINES，把它的注释号删掉，不然是不会运行PIPELINE文件的，我快折腾了一个小时了，才猛然发现没开开PIPELINE，怪不得没有文件生成。

2、一定要测试好正则表达式有没有问题，不然从一开始就会什么数据都没有。

3、同样在Setting中，ROBOTSTXT_OBEY要False，既然想要学习爬虫，就一定不要安分守己（笑）。

posted @ 2020-02-14 22:12 一夕思醉阅读(117) 评论(0) 编辑收藏举报

刷新页面返回顶部

一夕思醉

寒假大数据学习笔记十三

公告