Scrapy: Item Pipeline
Official documentation: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
To activate a pipeline you enable it in settings.py, but a pipeline enabled there applies to every spider. If the project runs many spiders, routing all of their items through the same pipelines gets messy. You can inspect the spider argument of process_item(self, item, spider) to tell which spider an item came from, but that approach is clumsy. A better way is to set the custom_settings attribute on each spider class, so each spider gets its own pipeline configuration. Example:
This also shows how custom_settings is used and what it is good for.
import scrapy

class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    custom_settings = {
        'ITEM_PIPELINES': {
            'TB.pipelines.TBMongoPipeline': 300,
        }
    }
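For comparison, project-wide activation is just the same dictionary placed in settings.py; a minimal sketch is shown below (the TB.pipelines.TBMongoPipeline path is simply the example project's pipeline):

# settings.py -- this applies to every spider in the project
ITEM_PIPELINES = {
    'TB.pipelines.TBMongoPipeline': 300,
}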
I Methods
1 process_item(self,item,spider)
This method is called for every item, for each enabled item pipeline component. It must either return the item (possibly modified) so the next pipeline can process it, or raise DropItem to discard it.
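As a minimal sketch of what process_item typically does (the 'price' and 'name' fields and the VAT factor are assumptions for illustration, in the spirit of the price-validation example from the official docs):

from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15  # hypothetical tax multiplier

    def process_item(self, item, spider):
        if item.get('price'):
            # adjust the price and pass the item on to the next pipeline
            item['price'] = item['price'] * self.vat_factor
            return item
        # items without a price are dropped and never reach later pipelines
        raise DropItem("Missing price in %s" % item)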
2 open_spider(self,spider)
This method is called when the spider is opened.
3 close_spider(self,spider)
This method is called when the spider is closed.
4 from_crawler(cls,crawler)
If defined, this class method is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline; the crawler gives access to all Scrapy core components, such as the settings.
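Putting the four methods together, a minimal sketch of a pipeline that writes items to a JSON-lines file might look like this (the FILE_PATH setting name and its default are assumptions for illustration):

import json

class JsonWriterPipeline(object):
    def __init__(self, file_path):
        self.file_path = file_path

    @classmethod
    def from_crawler(cls, crawler):
        # build the pipeline instance from the crawler's settings
        return cls(file_path=crawler.settings.get('FILE_PATH', 'items.jl'))

    def open_spider(self, spider):
        # called once when the spider starts: acquire resources here
        self.file = open(self.file_path, 'w')

    def close_spider(self, spider):
        # called once when the spider finishes: release resources here
        self.file.close()

    def process_item(self, item, spider):
        # called for every item: serialize it and pass it on
        self.file.write(json.dumps(dict(item)) + "\n")
        return item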
II Item Pipeline examples
1 Write items to MongoDB
import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
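To use this pipeline you would enable it and provide the two settings it reads; a sketch of settings.py follows (the myproject module path is an assumption about the project layout):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'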
2 Duplicates filter
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
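Note that this filter keeps every seen id in memory and assumes the items carry an 'id' field. When several pipelines are enabled together, the integer values decide the order in which items pass through them (lower numbers run first); a sketch with assumed module paths:

# settings.py -- lower numbers run earlier
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 100,  # drop duplicates first
    'myproject.pipelines.MongoPipeline': 300,       # then store the remaining items
}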