Scrapy pipelines and cookies

After a spider yields an item, the item passes through every enabled pipeline in turn.
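The order in which pipelines run comes from the ITEM_PIPELINES setting: pipelines with lower numbers run first. A minimal sketch of a settings.py fragment follows; the module path and class names are placeholders made up for illustration.

```python
# settings.py (fragment); the 'myproject.pipelines.*' names are hypothetical
ITEM_PIPELINES = {
    'myproject.pipelines.CleanPipeline': 300,  # lower number, runs first
    'myproject.pipelines.SavePipeline': 800,   # higher number, runs second
}
```

By convention the numbers are kept in the 0-1000 range.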

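When there are several pipeline classes and an item should not be handed on to the next one:

1. Import the exception

from scrapy.exceptions import DropItem

2. Raise it inside the process_item method

raise DropItem()

To hand the item on to the next pipeline instead:

return item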
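The two outcomes can be seen side by side in a duplicate filter. This is only a sketch: the 'id' field is an assumed item field, and the try/except fallback is there only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs without Scrapy; not part of Scrapy itself
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop any item whose 'id' was already seen ('id' is a hypothetical field)."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item['id'] in self.seen_ids:
            # Raising DropItem stops the item here; later pipelines never see it
            raise DropItem('duplicate item: %s' % item['id'])
        self.seen_ids.add(item['id'])
        # Returning the item hands it on to the next pipeline
        return item
```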

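Also, if a pipeline should only handle items from one particular spider, check the spider inside process_item:

if spider.name == 'chouti':
    # spider.name shows which spider sent the item, so each spider's data can be handled separately
    pass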
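A minimal sketch of such a check in a complete pipeline is shown here. The 'title' field and the SimpleNamespace objects standing in for real spiders are assumptions made for the example.

```python
from types import SimpleNamespace


class ChoutiOnlyPipeline:
    """Only touch items that came from the 'chouti' spider."""

    def process_item(self, item, spider):
        if spider.name != 'chouti':
            # Not from our spider: pass the item along unchanged
            return item
        # Spider-specific handling; 'title' is a hypothetical field
        item['title'] = item['title'].strip()
        return item


# Stand-ins for real spider objects, just to exercise the method
pipeline = ChoutiOnlyPipeline()
chouti_item = pipeline.process_item({'title': '  hello  '}, SimpleNamespace(name='chouti'))
other_item = pipeline.process_item({'title': '  hello  '}, SimpleNamespace(name='other'))
```

Returning the item in both branches keeps the rest of the pipeline chain working for every spider.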

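A pipeline class has four methods in all: process_item, from_crawler, open_spider and close_spider. Inside from_crawler, settings are read through the crawler:

crawler.settings.get('SETTING_NAME')  # the name of a setting defined in settings.py; it must be uppercase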

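from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    def process_item(self, item, spider):
        # Process the item and persist it here

        # To discard the item so that no later pipeline sees it, raise instead:
        # raise DropItem()

        # Returning the item passes it on to the following pipelines
        return item

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at start-up to create the pipeline object.
        :param crawler: the crawler, which gives access to the settings
        :return: a pipeline instance
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider: the spider being opened
        :return: None
        """
        print('000000')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider: the spider being closed
        :return: None
        """
        print('111111')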

This makes it possible to open a database connection in open_spider, close it in close_spider, and run the database operations in process_item.
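A minimal sketch of that lifecycle, using the standard-library sqlite3 module, is shown here. The setting name SQLITE_PATH, the table layout and the item fields 'title' and 'url' are all assumptions made for the example.

```python
import sqlite3


class SQLitePipeline:
    def __init__(self, db_path):
        self.db_path = db_path
        self.conn = None

    @classmethod
    def from_crawler(cls, crawler):
        # 'SQLITE_PATH' is a hypothetical setting; fall back to a local file
        return cls(crawler.settings.get('SQLITE_PATH', 'items.db'))

    def open_spider(self, spider):
        # Open the connection once, when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)'
        )

    def process_item(self, item, spider):
        # Persist each item, then hand it on to the next pipeline
        self.conn.execute(
            'INSERT INTO items (title, url) VALUES (?, ?)',
            (item['title'], item['url']),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the connection once, when the spider finishes
        self.conn.close()
```

Committing after every item keeps the sketch simple; a real pipeline might batch commits instead.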

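Getting cookies

1. First import the module

from scrapy.http.cookies import CookieJar

2. Create a cookie container, extract the cookies from the response, and read their values

cookie_obj = CookieJar()
cookie_obj.extract_cookies(response, request=response.request)
print(cookie_obj._cookies)    # the extracted cookies, nested as domain -> path -> name -> Cookie object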
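Because _cookies is nested by domain and path, it usually needs flattening into a plain name-to-value dict before it can be passed to a Request. A minimal sketch follows; flatten_cookies is a helper name invented here, and the sample data stands in for a real jar.

```python
from types import SimpleNamespace


def flatten_cookies(nested):
    """Turn {domain: {path: {name: cookie}}} into a flat {name: value} dict."""
    flat = {}
    for paths in nested.values():
        for cookies in paths.values():
            for name, cookie in cookies.items():
                flat[name] = cookie.value
    return flat


# Made-up stand-in for cookie_obj._cookies; real entries are Cookie objects
sample = {'dig.chouti.com': {'/': {'gpsd': SimpleNamespace(value='abc123')}}}
flat = flatten_cookies(sample)
```

If two domains set a cookie with the same name, the later one wins in this sketch.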

3. Log in to Chouti automatically and upvote every entry on the first page

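import scrapy
from scrapy import Request
from scrapy.http.cookies import CookieJar
from scrapy.selector import Selector


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']

    cookie_dict = None

    def parse(self, response):
        # Collect the cookies set by the first response
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, request=response.request)
        # print(cookie_obj._cookies)    # show the extracted cookies

        # _cookies is nested as domain -> path -> name -> Cookie object,
        # while Request(cookies=...) expects a flat {name: value} dict
        self.cookie_dict = {}
        for paths in cookie_obj._cookies.values():
            for cookies in paths.values():
                for name, cookie in cookies.items():
                    self.cookie_dict[name] = cookie.value

        # Log in, sending the cookies received above
        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8613701055688&password=800605&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            cookies=self.cookie_dict,
            callback=self.check_login
        )

    def check_login(self, response):
        # print('one>>>', response.text)
        # After logging in, fetch the front page again
        yield Request(
            url='https://dig.chouti.com/',
            callback=self.good,
        )

    def good(self, response):
        print(response)
        # Every entry on the page carries its id in the share-linkid attribute
        id_list = Selector(response=response).xpath('//div[@share-linkid]/@share-linkid').extract()
        for nid in id_list:
            print(nid)
            # Send one upvote request per entry
            url = 'https://dig.chouti.com/link/vote?linksId=%s' % nid
            yield Request(
                url=url,
                cookies=self.cookie_dict,
                method="POST",
                headers={'content-type': 'text/plain; charset=utf-8'},
                callback=self.show,
            )

    def show(self, response):
        print(response.text)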
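The login body above is written out by hand. As a small sketch, the same kind of string can be built from a dict with the standard library, which also takes care of escaping special characters. The credentials below are placeholders, not real ones.

```python
from urllib.parse import urlencode

# Placeholder credentials for illustration only
form = {'phone': '8613800000000', 'password': 'secret', 'oneMonth': 1}

# Produces an application/x-www-form-urlencoded string for Request(body=...)
body = urlencode(form)
```

Scrapy's FormRequest, given a formdata dict of strings, does this encoding itself and may be a simpler choice for login forms.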

posted @ 2018-10-18 17:53 Trunkslisa
