Learning CrawlSpider

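I have been learning the Scrapy framework lately and found it quite confusing at first.

Here are a few misunderstandings I ran into.

When you create a Scrapy project, it generates a spiders folder, which is where the page-parsing code we write goes.

In a spider file, the class attributes include name, allowed_domains, and start_urls. (The version below has been modified by me.)

start_urls holds the initial URLs to request. Scrapy turns them into Requests, which the scheduler queues up until the downloader fetches them.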

    name = 'book.zongheng'
    # allowed_domains expects bare domain names, not URLs with a scheme
    # allowed_domains = ['book.zongheng.com']
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u0/p1/v9/s1/t0/u0/i1/ALL.html',
                  'http://book.zongheng.com/store/c0/c0/b0/u0/p6/v9/s1/t0/u0/i1/ALL.html']

Of course, you can also build the initial requests yourself by overriding start_requests(self):

    def start_requests(self):
        max_page = 6
        for i in range(1, max_page):
            url = 'http://book.zongheng.com/store/c0/c0/b0/u0/p' + str(i) + '/v9/s1/t0/u0/i1/ALL.html'
            yield Request(url)
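
As a quick check of the loop above, the URL construction can be run on its own without Scrapy. The helper name build_list_urls is hypothetical; the URL pattern is the one used in start_requests. Note that range(1, max_page) stops before max_page, so max_page = 6 covers pages 1 to 5.

```python
def build_list_urls(max_page=6):
    """Build the list-page URLs the same way start_requests does."""
    base = 'http://book.zongheng.com/store/c0/c0/b0/u0/p{}/v9/s1/t0/u0/i1/ALL.html'
    # range() excludes its upper bound, so this covers pages 1 .. max_page - 1
    return [base.format(page) for page in range(1, max_page)]


urls = build_list_urls()
print(len(urls))   # number of list pages that would be requested
print(urls[-1])    # last list page that would be requested
```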

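The Request you yield can also take a callback, that is, the function that will handle its response. Once start_requests is overridden, Scrapy takes its initial requests from that method, so the original start_urls no longer has any effect.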

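CrawlSpider, in turn, can filter which URLs get followed. The crawl rules are expressed with a dedicated data structure, Rule, which holds the configuration for extracting and following links. The spider uses its Rules to decide which links on the current page should be crawled next.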
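
To make the filtering concrete: LinkExtractor's allow argument takes regular expressions that are checked against each candidate URL. The sketch below imitates that check with the standard re module only. It is not Scrapy's implementation, and the helper name is_allowed and the sample URLs are made up for illustration. It also shows why the dot before html is worth escaping.

```python
import re


def is_allowed(url, pattern):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


loose = r'http://book.zongheng.com/book/\d+.html'       # '.' matches any character
strict = r'http://book\.zongheng\.com/book/\d+\.html'   # '\.' matches only a literal dot

detail = 'http://book.zongheng.com/book/123456.html'
listing = 'http://book.zongheng.com/store/c0/c0/b0/u0/p1/v9/s1/t0/u0/i1/ALL.html'
odd = 'http://book.zongheng.com/book/123456xhtml'

print(is_allowed(detail, strict))    # a detail page is accepted
print(is_allowed(listing, strict))   # a list page is rejected
print(is_allowed(odd, loose), is_allowed(odd, strict))  # only the loose pattern lets this through
```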

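I wanted to define two items. Taking a novel site as an example, the first item stores each novel's name and rank (taken from the detail page), and the second stores each novel's author, title, and genre (taken from the list page). A Rule can extract all the novel links on the current page and generate Requests for them automatically, and a dedicated parse callback then pulls the data out of each response. Pagination is handled with start_requests.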

    rules = (
        # Book detail pages
        Rule(LinkExtractor(allow=r'http://book\.zongheng\.com/book/\d+\.html'), callback='parse_item',
             follow=True,
             ),
        # Book list pages
        # Rule(LinkExtractor(allow=r'http://book\.zongheng\.com/store/c0/c0/b0/u0/p\d+/v9/s1/t0/u0/i1/ALL\.html'),
        #      callback='parse_list_item',
        #      follow=True,
        #      ),
    )

    # def process_link(self, links):  # filter the matched URLs
    #     for link in links:
    #         print(link)

    # def process_request(self):  # filter requests
    #     pass

    # Rule callback: extracts each novel's name and rank
    def parse_item(self, response):
        item = BookInfoItem()
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        item['book_name'] = response.xpath('//div[@class="book-name"]/text()').extract_first().replace('\r\n', '').strip()
        item['book_rank'] = response.xpath('//i[@class="a1"]/text()').extract_first()
        yield item


    # Parses the data for all the novels on a list page
    def parse_list_item(self, response):
        item = BookListItem()
        item['book_name'] = response.xpath('//div[@class="bookname"]/a/text()').extract()
        item['book_auth'] = response.xpath('//div[@class="bookilnk"]/a[1]/text()').extract()
        item['book_type'] = response.xpath('//div[@class="bookilnk"]/a[2]/text()').extract()
        yield item

        # max_page = 6
        # for i in range(1, max_page):
        #     url = 'http://book.zongheng.com/store/c0/c0/b0/u0/p' + str(i) + '/v9/s1/t0/u0/i1/ALL.html'
        #     yield Request(url, callback=self.parse_list_item)

    def start_requests(self):
        max_page = 6
        for i in range(1, max_page):
            url = 'http://book.zongheng.com/store/c0/c0/b0/u0/p' + str(i) + '/v9/s1/t0/u0/i1/ALL.html'
            yield Request(url)
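
One thing to watch in parse_item: extract_first() returns None when the XPath matches nothing, and calling .replace() on None raises an AttributeError. A small helper, sketched here under the hypothetical name clean_text, keeps the same cleanup but tolerates a missing value.

```python
def clean_text(value, default=''):
    """Strip line breaks and surrounding whitespace; tolerate a missing value."""
    if value is None:
        return default
    return value.replace('\r\n', '').strip()


print(repr(clean_text('\r\n   Some Title \r\n')))
print(repr(clean_text(None)))
```

In parse_item it would simply wrap the extract_first() call for book_name.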

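I store name and rank in MySQL and the other item in a txt file. Written this way, the crawl makes 257 requests: 5 for the list pages and 252 for the individual novels. The txt file, however, stays empty. This is because nothing ever calls the list-page parser: the requests from start_requests have no callback.

So I added the callback: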
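
For the txt side, here is a minimal sketch of an item pipeline, assuming list-page items can be recognised by their book_auth field. The class name BookListTxtPipeline and the file name book_list.txt are hypothetical, and the MySQL pipeline is not shown. A Scrapy pipeline is a plain class with a process_item(item, spider) method, so the sketch needs nothing from Scrapy and can be tried with a dict. Since extract() returns lists, the three fields are zipped into one line per novel.

```python
import os
import tempfile


class BookListTxtPipeline:
    """Hypothetical pipeline: writes list-page items to a text file."""

    def __init__(self, path='book_list.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Only list-page items carry book_auth; everything else passes through
        if 'book_auth' in item:
            rows = zip(item['book_name'], item['book_auth'], item['book_type'])
            for name, auth, kind in rows:
                self.file.write('{}\t{}\t{}\n'.format(name, auth, kind))
        return item


# Try it without Scrapy: dicts stand in for the items, None for the spider
path = os.path.join(tempfile.mkdtemp(), 'book_list.txt')
pipeline = BookListTxtPipeline(path)
pipeline.open_spider(None)
pipeline.process_item({'book_name': ['A', 'B'], 'book_auth': ['x', 'y'],
                       'book_type': ['t1', 't2']}, None)
pipeline.process_item({'book_name': 'C', 'book_rank': '1'}, None)  # detail item, ignored
pipeline.close_spider(None)
with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project the class would be enabled through ITEM_PIPELINES in settings.py.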

    def start_requests(self):
        max_page = 6
        for i in range(1, max_page):
            url = 'http://book.zongheng.com/store/c0/c0/b0/u0/p' + str(i) + '/v9/s1/t0/u0/i1/ALL.html'
            yield Request(url, callback=self.parse_list_item)

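This time, however, there were only the 5 list-page requests. With an explicit callback, the response goes straight to that method and bypasses CrawlSpider's built-in parsing, so the rules are never applied and no detail-page links are followed.

Later, after some reading, I found that CrawlSpider provides a very important method, parse_start_url. It is called when the Response for a Request from start_urls arrives. It parses the Response and returns items or Request objects (which are put back on the request queue to wait their turn). To override it, just rename the parse_list_item method above as shown below, and leave the callback off the requests in start_requests so that they go through CrawlSpider's default handling.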

    def parse_start_url(self, response):
        item = BookListItem()
        item['book_name'] = response.xpath('//div[@class="bookname"]/a/text()').extract()
        item['book_auth'] = response.xpath('//div[@class="bookilnk"]/a[1]/text()').extract()
        item['book_type'] = response.xpath('//div[@class="bookilnk"]/a[2]/text()').extract()
        yield item
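
The behaviour seen above follows from how CrawlSpider dispatches a response. The toy model below is not Scrapy's code; handle_response, apply_rules, and the stub functions are hypothetical stand-ins. It captures one fact: only responses handled by CrawlSpider's default callback go through parse_start_url and the rules, while a Request with its own callback skips both.

```python
def handle_response(response, callback, parse_start_url, apply_rules):
    """Toy dispatch: return everything produced for one start response."""
    if callback is not None:
        # Explicit callback: the rules are never consulted
        return list(callback(response))
    # Default path: parse_start_url output plus requests generated by the rules
    return list(parse_start_url(response)) + list(apply_rules(response))


def parse_list(response):
    # Stand-in for the list-page parser
    yield {'kind': 'list_item', 'page': response}


def default_parse_start_url(response):
    # Stand-in for the un-overridden parse_start_url, which produces nothing
    return []


def apply_rules(response):
    # Stand-in for the Rule that follows detail-page links
    yield {'kind': 'detail_request', 'from': response}


def kinds(results):
    return [r['kind'] for r in results]


no_callback = handle_response('p1', None, default_parse_start_url, apply_rules)
with_callback = handle_response('p1', parse_list, default_parse_start_url, apply_rules)
overridden = handle_response('p1', None, parse_list, apply_rules)

print(kinds(no_callback))    # detail pages followed, but no list item
print(kinds(with_callback))  # list item, but no detail pages followed
print(kinds(overridden))     # both
```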

posted @ 2020-02-18 14:59 为红颜
