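CrawlSpider usage (extracting and following page links, e.g. "next page")

Creating a spider file based on CrawlSpider

    scrapy genspider -t crawl spider_name url

Pay attention to the follow parameter

Example 1: follow = False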

spider/chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
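    # Instantiate a link extractor object.
    # Link extractor: extracts the specified links (URLs) from a page.
    # allow parameter: takes a regular expression.
    # The link extractor picks out the links in the page that match that regular expression.
    # All extracted links are handed to the rule parser.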
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    rules = (
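        # Instantiate a rule parser.
        # After receiving the links sent by the link extractor, the rule parser requests each link and fetches the corresponding page content.
        # callback: the parsing method (method/function) applied to each response.
        # follow: whether to keep applying the link extractor to the pages reached through the links it has already extracted.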
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        print(response)

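Result: the link extractor is not applied again to the pages behind the extracted links, so only the links found on the start page are requested.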

C:\Users\Administrator\PycharmProjects\new\CrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/5>
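
To make the role of the allow parameter more concrete, here is a minimal standard-library sketch of the filtering step. It assumes, as a simplification, that the pattern is searched for anywhere inside each link's absolute URL; it is not LinkExtractor itself, and the candidate URLs and the filter_links helper are made up for illustration.

```python
import re

# Same pattern as the `allow` argument used in the spider above.
ALLOW = re.compile(r'/all/hot/recent/\d+')

# Hypothetical absolute URLs standing in for the links found on one page.
candidate_links = [
    'https://dig.chouti.com/all/hot/recent/1',
    'https://dig.chouti.com/all/hot/recent/25',
    'https://dig.chouti.com/all/new/recent/1',
    'https://dig.chouti.com/about',
]


def filter_links(urls, pattern=ALLOW):
    """Keep only the URLs in which the pattern matches somewhere (re.search)."""
    return [url for url in urls if pattern.search(url)]


matched = filter_links(candidate_links)
print(matched)
```

Only the two /all/hot/recent/<number> URLs survive the filter; the other two would never be requested by a rule built on this pattern.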

Example 2:

follow = True

spider/chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
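    # Instantiate a link extractor object.
    # Link extractor: extracts the specified links (URLs) from a page.
    # allow parameter: takes a regular expression.
    # The link extractor picks out the links in the page that match that regular expression.
    # All extracted links are handed to the rule parser.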
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    rules = (
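        # Instantiate a rule parser.
        # After receiving the links sent by the link extractor, the rule parser requests each link and fetches the corresponding page content.
        # callback: the parsing method (method/function) applied to each response.
        # follow: whether to keep applying the link extractor to the pages reached through the links it has already extracted.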
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

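Result: with follow = True the link extractor keeps being applied to every page it reaches, so pages beyond those linked from the start page are requested as well.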

C:\Users\Administrator\PycharmProjects\new\CrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/5>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/13>
<200 https://dig.chouti.com/all/hot/recent/14>
<200 https://dig.chouti.com/all/hot/recent/11>
<200 https://dig.chouti.com/all/hot/recent/12>
<200 https://dig.chouti.com/all/hot/recent/16>
<200 https://dig.chouti.com/all/hot/recent/17>
<200 https://dig.chouti.com/all/hot/recent/15>
<200 https://dig.chouti.com/all/hot/recent/18>
<200 https://dig.chouti.com/all/hot/recent/20>
<200 https://dig.chouti.com/all/hot/recent/19>
<200 https://dig.chouti.com/all/hot/recent/22>
<200 https://dig.chouti.com/all/hot/recent/21>
<200 https://dig.chouti.com/all/hot/recent/24>
<200 https://dig.chouti.com/all/hot/recent/23>
<200 https://dig.chouti.com/all/hot/recent/26>
<200 https://dig.chouti.com/all/hot/recent/25>
<200 https://dig.chouti.com/all/hot/recent/28>
<200 https://dig.chouti.com/all/hot/recent/27>
<200 https://dig.chouti.com/all/hot/recent/30>
<200 https://dig.chouti.com/all/hot/recent/29>
<200 https://dig.chouti.com/all/hot/recent/31>
<200 https://dig.chouti.com/all/hot/recent/32>
<200 https://dig.chouti.com/all/hot/recent/33>
<200 https://dig.chouti.com/all/hot/recent/34>
<200 https://dig.chouti.com/all/hot/recent/37>
<200 https://dig.chouti.com/all/hot/recent/36>
<200 https://dig.chouti.com/all/hot/recent/38>
<200 https://dig.chouti.com/all/hot/recent/35>
<200 https://dig.chouti.com/all/hot/recent/40>
<200 https://dig.chouti.com/all/hot/recent/41>
<200 https://dig.chouti.com/all/hot/recent/39>
<200 https://dig.chouti.com/all/hot/recent/42>
<200 https://dig.chouti.com/all/hot/recent/45>
<200 https://dig.chouti.com/all/hot/recent/43>
<200 https://dig.chouti.com/all/hot/recent/44>
<200 https://dig.chouti.com/all/hot/recent/46>
<200 https://dig.chouti.com/all/hot/recent/49>
<200 https://dig.chouti.com/all/hot/recent/48>
<200 https://dig.chouti.com/all/hot/recent/47>
<200 https://dig.chouti.com/all/hot/recent/50>
<200 https://dig.chouti.com/all/hot/recent/51>
<200 https://dig.chouti.com/all/hot/recent/52>
<200 https://dig.chouti.com/all/hot/recent/53>
<200 https://dig.chouti.com/all/hot/recent/54>
<200 https://dig.chouti.com/all/hot/recent/55>
<200 https://dig.chouti.com/all/hot/recent/56>
<200 https://dig.chouti.com/all/hot/recent/58>
<200 https://dig.chouti.com/all/hot/recent/57>
<200 https://dig.chouti.com/all/hot/recent/60>
<200 https://dig.chouti.com/all/hot/recent/59>
<200 https://dig.chouti.com/all/hot/recent/61>
<200 https://dig.chouti.com/all/hot/recent/62>
<200 https://dig.chouti.com/all/hot/recent/64>
<200 https://dig.chouti.com/all/hot/recent/63>
<200 https://dig.chouti.com/all/hot/recent/65>
<200 https://dig.chouti.com/all/hot/recent/66>
<200 https://dig.chouti.com/all/hot/recent/68>
<200 https://dig.chouti.com/all/hot/recent/67>
<200 https://dig.chouti.com/all/hot/recent/69>
<200 https://dig.chouti.com/all/hot/recent/70>
<200 https://dig.chouti.com/all/hot/recent/71>
<200 https://dig.chouti.com/all/hot/recent/72>
<200 https://dig.chouti.com/all/hot/recent/73>
<200 https://dig.chouti.com/all/hot/recent/74>
<200 https://dig.chouti.com/all/hot/recent/75>
<200 https://dig.chouti.com/all/hot/recent/76>
<200 https://dig.chouti.com/all/hot/recent/78>
<200 https://dig.chouti.com/all/hot/recent/77>
<200 https://dig.chouti.com/all/hot/recent/79>
<200 https://dig.chouti.com/all/hot/recent/80>
<200 https://dig.chouti.com/all/hot/recent/82>
<200 https://dig.chouti.com/all/hot/recent/81>
<200 https://dig.chouti.com/all/hot/recent/84>
<200 https://dig.chouti.com/all/hot/recent/83>
<200 https://dig.chouti.com/all/hot/recent/85>
<200 https://dig.chouti.com/all/hot/recent/86>
<200 https://dig.chouti.com/all/hot/recent/87>
<200 https://dig.chouti.com/all/hot/recent/88>
<200 https://dig.chouti.com/all/hot/recent/89>
<200 https://dig.chouti.com/all/hot/recent/90>
<200 https://dig.chouti.com/all/hot/recent/91>
<200 https://dig.chouti.com/all/hot/recent/92>
<200 https://dig.chouti.com/all/hot/recent/94>
<200 https://dig.chouti.com/all/hot/recent/93>
<200 https://dig.chouti.com/all/hot/recent/96>
<200 https://dig.chouti.com/all/hot/recent/95>
<200 https://dig.chouti.com/all/hot/recent/98>
<200 https://dig.chouti.com/all/hot/recent/97>
<200 https://dig.chouti.com/all/hot/recent/100>
<200 https://dig.chouti.com/all/hot/recent/99>
<200 https://dig.chouti.com/all/hot/recent/102>
<200 https://dig.chouti.com/all/hot/recent/101>
<200 https://dig.chouti.com/all/hot/recent/103>
<200 https://dig.chouti.com/all/hot/recent/104>
<200 https://dig.chouti.com/all/hot/recent/105>
<200 https://dig.chouti.com/all/hot/recent/106>
<200 https://dig.chouti.com/all/hot/recent/107>
<200 https://dig.chouti.com/all/hot/recent/108>
<200 https://dig.chouti.com/all/hot/recent/109>
<200 https://dig.chouti.com/all/hot/recent/110>
<200 https://dig.chouti.com/all/hot/recent/111>
<200 https://dig.chouti.com/all/hot/recent/112>
<200 https://dig.chouti.com/all/hot/recent/113>
<200 https://dig.chouti.com/all/hot/recent/114>
<200 https://dig.chouti.com/all/hot/recent/115>
<200 https://dig.chouti.com/all/hot/recent/116>
<200 https://dig.chouti.com/all/hot/recent/118>
<200 https://dig.chouti.com/all/hot/recent/117>
<200 https://dig.chouti.com/all/hot/recent/119>
<200 https://dig.chouti.com/all/hot/recent/120>
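
The difference between the two runs can be reproduced on a toy model. The sketch below is a simulation written for this post, not Scrapy code: LINKS is a hypothetical link graph, crawl is a made-up helper, and the seen set stands in for the duplicate filtering that keeps a page from being requested twice.

```python
from collections import deque

# Hypothetical link graph: each page maps to the matching links it contains.
LINKS = {
    'start': ['p1', 'p2', 'p3'],
    'p1': ['p2', 'p3', 'p4'],
    'p2': ['p3', 'p4', 'p5'],
    'p3': ['p4', 'p5', 'p6'],
    'p4': ['p5', 'p6'],
    'p5': ['p6'],
    'p6': [],
}


def crawl(start, follow):
    """Return the pages handed to the callback, in the order they are visited."""
    seen = {start}      # pages already scheduled, so nothing is requested twice
    queue = deque()     # pages waiting to be requested
    visited = []        # pages whose response reached the callback

    def enqueue(page):
        # Apply the "link extractor" to one page and schedule the new links.
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)

    enqueue(start)      # links on the start page are always extracted
    while queue:
        page = queue.popleft()
        visited.append(page)
        if follow:      # follow=True: extract links from this page too
            enqueue(page)
    return visited


print(crawl('start', follow=False))
print(crawl('start', follow=True))
```

With follow=False only p1 to p3 are visited, mirroring the ten pages of Example 1; with follow=True the crawl continues through p6, mirroring the longer listing of Example 2.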

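Note:

    To process the crawled page data afterwards, extract the fields with XPath in the callback, then yield them to the item pipeline, which performs the storage.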
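
A rough sketch of that flow follows. Everything specific in it is an assumption made for illustration: the XPath expressions, the field names, the JsonLinesPipeline class and the items.jl file name do not come from the original spider. The callback is written as a plain function to keep the block self-contained; inside the spider it would be the parse_item(self, response) method.

```python
import json


def parse_item(response):
    """Callback body: extract fields with XPath and yield one dict per entry."""
    # Hypothetical XPath expressions; adjust them to the real page markup.
    for node in response.xpath('//div[@class="item"]'):
        yield {
            'title': node.xpath('.//a/text()').get(),
            'url': response.url,
        }


class JsonLinesPipeline:
    """Item pipeline sketch: writes each yielded item as one JSON line."""

    def __init__(self, path='items.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

For the pipeline to run it has to be registered under ITEM_PIPELINES in settings.py, for example as 'CrawlspiderPro.pipelines.JsonLinesPipeline': 300, assuming the class lives in the project's pipelines.py.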

Posted 2018-12-19 17:47 by Corey0606
