Getting Started with Scrapy: The Spider Class in Detail (parse(), Selectors, Extracting Data)

Installing Scrapy & Creating a Project

# Install Scrapy
pip install scrapy
# Create a project
scrapy startproject tutorial  # "tutorial" is the project name
# Create a spider
scrapy genspider <spider_name> <domain.com>

This produces the following directory structure:

tutorial/
    scrapy.cfg            # configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders
            __init__.py
            spider1.py
            ...

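The Spider Class

A spider class must inherit from scrapy.Spider. Its essential attributes and methods are:

1. name = "quotes": the spider's name. It must be unique within the project, because the command scrapy crawl <spider name> uses it to start that spider.

2. start_requests(): must return an iterable of Requests (a list or a generator). The spider starts crawling from the Requests it provides, for example: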

# start_requests()
def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

3. parse(): handles the Response returned for each Request. parse() typically extracts the scraped data from the Response as dicts, or finds new URLs and issues further Requests for them.

# parse()
def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'quotes-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)
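
The index -2 in parse() may look odd at first. The URL ends with a slash, so the last element after splitting is an empty string and the page number sits just before it. A minimal sketch with plain strings (the helper name page_from_url is made up for illustration and is not part of Scrapy):

```python
def page_from_url(url):
    """Return the second-to-last segment of a slash-terminated URL."""
    parts = url.split("/")
    # The trailing slash leaves an empty string as the last element,
    # so the page number is found at index -2.
    return parts[-2]


url = "http://quotes.toscrape.com/page/1/"
print(url.split("/"))  # ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
print("quotes-%s.html" % page_from_url(url))  # quotes-1.html
```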

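4. The start_urls list: instead of implementing start_requests(), you can define a class attribute named start_urls. It likewise supplies the spider's initial Requests, with less code.

When the spider runs, parse() is called automatically to handle the response for each URL in start_urls, because parse() is the default callback: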

start_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

5. Running the spider:

$ scrapy crawl quotes

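What happens when the spider runs: Scrapy schedules the scrapy.Request objects returned by the spider's start_requests() method. For each response it receives, it instantiates a Response object and calls the callback associated with the request (here, parse()), passing the Response as the argument.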

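The parse() Function

parse() is arguably the most important method of a spider class, since it holds the main logic for parsing responses.

The best way to learn Scrapy selectors is the Scrapy shell. The following command opens an interactive session: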

scrapy shell 'http://quotes.toscrape.com/page/1/'

The interactive examples below walk through the selectors available on a Response.

CSS Selectors

response.css() returns a SelectorList, which is a list of Selector objects, for example:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

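getall() returns a list of all matching strings, and get() returns the first match (or None if nothing matches). The ::text pseudo-element selects the text inside the element, leaving out the surrounding tags (<tag>).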

>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'

re() works like getall() followed by a regular-expression filter over the extracted content:

>>> response.css('title::text').re(r'Q\w+')
['Quotes']
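
As a rough mental model, the same result can be reproduced with the standard re module applied to the strings that getall() returned. This is only a sketch of the idea, not how Scrapy implements re() internally:

```python
import re

titles = ["Quotes to Scrape"]  # what getall() returned in the shell session above

# Apply the pattern to every extracted string and flatten the matches.
matches = [m for text in titles for m in re.findall(r"Q\w+", text)]
print(matches)  # ['Quotes']
```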

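XPath Selectors

XPath selectors are more powerful than CSS selectors. In fact, Scrapy converts CSS selectors to XPath internally, which is why the CSS example above printed xpath='descendant-or-self::title'.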

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

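Yielding Data as Dicts

To turn the data scraped from a Response into dicts, write parse() as a generator that yields one dict per item, for example:

def parse(self, response):
    for quote in response.css('div.quote'):  # each quote is a Selector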
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

Storing Data to a File

The simplest way is Feed exports. Use the -o option to name a JSON file that stores everything parse() yields.

$ scrapy crawl quotes -o quotes.json -s FEED_EXPORT_ENCODING=utf-8
# If the data contains Chinese (or other non-ASCII) text, be sure to add -s FEED_EXPORT_ENCODING=utf-8

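Storing in JSON Lines format: for historical reasons, -o appends to an existing file instead of overwriting it, so a second run leaves the .json file with invalid JSON. The JSON Lines format (.jl) avoids this problem, because every line is a separate JSON object: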

$ scrapy crawl quotes -o quotes.jl
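
The difference can be reproduced with the standard json module alone. The sketch below simulates two runs appended to the same file using in-memory strings. The exact bytes Scrapy writes differ in whitespace, but the structure is the same: two arrays back to back versus one object per line.

```python
import json

items = [{"text": "first"}, {"text": "second"}]

# Two runs appended to one .json file: two JSON arrays back to back.
appended_json = json.dumps(items) + json.dumps(items)
try:
    json.loads(appended_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# Two runs appended to one .jl file: one object per line, still parseable.
one_run = "".join(json.dumps(item) + "\n" for item in items)
appended_jl = one_run + one_run
parsed = [json.loads(line) for line in appended_jl.splitlines()]

print(json_ok)      # False
print(len(parsed))  # 4
```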

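For more processing of the data (for example validating scraped items or removing duplicates), write an Item Pipeline in pipelines.py. If you only need to store the scraped data, this is not necessary.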

Extracting URLs to Follow Links

For example, to extract the URL of the next page and crawl it:

<li class="next">
    <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a> <!-- &rarr; is a right arrow -->
</li>

Either of the following extracts the href attribute of the <a> tag:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.css('li.next a').attrib['href']
'/page/2/'

When parse() yields a Request object, Scrapy schedules that request and, once it completes, calls the function given in the callback argument, as shown below:

def parse(self, response):
    for quote in response.css('div.quote'):  # each quote is a Selector
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('div.tags a.tag::text').getall(),
        }

    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)  # urljoin() turns a relative URL into an absolute one
        yield scrapy.Request(next_page, callback=self.parse)  # schedule the next page
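
The joining rule follows the standard library's urllib.parse.urljoin, so it can be explored without Scrapy. In the sketch below, base stands in for the URL of the current response, and next.html is an invented relative path used only for contrast:

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1/"  # stands in for the response URL

# A path starting with "/" replaces the whole path of the base URL.
print(urljoin(base, "/page/2/"))   # http://quotes.toscrape.com/page/2/
# A path without a leading "/" is resolved relative to the base directory.
print(urljoin(base, "next.html"))  # http://quotes.toscrape.com/page/1/next.html
# An absolute URL is returned unchanged.
print(urljoin(base, "http://example.com/"))  # http://example.com/
```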

response.follow()

response.follow() is a more convenient alternative to scrapy.Request() because it accepts relative URLs directly. The code above simplifies to:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)  # next_page = '/page/2/'

response.follow() also accepts a Selector directly, so the URL does not need to be extracted first. The code simplifies further:

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)  # href is a single Selector

Note that a SelectorList cannot be passed directly. The following usage is wrong:

yield response.follow(response.css('li.next a::attr(href)'), callback=self.parse)

For a CSS selector that matches <a> elements, response.follow() uses the href attribute automatically, so the shortest version is:

# CSS selector
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

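With XPath, the href attribute can be selected explicitly:

# Selecting the <a> element itself (//div[@class='p_name']/a) should work too,
# because response.follow() inspects the selected element, not the selector syntax.
for a in response.xpath("//div[@class='p_name']/a/@href"):
    yield response.follow(a, callback=self.parse)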

By default, Scrapy filters out requests to URLs it has already visited. The filter can be changed through the DUPEFILTER_CLASS setting.
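
Conceptually, duplicate filtering boils down to remembering what has been seen. The sketch below is a deliberately simplified stand-in that compares raw URL strings. Scrapy's default filter compares request fingerprints instead, and filter_duplicates is an invented helper name:

```python
def filter_duplicates(urls):
    """Yield each URL the first time it appears and skip repeats."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already scheduled once, so drop it
        seen.add(url)
        yield url


queued = [
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
    "http://quotes.toscrape.com/page/1/",  # repeat
]
unique = list(filter_duplicates(queued))
print(unique)
```

To deliberately request the same URL again, pass dont_filter=True when creating the Request.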

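Passing Arguments to scrapy crawl

Use the -a option to pass extra arguments to a spider. Each argument becomes an attribute of the spider instance (read it with self.tag or getattr(self, 'tag', None)). In the example below, the command-line argument -a tag=humor makes the spider crawl only that tag, and the data is saved to quotes-humor.json: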

$ scrapy crawl quotes -o quotes-humor.json -a tag=humor

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
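
The attribute mechanism can be illustrated without Scrapy. In this sketch, ArgsHolder is an invented stand-in for a spider that stores keyword arguments as attributes, and build_start_url mirrors the logic of start_requests() above:

```python
class ArgsHolder:
    """Simplified stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def build_start_url(spider):
    """Mirror start_requests(): append the tag to the base URL if one was given."""
    url = "http://quotes.toscrape.com/"
    tag = getattr(spider, "tag", None)  # None when no tag argument was passed
    if tag is not None:
        url = url + "tag/" + tag
    return url


print(build_start_url(ArgsHolder(tag="humor")))  # http://quotes.toscrape.com/tag/humor
print(build_start_url(ArgsHolder()))             # http://quotes.toscrape.com/
```

Using getattr with a default keeps the spider working when -a tag=... is omitted, whereas self.tag would raise AttributeError in that case. Values passed with -a arrive as strings.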

posted @ 2020-08-19 12:58 x0c
