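Scrapy Basics (1)

Scrapy

What is Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.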

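Creating a crawler project

Scrapy is a crawling framework, so creating a project is different from simply creating a new file and writing code in it: the project has to be generated from the terminal. Run the following command in a terminal to create one:

scrapy startproject <project_name>

This is similar to how a Django project is created. The project name must follow the naming rules: it must begin with a letter and may contain only letters, digits, and underscores, so it cannot start with a digit or contain Chinese characters.

Once created, the project is generated in the current directory with the following structure: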

<project_name>/
	scrapy.cfg
	<project_name>/
		spiders/
			__init__.py
		__init__.py
		items.py
		middlewares.py
		pipelines.py
		settings.py

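These files are (using a project named scrapy_baidu_E01 as the example):

scrapy.cfg: the project's configuration file.
scrapy_baidu_E01/: the project's Python module; your code is imported from here.
scrapy_baidu_E01/items.py: the item definitions, i.e. the data structures to scrape.
scrapy_baidu_E01/middlewares.py: the project's middlewares.
scrapy_baidu_E01/pipelines.py: the project's item pipelines.
scrapy_baidu_E01/settings.py: the project's settings.
scrapy_baidu_E01/spiders/: the directory that holds the spider code.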

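Steps for building a crawler

Create a project (scrapy startproject xxx): generate a new crawler project
Define the target (write items.py): decide which data you want to scrape
Write the spider (spiders/xxspider.py): write the spider that crawls the pages
Store the content (pipelines.py): design a pipeline that stores the scraped data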

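Creating a spider file

Spider files live in the project's spiders directory. Like the project itself, they are generated with a terminal command, for example:

scrapy genspider baidu www.baidu.com

scrapy genspider: the fixed command for generating a spider file
baidu: the name of the spider (and of the generated file)
www.baidu.com: the domain of the site to crawl

Scrapy then generates baidu.py in the project's spiders directory: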

import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"  # Spider name, used when running the spider; must be unique within the project
    allowed_domains = ["www.baidu.com"]  # Domains the spider is allowed to crawl
    start_urls = ["https://www.baidu.com"]  # Start URLs, i.e. the first URLs requested

    def parse(self, response):  # response plays the role of the return value of urllib.request.urlopen() or requests.get()
        pass

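name = "": the identifier of this spider. It must be unique; different spiders must have different names.
allowed_domains = []: the domains the spider is restricted to. The spider only crawls pages under these domains, and URLs outside them are ignored.
start_urls = []: a tuple or list of URLs to crawl. The spider starts here, so the first downloads come from these URLs; further URLs are derived from the pages fetched from these start URLs.
parse(self, response): the parsing method. It is called once each start URL has been downloaded, with the Response object for that URL as its only argument. Its main jobs are:
Parse the returned page data (response.body) and extract structured data (build items).
Generate the Request for the next page that needs to be crawled.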

Running the spider

To run a spider, enter the following in the terminal:

scrapy crawl <spider_name>

Modifying the parse method

    def parse(self, response):  # response plays the role of the return value of urllib.request.urlopen() or requests.get()
        print('--- response body ---')
        with open('./baidu.html', 'w', encoding='utf-8') as fp:
            fp.write(response.text)
        print("response.text: " + response.text)

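Run the spider

scrapy crawl baidu

Check the printed output

Response attributes and methods

response.text: the response body as a string
response.body: the response body as raw bytes
response.xpath(): parses the response content with an XPath expression and returns a list of Selector objects
extract(): called on the result of response.xpath(); returns the data of every matched Selector as a list of strings
extract_first(): called on the result of response.xpath(); returns the data of the first matched Selector, or None if nothing matched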

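Scrapy shell

How do you start and use the Scrapy shell?

In the terminal, enter:

scrapy shell <URL or domain to parse>

For example:

scrapy shell www.baidu.com

In my experience the Scrapy shell makes it very convenient to work out how to locate elements.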

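Scrapy architecture and how it works (important)

Architecture

   (1) Engine -> runs automatically and needs no attention; it coordinates all request objects and hands them to the downloader.
   (2) Downloader -> takes request objects from the engine and fetches the data.
   (3) Spiders -> the Spider class defines how a site (or a group of sites) is crawled, including the crawling actions (for example, whether to follow links) and how to extract structured data (items) from page content. In other words, a Spider is where you define the crawling behaviour and the parsing for a particular page (or set of pages).
   (4) Scheduler -> has its own scheduling rules and needs no attention.
   (5) Item Pipeline -> the pipeline that finally processes the data, with hooks reserved for our own processing. After an Item has been collected by a Spider, it is passed to the Item Pipeline, where several components process it in a defined order. Each item pipeline component (sometimes called simply an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and processed no further.
    Typical uses of an item pipeline:
    1. Cleaning HTML data
    2. Validating scraped data (checking that an item contains certain fields)
    3. Checking for duplicates (and dropping them)
    4. Saving the scraped results to a database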

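How it works

In the Scrapy framework, whatever a spider yields is checked to see whether it is a Request or data (an Item). A Request is sent to the scheduler, downloaded, and comes back to the spider's callback as a Response; an Item is passed to the Item Pipeline for processing.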
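The routing described above can be imitated with a few lines of plain Python. This is a toy model only, not Scrapy's actual engine: the Request class, the FAKE_SITE dictionary, and the crawl function are hypothetical stand-ins invented for the illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterator, Union


@dataclass
class Request:
    """Toy stand-in for scrapy.Request (illustration only)."""
    url: str
    callback: Callable


# Hypothetical "website": URL -> (records on the page, next page URL or None).
FAKE_SITE = {
    "page1": (["a", "b"], "page2"),
    "page2": (["c"], None),
}


def parse(url: str, page) -> Iterator[Union[dict, Request]]:
    """Spider callback: yields items for the page and a Request for the next page."""
    records, next_url = page
    for value in records:
        yield {"value": value, "source": url}
    if next_url is not None:
        yield Request(next_url, parse)


def crawl(start_url: str) -> list:
    """Toy engine loop: Requests go back to the scheduler, items go to the pipeline."""
    scheduler = deque([Request(start_url, parse)])
    pipeline_output = []
    while scheduler:
        request = scheduler.popleft()
        page = FAKE_SITE[request.url]  # plays the role of the downloader
        for result in request.callback(request.url, page):
            if isinstance(result, Request):
                scheduler.append(result)        # Request -> scheduler
            else:
                pipeline_output.append(result)  # item -> pipeline
    return pipeline_output


items = crawl("page1")
```

The real framework does the same sorting of spider output, but asynchronously and with middlewares sitting between the components.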

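Spiders

Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html

Spiders returning Items

Before a spider can return items, the fields to be returned must be defined in items.py. Define them directly in the Item class; once the spider has processed the data, it simply instantiates that class.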

# Defines the structure of the data that is passed to the pipeline
class ScrapyDangdangE02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Image URL
    img = scrapy.Field()
    # Book title
    name = scrapy.Field()
    # Book price
    price = scrapy.Field()

Once the item is defined, yield instances of it from the spider's parse method.

    def parse(self, response):
        root_list = response.xpath('//div[@id="search_nature_rg"]/ul[@id="component_59"]/li')
        for i in root_list:
            src = i.xpath('.//img/@data-original').extract_first()
            if src is None:
                img = i.xpath('.//img/@src').extract_first()
            else:
                img = src
            # print(img)
            name = i.xpath('.//img/@alt').extract_first()
            price = i.xpath('.//p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
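            print(img, name, price, sep="\t")  # sep avoids a TypeError when a field is None
            # Wrap the extracted fields in an Item
            book = ScrapyDangdangE02Item(img=img, name=name, price=price)
            # yield hands each item to the pipeline as soon as it is produced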
            yield book

Spiders returning Requests

To return a Request, create a scrapy.Request object. Two of its parameters deserve attention: url and callback.

    def parse(self, response):
        root_list = response.xpath('//div[@id="search_nature_rg"]/ul[@id="component_59"]/li')
        for i in root_list:
            src = i.xpath('.//img/@data-original').extract_first()
            if src is None:
                img = i.xpath('.//img/@src').extract_first()
            else:
                img = src
            # print(img)
            name = i.xpath('.//img/@alt').extract_first()
            price = i.xpath('.//p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
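            print(img, name, price, sep="\t")  # sep avoids a TypeError when a field is None
            # Wrap the extracted fields in an Item
            book = ScrapyDangdangE02Item(img=img, name=name, price=price)
            # yield hands each item to the pipeline as soon as it is produced
            yield book

        # Get the href of the "next page" button
        next_page = response.xpath('//div[@class="paging"]//ul[@name="Fy"]/li[@class="next"]/a/@href').extract_first()
        # Build and yield the follow-up Request only when a next page exists;
        # concatenating the URL before this check raises TypeError on the last page
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)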

The callback parameter names the function that receives the response once Scrapy has downloaded the given URL.
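Building the URL for the follow-up Request is the step most likely to go wrong: the extracted href may be relative, absolute, or missing on the last page. The sketch below uses only the standard library; next_page_url is a hypothetical helper name, and the href values are invented examples rather than real Dangdang links. Inside a spider, response.urljoin(href) performs the same join against the current page URL.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_url(current_url: str, href: Optional[str]) -> Optional[str]:
    """Return an absolute URL for the next page, or None when there is none."""
    if not href:
        # extract_first() returns None when the "next" link is absent.
        return None
    # urljoin handles relative paths, root-relative paths and absolute URLs.
    return urljoin(current_url, href)


base = "https://category.dangdang.com/cp01.36.01.02.00.00.html"
root_relative = next_page_url(base, "/pg2-cp01.36.01.02.00.00.html")
relative = next_page_url(base, "pg2.html")
last_page = next_page_url(base, None)
```

Checking for None before touching the string is what keeps the spider from crashing on the final page.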

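The Request class accepts many more parameters, such as meta, method, body, headers, dont_filter, and errback; several of them are described later in this article.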

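Starting a Scrapy crawler

1. Create the crawler project

scrapy startproject scrapy_dangdang_E02

2. Modify the settings file

In the project's settings.py, comment out the ROBOTSTXT_OBEY setting so that it falls back to its default of False. robots.txt is a voluntary convention (a "gentlemen's agreement"); while it is obeyed, some sites cannot be scraped at all.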

# ROBOTSTXT_OBEY = True
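To make concrete what obeying robots.txt means, here is a small standard-library sketch that checks URLs against a set of rules. The robots.txt content, the bot name, and the example.com URLs are invented for the illustration; this is not how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no download is needed.
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that honours robots.txt asks this question before every request.
allowed = parser.can_fetch("my-bot", "https://example.com/books/1.html")
blocked = parser.can_fetch("my-bot", "https://example.com/private/data.html")
```

With ROBOTSTXT_OBEY enabled, requests for disallowed URLs are dropped before they are downloaded, which is why some sites return no data at all.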

3. Create the spider file

scrapy genspider dangdang category.dangdang.com/cp01.36.01.02.00.00.html

4. Define the scraping target

Open items.py in the project and define the fields to scrape:

# Defines the structure of the data that is passed to the pipeline
class ScrapyDangdangE02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Image URL
    img = scrapy.Field()
    # Book title
    name = scrapy.Field()
    # Book price
    price = scrapy.Field()

5. Rewrite the parse method

Open the spider file that was just generated, dangdang.py, and write its parse method to extract the data from the response.

    def parse(self, response):
        root_list = response.xpath('//div[@id="search_nature_rg"]/ul[@id="component_59"]/li')
        for i in root_list:
            src = i.xpath('.//img/@data-original').extract_first()
            if src is None:
                img = i.xpath('.//img/@src').extract_first()
            else:
                img = src
            # print(img)
            name = i.xpath('.//img/@alt').extract_first()
            price = i.xpath('.//p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
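            print(img, name, price, sep="\t")  # sep avoids a TypeError when a field is None
            # Wrap the extracted fields in an Item
            book = ScrapyDangdangE02Item(img=img, name=name, price=price)
            # yield hands each item to the pipeline as soon as it is produced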
            yield book

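6. Enable the pipeline

After the spider yields item objects, the data is passed to the pipeline. To use a pipeline, first enable it in settings.py:

ITEM_PIPELINES = {
   # Several pipelines can be configured, each with a priority; the value is customarily in the 0-1000 range, and lower values run first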
   "scrapy_dangdang_E02.pipelines.ScrapyDangdangE02Pipeline": 300,
}

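7. Rewrite the pipeline class

In pipelines.py, rewrite the pipeline class and put the pipeline logic in process_item:

# To use a pipeline, first enable it through `ITEM_PIPELINES` in settings.py
class ScrapyDangdangE02Pipeline:
    def open_spider(self, spider):
        '''
        Lifecycle hook called once, when the spider is opened
        :param spider:
        :return:
        '''
        self.fp = open('./dangdang.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        '''
        Called once for every item the spider yields
        :param item:
        :param spider:
        :return:
        '''
        self.fp.write(str(item))
        return item
    def close_spider(self, spider):
        '''
        Lifecycle hook called automatically when the spider finishes
        :param spider:
        :return:
        '''
        self.fp.close()

open_spider and close_spider are not part of the generated pipeline class; they are optional hooks that you add yourself, while process_item is the only required method. Once they are defined, Scrapy calls them automatically when the spider opens and closes. __init__ and __del__ can manage the object's lifecycle in a similar way, but open_spider and close_spider are the hooks Scrapy provides for this purpose.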
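One thing to be aware of in the pipeline above: str(item) writes Python's text representation of the item, so dangdang.json will not be valid JSON. A possible variant, sketched here under the assumption that one JSON object per line (JSON Lines) is acceptable, keeps the same three hooks. The class name JsonLinesPipeline, the path argument, and the sample records are hypothetical; the class is driven by hand so that it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Sketch of a pipeline that writes one JSON object per line."""

    def __init__(self, path="dangdang.jsonl"):
        # The default lets Scrapy create the pipeline without arguments;
        # the parameter exists only to make this demo easy to redirect.
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item: serialise it as a single JSON line.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.fp.close()


# Drive the hooks by hand, in the order Scrapy would call them.
out_path = os.path.join(tempfile.mkdtemp(), "books.jsonl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
sample_items = [
    {"name": "Book One", "price": "10.00"},
    {"name": "Book Two", "price": "20.00"},
]
for book in sample_items:
    pipeline.process_item(book, spider=None)
pipeline.close_spider(spider=None)

# Every line can be parsed back independently.
with open(out_path, encoding="utf-8") as fp:
    rows = [json.loads(line) for line in fp]
```

Writing one object per line means the file stays readable even if the crawl is interrupted part-way through.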

Hands-on project: scraping Dangdang

Complete code:

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

# Defines the structure of the data that is passed to the pipeline
class ScrapyDangdangE02Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Image URL
    img = scrapy.Field()
    # Book title
    name = scrapy.Field()
    # Book price
    price = scrapy.Field()

dangdang.py:

import scrapy
from scrapy_dangdang_E02.items import ScrapyDangdangE02Item


class DangdangSpider(scrapy.Spider):
    name = "dangdang"
    allowed_domains = ["category.dangdang.com"]
    start_urls = ["https://category.dangdang.com/cp01.36.01.02.00.00.html"]

    def parse(self, response):
        root_list = response.xpath('//div[@id="search_nature_rg"]/ul[@id="component_59"]/li')
        for i in root_list:
            src = i.xpath('.//img/@data-original').extract_first()
            if src is None:
                img = i.xpath('.//img/@src').extract_first()
            else:
                img = src
            # print(img)
            name = i.xpath('.//img/@alt').extract_first()
            price = i.xpath('.//p[@class="price"]/span[@class="search_now_price"]/text()').extract_first()
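            print(img, name, price, sep="\t")  # sep avoids a TypeError when a field is None
            # Wrap the extracted fields in an Item
            book = ScrapyDangdangE02Item(img=img, name=name, price=price)
            # yield hands each item to the pipeline as soon as it is produced
            yield book

        # Get the href of the "next page" button
        next_page = response.xpath('//div[@class="paging"]//ul[@name="Fy"]/li[@class="next"]/a/@href').extract_first()
        # Build and yield the follow-up Request only when a next page exists;
        # concatenating the URL before this check raises TypeError on the last page
        if next_page is not None:
            url = response.urljoin(next_page)
            # Hand the Request back to the scheduler so the next page is downloaded and parsed
            yield scrapy.Request(url=url, callback=self.parse)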

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

# To use a pipeline, first enable it through `ITEM_PIPELINES` in settings.py
class ScrapyDangdangE02Pipeline:
    def open_spider(self, spider):
        '''
        Lifecycle hook called once, when the spider is opened
        :param spider:
        :return:
        '''
        self.fp = open('./dangdang.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        '''
        Called once for every item the spider yields
        :param item:
        :param spider:
        :return:
        '''
        self.fp.write(str(item))
        return item
    def close_spider(self, spider):
        '''
        Lifecycle hook called automatically when the spider finishes
        :param spider:
        :return:
        '''
        self.fp.close()

settings.py:

# Scrapy settings for scrapy_dangdang_E02 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_dangdang_E02"

SPIDER_MODULES = ["scrapy_dangdang_E02.spiders"]
NEWSPIDER_MODULE = "scrapy_dangdang_E02.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_dangdang_E02 (+http://www.yourdomain.com)"

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "scrapy_dangdang_E02.middlewares.ScrapyDangdangE02SpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "scrapy_dangdang_E02.middlewares.ScrapyDangdangE02DownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # Several pipelines can be configured, each with a priority; the value is customarily in the 0-1000 range, and lower values run first
   "scrapy_dangdang_E02.pipelines.ScrapyDangdangE02Pipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

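This project only uses GET requests; POST requests come later. Also, if you look closely at how Scrapy works, you will see that data passes through middlewares. My current needs are simple enough that I have not used them yet; I will look at them in detail later.

Multiple pipelines

Using the project above as the example, just write another pipeline class in pipelines.py, modelled on the existing one: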

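import os
import urllib.request as ur

class ScrapyDangdangBookImgPipeline:
    def process_item(self, item, spider):
        img = item.get('img')
        name = item.get('name')
        # This assumes the scraped src is protocol-relative (starts with //), so the scheme is prepended
        url = 'https:' + img
        # urlretrieve does not create missing directories
        os.makedirs('./book', exist_ok=True)
        ur.urlretrieve(url=url, filename='./book/' + name + '.jpg')
        return item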
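The image pipeline above builds both the download URL and the file name by string concatenation, which breaks if the src is already absolute or the book title contains characters such as "/" or "?". A more defensive pair of helpers could look like the following sketch. The names absolute_image_url and safe_filename, the default page URL, and the sample inputs are assumptions made for the illustration.

```python
import re
from urllib.parse import urljoin


def absolute_image_url(src, page_url="https://category.dangdang.com/"):
    """Resolve a protocol-relative, relative or absolute src against the page URL."""
    return urljoin(page_url, src)


def safe_filename(name, suffix=".jpg", max_length=80):
    """Replace characters that are unsafe in file names and cap the length."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return (cleaned[:max_length] or "untitled") + suffix


# Hypothetical inputs of the kinds a listing page might contain.
url_a = absolute_image_url("//img.example.com/cover/1.jpg")
url_b = absolute_image_url("https://img.example.com/cover/2.jpg")
file_a = safe_filename("Python: A/B Test?")
file_b = safe_filename("???")
```

Inside process_item the two helpers would replace the 'https:' + img and './book/' + name + '.jpg' expressions.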

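Then register the new pipeline in settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # Several pipelines can be configured, each with a priority; the value is customarily in the 0-1000 range, and lower values run first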
   "scrapy_dangdang_E02.pipelines.ScrapyDangdangE02Pipeline": 300,
   "scrapy_dangdang_E02.pipelines.ScrapyDangdangBookImgPipeline":200,
}
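With two pipelines registered, the priority numbers decide which one sees each item first. The short sketch below only sorts the mapping by value to show the resulting order; execution_order is a hypothetical helper, not a Scrapy function.

```python
# Same mapping as in the settings shown above.
ITEM_PIPELINES = {
    "scrapy_dangdang_E02.pipelines.ScrapyDangdangE02Pipeline": 300,
    "scrapy_dangdang_E02.pipelines.ScrapyDangdangBookImgPipeline": 200,
}


def execution_order(pipelines):
    """Return pipeline class names in the order items would pass through them."""
    ranked = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path.rsplit(".", 1)[-1] for path, _priority in ranked]


order = execution_order(ITEM_PIPELINES)
print(order)
```

Here the image pipeline (200) runs before the file-writing pipeline (300), so each process_item must return the item for the next pipeline to receive it.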

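Scraping data across pages

The earlier projects scraped and saved data from the current page only. What if the data I want is split between a first page and a second page? Take a job site: the listing page shows brief information about each position (company size, company name, short requirements, benefits, and so on), and only clicking a position opens its detail page with the full requirements and qualifications. How do we collect both sets of data together?

Recall the parameters of Scrapy's Request class; the meta parameter solves this problem:

callback: the function that handles the response for this URL
meta: passes data between different parse functions
method: the HTTP method, such as GET or POST
body: the request body, for example the parameters of a POST request
dont_filter: defaults to False, meaning a request that duplicates one already made is filtered out and not sent again
errback: the function called when the request for this URL fails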
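The effect of dont_filter can be pictured with a toy duplicate filter. This is a simplified model written for this article: real Scrapy compares request fingerprints (built from the method, URL, and body) rather than raw URL strings, and the schedule function below is a hypothetical stand-in for the scheduler.

```python
def schedule(requests):
    """Toy scheduler: requests is an iterable of (url, dont_filter) pairs."""
    seen = set()
    accepted = []
    for url, dont_filter in requests:
        if not dont_filter and url in seen:
            # Default behaviour: a repeated request is silently dropped.
            continue
        seen.add(url)
        accepted.append(url)
    return accepted


queue = schedule([
    ("page1", False),
    ("page2", False),
    ("page1", False),  # duplicate, filtered out
    ("page1", True),   # duplicate, but dont_filter=True lets it through
])
```

Setting dont_filter=True is useful when the same URL really has to be fetched again, for example to re-check a page that changes.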

Hands-on project: scraping wallpapers from a website

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDy221E03Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_name = scrapy.Field()
    img_path = scrapy.Field()

Spider file (in the spiders directory):

import scrapy

from scrapy_dy221_E03.items import ScrapyDy221E03Item


class BizhiSpider(scrapy.Spider):
    name = "bizhi"
    allowed_domains = ["pic.netbian.com"]
    start_urls = ["https://pic.netbian.com/index.html"]

    def parse(self, response):
        # print('----------------')
        root_list = response.xpath('//div[@class="slist"]/ul[@class="clearfix"]/li')
        for i in root_list:
            img_href = i.xpath('./a/@href').extract_first()
            img_path = 'https://pic.netbian.com' + img_href
            img_name = i.xpath('./a/@title').extract_first()
            # print(img_name + '：' + img_path)
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0'
            }
            yield scrapy.Request(url=img_path,callback=self.parse_one,meta={'name':img_name},headers=headers)

    def parse_one(self,response):
        img = response.xpath('//*[@id="img"]/img/@src').extract_first()
        img = 'https://pic.netbian.com' + img
        img_object = ScrapyDy221E03Item(img_path=img, img_name=response.meta['name'])
        yield img_object

The meta parameter of scrapy.Request carries information to the next callback. Note that meta must be a dict, and the callback reads it back through the response.meta attribute.

settings.py:

# Scrapy settings for scrapy_dy221_E03 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "scrapy_dy221_E03"

SPIDER_MODULES = ["scrapy_dy221_E03.spiders"]
NEWSPIDER_MODULE = "scrapy_dy221_E03.spiders"

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "scrapy_dy221_E03.pipelines.ScrapyDy221E03Pipeline": 300,
}
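# Scrapy only reads upper-case names from settings.py, so a lower-case `headers` dict is ignored;
# USER_AGENT is the built-in setting for the default User-Agent header
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0'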

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

import os
import urllib.request as ur

class ScrapyDy221E03Pipeline:
    def process_item(self, item, spider):
        # Small anti-scraping workaround: the target site blocks urllib's default User-Agent, so install an opener that sends a browser one
        opener = ur.build_opener()
        opener.addheaders = [('User-Agent',
                              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0')]
        ur.install_opener(opener)
        print('Downloading image: ' + item['img_name'], 'URL: ' + item['img_path'])
        # urlretrieve does not create missing directories
        os.makedirs('./img', exist_ok=True)
        ur.urlretrieve(url=item['img_path'], filename='./img/' + item['img_name'] + '.jpg')
        return item

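The site applies an anti-scraping check against the urllib library: a browser User-Agent header has to be added through build_opener. Without it the download fails with: HTTPError: HTTP Error 403: Forbidden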
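Installing a global opener changes the headers of every later urllib call in the process. An alternative, sketched here with the standard library only, is to attach the header to each request individually. The function names build_image_request and download, the timeout value, and the example.com URL are assumptions for the illustration; download is defined but not called, because it needs network access.

```python
import shutil
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
)


def build_image_request(url, user_agent=USER_AGENT):
    """Create a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename, timeout=10):
    """Stream the response body to a file (performs a real network request)."""
    request = build_image_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as resp, \
            open(filename, "wb") as fp:
        shutil.copyfileobj(resp, fp)


# Building the request does not touch the network, so it is safe to inspect.
req = build_image_request("https://example.com/a.jpg")
sent_user_agent = req.get_header("User-agent")  # urllib stores header names capitalised this way
```

Whether a per-request header is enough to avoid the 403 depends on the site; the original approach with build_opener is the one the author confirmed to work.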

Posted 2024-04-16 11:49 by MrSponge