Scrapy (1): Basic Usage

Getting Started

1. Install the package

pip install Scrapy

2. Create a project

# generates the folder scrapydemo1
PS E:\PyProject> scrapy startproject scrapydemo1
New Scrapy project 'scrapydemo1', using template directory 'E:\code_tool\python\lib\site-packages\scrapy\templates\project', created in:
    E:\PyProject\scrapydemo1

You can start your first spider with:
    cd scrapydemo1
    scrapy genspider example example.com


PS E:\PyProject> cd scrapydemo1
PS E:\PyProject\scrapydemo1> ls


# The directory listing looks like this:
    Directory: E:\PyProject\scrapydemo1


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         2022/12/4     10:43                scrapydemo1   # the project's module; imports are made from this package
-a----         2022/12/4     10:43            265 scrapy.cfg # Scrapy's deployment configuration file


PS E:\PyProject\scrapydemo1> cd .\scrapydemo1\
PS E:\PyProject\scrapydemo1\scrapydemo1> ls


    Directory: E:\PyProject\scrapydemo1\scrapydemo1


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         2022/12/4     10:42                spiders
-a----         2022/12/4     10:43            267 items.py
-a----         2022/12/4     10:43           3658 middlewares.py
-a----         2022/12/4     10:43            365 pipelines.py
-a----         2022/12/4     10:43           3306 settings.py
-a----         2022/12/4     10:42              0 __init__.py

3. Create a Spider

Enter the newly created scrapydemo1 folder and run the genspider command. The first argument is the Spider name and the second is the domain of the site to crawl.

PS E:\PyProject\scrapydemo1\scrapydemo1> scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  scrapydemo1.spiders.quotes

Default content of quotes.py:

import scrapy

'''
Attribute overview:
* name: the name of the Spider, used to distinguish different Spiders; it must be unique. If the Spider crawls a single website,
  a common practice is to name it after that site's domain, e.g. a Spider that crawls mywebsite.com is usually named mywebsite
* allowed_domains: the domains the Spider is allowed to crawl; initial or follow-up request links outside these domains are filtered out
* start_urls: the list of URLs the Spider starts crawling from; the initial requests are defined by it
* parse: a method of the Spider. By default, once the requests built from the start_urls links finish downloading, the parse method is called,
  and the returned response is passed to it as its only argument. This method is responsible for parsing the response, extracting data,
  or generating further requests to process
'''
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

4. Create an Item

The items.py that the project generates by default:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Scrapydemo1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
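
The spider in the next step imports a QuotesItem from scrapydemo1.items, which is not generated automatically. A minimal definition matching the fields used there might look like this (a sketch; the field names text, author and tags are taken from that spider):

import scrapy


class QuotesItem(scrapy.Item):
    # fields filled in by the quotes spider
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()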

5. Parse the Response

import scrapy
from scrapydemo1.items import QuotesItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotesItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # pagination: build the next-page request outside the per-quote loop
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page:
            url = response.urljoin(next_page)  # turns a relative path into an absolute URL, e.g. /page/2/ -> https://quotes.toscrape.com/page/2/
            yield scrapy.Request(url=url, callback=self.parse)  # request the next page; callback sets the parsing method, so the crawl keeps looping until there is no next page

Run:

scrapy crawl quotes

Partial output: (screenshot omitted)

Save the scraped data to a file:

  • JSON file
scrapy crawl quotes -o quotes.json

  • JSON Lines (.jl) file: jl is short for JSON Lines; each item is written as a single line of JSON
scrapy crawl quotes -o quotes.jl

  • other formats
# save as CSV
scrapy crawl quotes -o quotes.csv
# save as XML
scrapy crawl quotes -o quotes.xml

Common problems encountered while getting started:

1. ModuleNotFoundError: No module named 'attrs':
pip install attrs --upgrade

2. Connection timeouts:
2022-12-04 21:19:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://quotes.toscrape.com/robots.txt> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

# Cause: unstable network; try again a few times

Item Pipeline: saving data to MongoDB

If you want to perform more complex operations, such as saving the results to a MongoDB database or filtering out certain useful items, you can define an item pipeline to do so.

Once an item is generated, it is automatically sent to the item pipeline for processing. We can use an item pipeline to:

  • clean HTML data
  • validate the scraped data and check the scraped fields
  • check for and drop duplicate content
  • store the scraped results in a database

Implementing an item pipeline is simple: define a class that implements a process_item method, and the item pipeline calls this method automatically. process_item
must return a dict containing the data or an Item object, or raise a DropItem exception.

The process_item method takes two parameters:

  • item: every item generated by the Spider is passed in as this parameter
  • spider: the Spider instance itself

The steps to store the data in MongoDB are as follows:

  1. Implement the item pipeline classes; they need to be defined in the pipelines.py file:
from scrapy.exceptions import DropItem
import pymongo


class TextPipeline():
    '''
    Processes the item's text attribute.
    '''

    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing text')  # drop the item if text is missing


class MongoDBPipeline():
    def __init__(self, connection, database):
        self.connection = connection
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):  # use the crawler to read the global settings
        return cls(crawler.settings.get('MONGODB_CONNECTION'),
                   crawler.settings.get('MONGODB_DATABASE'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.connection)
        self.db = self.client[self.database]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

  • from_crawler: a class method, implemented in a dependency-injection style. Through the crawler we can access every piece of the global
    configuration, read the database settings from it, and then return an instance of the class
  • open_spider: called when the Spider is opened, mainly for initialization work
  • close_spider: called when the Spider is closed, where the database connection is closed

2. Add the database configuration to settings.py:

# MongoDB configuration
MONGODB_CONNECTION = 'localhost'
MONGODB_DATABASE = 'scrapydemo1'

3. Register the item pipelines in settings.py:

# the value is the calling priority: the smaller the number, the earlier the pipeline is called
ITEM_PIPELINES = {
   'scrapydemo1.pipelines.TextPipeline': 300,
   'scrapydemo1.pipelines.MongoDBPipeline': 400,
}

4. Run the crawl

scrapy crawl quotes

Tip: the database was created manually before running; if a collection does not exist, MongoDB creates it automatically.
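
To spot-check what was written, a short pymongo query can be run separately (a minimal sketch, assuming MongoDB on localhost and the database name configured above; the collection is named after the item class, as done in MongoDBPipeline):

import pymongo

client = pymongo.MongoClient('localhost')
db = client['scrapydemo1']
for doc in db['QuotesItem'].find().limit(3):  # print the first three stored items
    print(doc)
client.close()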

Parsing page data with Selector

Source of https://doc.scrapy.org/en/latest/_static/selectors-sample1.html:

<!DOCTYPE html>

<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>

You can use the Scrapy shell to send an HTTP request, get a response object, and use that response object to parse the HTML:

C:\Users\86158>Scrapy shell https://doc.scrapy.org/en/latest/_static/selectors-sample1.html
2022-12-04 22:16:00 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: scrapybot)
2022-12-04 22:16:00 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19044-SP0
2022-12-04 22:16:00 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2022-12-04 22:16:00 [py.warnings] WARNING: E:\code_tool\python\lib\site-packages\scrapy\utils\request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2022-12-04 22:16:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-12-04 22:16:00 [scrapy.extensions.telnet] INFO: Telnet Password: b3736b50ebd9df0b
2022-12-04 22:16:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-12-04 22:16:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-04 22:16:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-04 22:16:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-04 22:16:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-04 22:16:01 [scrapy.core.engine] INFO: Spider opened
2022-12-04 22:16:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000016D4B7C3D00>
[s]   item       {}
[s]   request    <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x0000016D4B7DB4F0>
[s]   spider     <DefaultSpider 'default' at 0x16d4bc3be80>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

You can also wrap the HTML source in a Selector on its own and then extract elements from it:

# @Time: 2022/12/29 22:13
# @Author: zzc
# @Project: scrapyitempipelinedemo
# @File: test.py


from scrapy import Selector

body = """<!DOCTYPE html>

<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/>zhangsan</a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>"""

s = Selector(text=body)

# match the <a> tag by its text content and extract an attribute value
str_href = s.xpath("//a[text()='zhangsan']/@href").extract_first()
print(str_href)  # image1.html

XPath

>>> response.selector.xpath('//a')
[<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>, <Selector xpath='//a' data='<a href="image2.html">Name: My image ...'>, <Selector xpath='//a' data='<a href="image3.html">Name: My image ...'>, <Selector xpath='//a' data='<a href="image4.html">Name: My image ...'>, <Selector xpath='//a' data='<a href="image5.html">Name: My image ...'>]

This returns a SelectorList.

# find the img elements inside the a tags; . means relative to the current node
>>> result = response.xpath('//a')
>>> result.xpath('./img')
[<Selector xpath='./img' data='<img src="image1_thumb.jpg" alt="imag...'>, <Selector xpath='./img' data='<img src="image2_thumb.jpg" alt="imag...'>, <Selector xpath='./img' data='<img src="image3_thumb.jpg" alt="imag...'>, <Selector xpath='./img' data='<img src="image4_thumb.jpg" alt="imag...'>, <Selector xpath='./img' data='<img src="image5_thumb.jpg" alt="imag...'>]

# a SelectorList can be accessed directly by index
>>> result[0]
<Selector xpath='//a' data='<a href="image1.html">Name: My image ...'>

# extract the a tags from the Selectors
>>> result.extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']


# extract the element from the first Selector
>>> result.extract_first()
'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>'

text(): extract the text inside an HTML element

# use text() to extract an element's content
>>> response.xpath('//a/text()')
[<Selector xpath='//a/text()' data='Name: My image 1 '>, <Selector xpath='//a/text()' data='Name: My image 2 '>, <Selector xpath='//a/text()' data='Name: My image 3 '>, <Selector xpath='//a/text()' data='Name: My image 4 '>, <Selector xpath='//a/text()' data='Name: My image 5 '>]
>>> response.xpath('//a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.xpath('//a/text()').extract_first()
'Name: My image 1 '

@attribute: extract attribute values

>>> response.xpath('//a/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.xpath('//a/@href').extract_first()
'image1.html'


# match by attribute value
>>> response.xpath('//a[@href="image1.html"]/text()').extract_first()
'Name: My image 1 '

# return a default value if nothing matches
>>> response.xpath('//a[@href="image6.html"]/text()').extract_first('no find ele')
'no find ele'

CSS

# ID selector
>>> response.css('#images')
[<Selector xpath="descendant-or-self::*[@id = 'images']" data='<div id="images">\n      <a href="imag...'>]

# find all a tags
>>> response.css('a')
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image ...'>, <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image ...'>]

# filter by attribute
>>> response.css('a[href="image1.html"]').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>']

::text extracts text content

>>> response.css('a::text').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

::attr extracts attribute values

>>> response.css('a[href="image1.html"] img::attr(src)').extract()
['image1_thumb.jpg']

Regular expressions

# the re() method
>>> response.xpath('//a/text()').re('Name:\s(.*)')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

# with multiple capture groups, the results are still output in order
>>> response.xpath('//a/text()').re('(.*?):\s(.*)')
['Name', 'My image 1 ', 'Name', 'My image 2 ', 'Name', 'My image 3 ', 'Name', 'My image 4 ', 'Name', 'My image 5 ']


# re_first() returns only the first match, analogous to extract_first()
>>> response.xpath('//a/text()').re_first('(.*?):\s(.*)')
'Name'

>>> response.xpath('//a/text()').re_first('Name:\s(.*)')
'My image 1 '


# response cannot call re() directly
>>> response.re('Name:\s(.*)')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'HtmlResponse' object has no attribute 're'

# combine it with xpath

>>> response.xpath('.').re('Name:\s(.*)')
['My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>', 'My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>', 'My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>', 'My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>', 'My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']

Example 2

# 1. Create the project
scrapy startproject scrapyhttpbindemo
cd scrapyhttpbindemo
# 2. Create the spider
scrapy genspider httpbin www.httpbin.org

The Spider class:

import scrapy


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['www.httpbin.org']
    start_urls = ['https://www.httpbin.org/get']

    def parse(self, response):
        print('url',response.url)
        print('request',response.request)
        print('status',response.status)
        print('headers',response.headers)
        print('text',response.text)
        print('meta',response.meta)


# out:
'''
url https://www.httpbin.org/get
request <GET https://www.httpbin.org/get>
status 200
headers {b'Content-Length': [b'417'], b'Date': [b'Sun, 11 Dec 2022 01:38:04 GMT'], b'Content-Type': [b'application/json'], b'Server': [b'gunicorn/19.9.0'], b'Access
-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true']}
text {
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.7.1 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-6395347c-2878436a4bb5a73261fd928f"
  },
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/get"
}

meta {'download_timeout': 180.0, 'download_slot': 'www.httpbin.org', 'download_latency': 0.24669480323791504}
'''

meta: some extra information; parameters like these are usually attached to the meta attribute

This example does not explicitly declare the initial requests, because Spider already implements a default start_requests method for us,
inherited from the parent class:
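
Roughly, the inherited start_requests iterates over start_urls and yields a GET Request for each URL; a simplified sketch, not the exact library source:

    # simplified equivalent of scrapy.Spider.start_requests
    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True so the initial URLs are never dropped by the duplicate filter
            yield scrapy.Request(url, dont_filter=True)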

Customizing the initial requests

GET

import scrapy
from scrapy import Request

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['www.httpbin.org']
    start_url = 'https://www.httpbin.org/get'  # httpbin echoes the details of the GET request

    headers = {
        'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36'
    }

    cookies = {'name':'wangcai', 'age':26}

    # override the parent class method
    def start_requests(self):
        for offset in range(5):
            url = self.start_url + f'?offset={offset}'
            yield Request(url, headers=self.headers,cookies=self.cookies,callback=self.parse_resp,
                          meta={'offset':offset})

    def parse_resp(self,response):
        print('url',response.url)
        print('request',response.request)
        print('status',response.status)
        print('headers',response.headers)
        print('text',response.text)
        print('meta',response.meta)  # the extra parameters carried by the request


# out:
'''
url https://www.httpbin.org/get?offset=0
request <GET https://www.httpbin.org/get?offset=0>
status 200
headers {b'Content-Length': [b'583'], b'Date': [b'Sun, 11 Dec 2022 01:55:51 GMT'], b'Content-Type': [b'application/json'], b'Server': [b'gunicorn/19.9.0'], b'Access
-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true']}
text {
  "args": {
    "offset": "0"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Cookie": "name=wangcai; age=26",
    "Host": "www.httpbin.org",
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-639538a6-5eb791924c49b96d203f73a6"
  },
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/get?offset=0"
}

meta {'offset': 0, 'download_timeout': 180.0, 'download_slot': 'www.httpbin.org', 'download_latency': 0.5496621131896973}
...
'''

POST



import scrapy
from scrapy import Request
from scrapy.http import FormRequest, JsonRequest


class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['www.httpbin.org']
    start_url = 'https://www.httpbin.org/post'  # httpbin echoes the details of the POST request

    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36'
    }

    data = {'name': 'wangcai', 'age': '26'}

    # override the parent class method
    def start_requests(self):
        yield FormRequest(self.start_url, formdata=self.data, callback=self.parse_resp)  # form-encoded request
        yield JsonRequest(self.start_url, data=self.data, callback=self.parse_resp)   # JSON-encoded request
  
    def parse_resp(self, response):
        print('text:', response.text)

# out:
'''
text: {              
  "args": {},        
  "data": "",        
  "files": {},       
  "form": {          
    "age": "26",     
    "name": "wangcai"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Content-Length": "19",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.7.1 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-63953c11-0d2256250edd887415940663"
  },
  "json": null,
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/post"
}

2022-12-11 10:10:26 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
text: {
  "args": {},
  "data": "{\"age\": \"26\", \"name\": \"wangcai\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en",
    "Content-Length": "32",
    "Content-Type": "application/json",
    "Host": "www.httpbin.org",
    "User-Agent": "Scrapy/2.7.1 (+https://scrapy.org)",
    "X-Amzn-Trace-Id": "Root=1-63953c12-0558fede42d9a64d5a632a8b"
  },
  "json": {
    "age": "26",
    "name": "wangcai"
  },
  "origin": "120.229.34.25",
  "url": "https://www.httpbin.org/post"
}
'''

data = {'name': 'wangcai', 'age': '26'}: when submitting a form, both keys and values must be sent as strings
data = {'name': 'wangcai', 'age': 26}: raises an error => TypeError: to_bytes must receive a str or bytes object, got int
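
If the source data contains non-string values, converting them before building the request avoids that error (a small illustrative snippet; raw_data is a made-up name):

raw_data = {'name': 'wangcai', 'age': 26}
# formdata requires str keys and values
form_data = {k: str(v) for k, v in raw_data.items()}
# FormRequest(self.start_url, formdata=form_data, callback=self.parse_resp) can then be yielded as above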

Using the Spider

How a Spider runs

When implementing a Scrapy crawler project, the most central class is the Spider class, which defines how a site is crawled and how its pages are parsed. Put simply, a Spider does two things:

  • define the actions for crawling the website
  • analyze the crawled pages

For the Spider class, the whole crawl loop works as follows:

1. Initialize Requests with the starting URLs and set a callback method. When such a Request succeeds and returns, a Response is generated and passed to that callback method as an argument.
2. Analyze the returned page content inside the callback method. The result can take two forms:
  • parsed, useful results returned as dicts or Item objects, which can be saved directly or saved after further processing
  • links to further pages, which can be used to construct new Requests with their own callback methods
3. If dicts or Item objects are returned, they can be written to a file via Feed Exports or, if a Pipeline is configured, processed by the Pipeline (filtering, correction, and so on) and then saved.
4. If a Request is returned, then once it succeeds the new Response is passed again to the callback defined in that Request, where selectors can be used to analyze the newly obtained page content and generate Items from the data.

Repeating these steps completes the crawl of the site.

Spider class analysis

Attributes:

- name: the crawler name, a string that defines the Spider's name. The name defines how Scrapy locates and
initializes the Spider, so it is required and must be unique. We can, however, create multiple instances of the same Spider;
there is no restriction on that. If the Spider crawls a single website, a common practice is to name the Spider after that website's domain

- allowed_domains: the domains allowed to be crawled; an optional setting. Links outside this range are not followed

- start_urls: the list of starting URLs; when we do not implement the start_requests method, crawling starts from this list by default

- custom_settings: a dictionary of settings specific to this Spider. It overrides the global settings, and it must be
  updated before initialization, so it has to be defined as a class variable

- crawler: this attribute is set by the from_crawler method and represents the Crawler object this Spider belongs to. The Crawler object
contains many project components; through it we can obtain some of the project's configuration, most commonly the Settings

- settings: a Settings object, which lets us read the project's global settings directly

Methods:

- start_requests: used to generate Request objects; it must return an iterable. By default it builds Requests from the URLs
in start_urls using the GET method. If you want to start by requesting a site with POST, you can override this method directly

- parse: called by default when a Response has no callback specified. It is responsible for processing the Response and must return an
iterable containing Requests or Items

- closed: called when the Spider is closed; resource-releasing or other cleanup operations are usually defined here
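
A small sketch tying several of these attributes and methods together (the DOWNLOAD_DELAY value and the log messages are illustrative choices, not requirements):

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://quotes.toscrape.com/']

    # per-spider settings; these override the project-wide values in settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
    }

    def parse(self, response):
        # self.settings exposes the merged project configuration
        self.logger.info('bot name: %s', self.settings.get('BOT_NAME'))
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # called once when the spider finishes; reason is e.g. 'finished'
        self.logger.info('spider closed: %s', reason)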

Request

A Request object is an instance of scrapy.http.Request. It contains the basic information of an HTTP request; the Engine hands it to the
Downloader for processing, which returns a Response object.

Its constructor parameters:

  • url: the page link of the Request, i.e. the Request URL
  • callback: the callback method, which must be defined in the Spider class; if not specified, the parse method is used by default
  • method: the HTTP method of the Request. The default is GET; it can also be POST, PUT, DELETE, etc.
  • meta: extra parameters carried by the Request. Any value can be stored in meta; certain keys are handled by specific Scrapy components
    to achieve different effects. meta can also be used to pass information to the callback method
  • body: the body of the Request
  • headers: the request headers, as a dict
  • cookies: the cookies carried by the Request, as a dict or a list
  • encoding: the encoding of the Request, utf-8 by default
  • priority: the priority of the Request, 0 by default. The priority is used by the Scheduler when scheduling Requests: the higher the value,
    the earlier the Request is executed
  • dont_filter: disables deduplication for this Request. By default Scrapy deduplicates Requests based on their information so that no
    duplicate requests occur during the crawl; setting this to True means the Request skips the deduplication step. The default is False
  • errback: the error-handling method, called if an error occurs while processing the request
  • flags: flags for the request, which can be used for logging or similar bookkeeping
  • cb_kwargs: extra arguments for the callback method, passed as a dict
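
A hedged example that constructs a Request using several of these parameters (the callbacks parse_detail and handle_error are hypothetical placeholders):

import scrapy

# hypothetical callbacks, shown only to illustrate the parameters
def parse_detail(response, source):
    print(source, response.status)

def handle_error(failure):
    print('request failed:', failure)

req = scrapy.Request(
    url='https://www.httpbin.org/get',
    callback=parse_detail,
    method='GET',
    headers={'User-Agent': 'Mozilla/5.0'},
    cookies={'name': 'wangcai'},
    meta={'page': 1},              # readable later as response.meta['page']
    priority=10,                   # higher value = scheduled earlier
    dont_filter=False,             # keep duplicate filtering enabled
    errback=handle_error,
    cb_kwargs={'source': 'demo'},  # passed to the callback as keyword arguments
)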

Response

A Response is what the Downloader produces after executing a Request; it represents the response to the HTTP request.

Its attributes:

  • url: the Request URL
  • status: the Response status code, 200 if the request succeeded
  • headers: the response headers, as a dict
  • body: the Response body. This is usually the page source obtained from the request, e.g. HTML or a JSON string, but note that it is of type bytes
  • request: the Request object this Response corresponds to
  • certificate: a twisted.internet.ssl.Certificate object, usually representing an SSL certificate
  • ip_address: an ipaddress.IPv4Address or ipaddress.IPv6Address object, the IP address of the server
  • protocol: the protocol that was used to download the response, for instance "HTTP/1.0", "HTTP/1.1", "h2"

Methods:

urljoin(): a URL helper method; pass in a URL relative to the current page and it returns the absolute address
follow/follow_all: methods that generate follow-up Requests from a URL; unlike constructing a Request directly, the URL they accept may be relative and does not have to be absolute
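
As a usage sketch, follow makes pagination shorter than the urljoin + Request combination used earlier, because it accepts the relative href directly (assuming the same .pager .next markup as quotes.toscrape.com):

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {'text': quote.css('.text::text').get()}
        # follow accepts a relative URL and builds the absolute Request itself
        next_page = response.css('.pager .next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)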

Architecture references:

https://cuiqingcai.com/31047.html
https://www.osgeo.cn/scrapy/intro/tutorial.html
