Scrapy Crawler Learning Series, Part 7: Solutions to Common Scrapy Problems

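1 Common errors

1.1 Error: ImportError: No module named win32api

Official reference: https://doc.scrapy.org/en/latest/faq.html#scrapy-crashes-with-importerror-no-module-named-win32api

The official reference links to the win32 package (pywin32); download and install it to fix the error.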

1.2 DEBUG: Forbidden by robots.txt: <GET https://www.baidu.com>

Official reference: https://doc.scrapy.org/en/latest/topics/settings.html#robotstxt-obey

Set ROBOTSTXT_OBEY = False in settings.py.
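
To make the log message more concrete, here is a minimal sketch of what a robots.txt rule does. It uses only the standard library's urllib.robotparser rather than Scrapy's own robots.txt handling, and the rules, bot name, and URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is barred from /private/.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "mybot" is an assumed user agent name.
allowed = parser.can_fetch("mybot", "https://www.example.com/index.html")
blocked = parser.can_fetch("mybot", "https://www.example.com/private/page.html")

print(allowed, blocked)
```

With ROBOTSTXT_OBEY = True, Scrapy drops requests that such rules disallow and logs the "Forbidden by robots.txt" message; setting it to False skips the check.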

1.3 XPath returns no results when scraping an XML document

Official reference: https://doc.scrapy.org/en/latest/faq.html#i-m-scraping-a-xml-document-and-my-xpath-selector-doesn-t-return-any-items

response.selector.remove_namespaces()
response.xpath("//link")

Normally there is no need to call remove_namespaces; try it only when the XPath returns no data.

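1.4 Garbled text in the response (wrong encoding)

Official reference: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.encoding

Scrapy takes the response encoding from sources 1 to 3 below, in that order; if the text is still garbled, fall back to 4:

1 The encoding passed as the encoding argument of the response constructor.

2 The encoding declared in the Content-Type HTTP header.

3 The encoding declared in the response body.

4 Re-decode the response you received yourself. This is the last resort.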

def parse(self, response):
    # The right encoding depends on the page; "gbk" is only an example.
    response = response.replace(encoding="gbk")

    # TODO: extract an item

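2 Common solutions

2.1 Emailing the number of scraped items to a given user

The official documentation covers sending email: https://doc.scrapy.org/en/latest/topics/email.html

A fellow blogger's post that uses Scrapy's mail module: http://blog.csdn.net/you_are_my_dream/article/details/60868329

I tried sending mail with Scrapy's mail module; the log reported success, but the message never arrived and I could not work out why, so I switched to smtplib. Modify pipelines.py as follows: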

class MailPipeline(object):
    def __init__(self):
        self.count = 0

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        self.count = self.count + 1
        # Always return the item; otherwise later pipelines receive nothing to process.
        return item

    def close_spider(self, spider):
        import smtplib
        from email.mime.text import MIMEText
        _user = "1072892917@qq.com"
        # Not the login password: this is the SMTP authorization code.
        # See http://blog.csdn.net/you_are_my_dream/article/details/60868329
        _pwd = "xxxxxxxx"
        _to = "1072892917@qq.com"

        msg = MIMEText("Test")
        msg["Subject"] = str(self.count)  # send the number of scraped items as the subject
        msg["From"] = _user
        msg["To"] = _to
        try:
            # See http://service.mail.qq.com/cgi-bin/help?subtype=1&no=167&id=28
            s = smtplib.SMTP_SSL("smtp.qq.com", 465)
            s.login(_user, _pwd)
            s.sendmail(_user, _to, msg.as_string())
            s.quit()
            print("Success!")
        except smtplib.SMTPException as e:
            print("Failed, %s" % e)

import json
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Modify settings.py as follows:

ITEM_PIPELINES = {
    # The numbers (302, 303) set the order: items pass through lower-numbered pipelines first.
    'quotesbot.pipelines.MailPipeline': 302,
    'quotesbot.pipelines.JsonWriterPipeline': 303,
}

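This way each item first passes through MailPipeline, which counts the items and sends the email when the spider closes, and then through JsonWriterPipeline, which persists it. You could also reorder the pipelines and send the persisted file as an email attachment.

Note: Scrapy's mail module is built on Twisted's mail module, so it supports asynchronous sending.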
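
For the attachment idea, here is a minimal sketch that builds such a message with the standard library's email.message.EmailMessage. It only constructs the message; sending it would still go through smtplib as in MailPipeline. The build_report helper, the addresses, and the sample bytes are hypothetical.

```python
from email.message import EmailMessage


def build_report(count, sender, recipient, attachment_bytes, filename="items.jl"):
    """Build a report email with the exported items attached (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "Scraped %d items" % count
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The crawl finished; the exported items are attached.")
    # Attach the exported file as generic binary data.
    msg.add_attachment(
        attachment_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
    return msg


# Placeholder addresses and content, for illustration only.
report = build_report(3, "sender@example.com", "receiver@example.com", b'{"a": 1}\n')
attachments = list(report.iter_attachments())
print(report["Subject"], attachments[0].get_filename())
```

In a pipeline, the bytes would be read from the exported file in close_spider, after the pipeline that writes the file has closed it.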

2.2 Using BeautifulSoup in Scrapy

Scrapy official reference: https://doc.scrapy.org/en/latest/faq.html#can-i-use-scrapy-with-beautifulsoup

bs4 official reference (English): https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

bs4 official reference (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

from bs4 import BeautifulSoup
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # use lxml to get decent HTML parsing speed
        soup = BeautifulSoup(response.text, 'lxml')
        yield {
            "url": response.url,
            "title": soup.h1.string
        }

2.3 Scraping an item whose attributes come from several pages rather than a single page

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-scrape-an-item-with-attributes-in-different-pages

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

The item is passed to the follow-up request through the request's meta, and the callback of the last request yields the finished item.

2.4 Scraping a page that requires login

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-simulate-a-user-login-in-my-spider

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        # (response.body is bytes, so compare against a bytes literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

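This uses FormRequest to submit the username and password and receive the server's response. Check the response: continue scraping if the login succeeded, and stop if it failed.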

2.5 Running a spider without creating a project

Official reference: https://doc.scrapy.org/en/latest/faq.html#can-i-run-a-spider-without-creating-a-project

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

2.6 The simplest way to store scraped data

Official reference: https://doc.scrapy.org/en/latest/faq.html#simplest-way-to-dump-all-my-scraped-items-into-a-json-csv-xml-file

scrapy crawl myspider -o items.json
scrapy crawl myspider -o items.csv
scrapy crawl myspider -o items.xml
scrapy crawl myspider -o items.jl

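This is the quickest method, but there is a catch: by default the JSON output escapes non-ASCII characters as \uXXXX sequences, so Chinese text is not readable in the file. We want UTF-8 output instead.

1. Set FEED_EXPORT_ENCODING = 'utf-8' in the settings.

2. See my post on exporting items in various formats: http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

Both approaches work. I recommend the second, since it is easier to extend.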
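
To see the difference between escaped and UTF-8 JSON output, here is a small sketch using only the standard json module. It does not reproduce Scrapy's exporter, and the sample item is made up.

```python
import json

item = {"title": "中文"}  # made-up sample item

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is, ready to be written as UTF-8.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```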

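2.7 Stopping the spider when a condition is met

Official reference: https://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

To stop after a given number of items has been scraped, you can use the following approach: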

# -*- coding: utf-8 -*-
import scrapy

from scrapy.exceptions import CloseSpider

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def __init__(self, *args, **kwargs):
        super(ToScrapeSpiderXPath, self).__init__(*args, **kwargs)
        self.count = 0        # number of items scraped so far
        self.max_count = 100  # maximum number of items to scrape

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            self.count = self.count + 1
            if self.count > self.max_count:
                raise CloseSpider('max_count_reached')
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

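You can also call self.crawler.stop().

Scrapy also ships a built-in extension that closes the spider when certain conditions are met; see https://doc.scrapy.org/en/latest/topics/extensions.html#module-scrapy.extensions.closespider

A brief summary of this extension's settings:

CLOSESPIDER_TIMEOUT: close the spider once it has been open for more than the given number of seconds.
CLOSESPIDER_ITEMCOUNT: close the spider once the given number of items has passed through the pipeline. Requests already in the downloader queue (up to CONCURRENT_REQUESTS) are still processed, so a few more items than the limit may be scraped.
CLOSESPIDER_PAGECOUNT: close the spider once the given number of responses has been crawled.
CLOSESPIDER_ERRORCOUNT: close the spider once the given number of errors has occurred.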
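
As a sketch, these settings could be placed in settings.py like this. The numbers are illustrative values chosen for the example, not recommendations.

```python
# settings.py (illustrative values only)
CLOSESPIDER_TIMEOUT = 3600    # close once the spider has been open for 3600 seconds
CLOSESPIDER_ITEMCOUNT = 100   # close once 100 items have passed through the pipeline
CLOSESPIDER_PAGECOUNT = 500   # close once 500 responses have been crawled
CLOSESPIDER_ERRORCOUNT = 10   # close once 10 errors have occurred
```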

If the built-in conditions do not meet your needs, you can write your own extension. See: https://doc.scrapy.org/en/latest/topics/extensions.html#writing-your-own-extension

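2.7 Avoiding getting banned

Official reference: https://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned

1 Keep a list of user agents and pick one from it for each request.

2 Disable cookies; some sites use cookies to identify bots.

3 Use a download delay (DOWNLOAD_DELAY); some sites limit the number of requests per unit of time and ban clients that send too many.

4 If possible, use Google cache instead of requesting the site directly.

5 Use a pool of IPs, for example ProxyMesh or scrapoxy.

6 Use a highly distributed downloader, for example Crawlera.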
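
The first tip can be sketched as a small downloader middleware. This is one possible shape, not an official Scrapy component: the class name and the USER_AGENT_LIST setting are hypothetical, and only the from_crawler and process_request hooks are standard.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical custom setting, not a built-in one.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy continue handling the request
```

To use it, the class would be registered in DOWNLOADER_MIDDLEWARES and USER_AGENT_LIST defined in settings.py; tips 2 and 3 correspond to the built-in settings COOKIES_ENABLED = False and DOWNLOAD_DELAY.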

2.8 Accepting arguments when starting a spider

Official reference: https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]

Then run the spider like this:

scrapy crawl myspider -a category=electronics

2.9 Modifying the pipeline to support export in multiple formats

Official reference: https://doc.scrapy.org/en/latest/topics/exporters.html#using-item-exporters

Blog reference (my own): http://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html

Project reference: https://github.com/zhaojiedi1992/ScrapyCnblogs

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

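import logging

from scrapy import signals
from scrapy.exporters import (
    CsvItemExporter,
    JsonItemExporter,
    JsonLinesItemExporter,
    MarshalItemExporter,
    PickleItemExporter,
    PprintItemExporter,
    XmlItemExporter,
)

logger = logging.getLogger(__name__)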
class BaseExportPipeLine(object):
    def __init__(self,**kwargs):
        self.files = {}
        self.exporter=kwargs.pop("exporter",None)
        self.dst=kwargs.pop("dst",None)
        self.option=kwargs
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open(self.dst, 'wb')
        self.files[spider] = file
        self.exporter = self.exporter(file,**self.option)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

# Options:
# 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; supported by all pipelines below
# 'export_empty_fields': False  whether to export empty fields; supported by all pipelines below
# 'encoding': 'utf-8'  output encoding; supported by all pipelines below
# 'indent': 1  indentation level; mainly used by JsonExportPipeline
# "item_element": "item"  name of each item's XML element, giving <item></item>; XmlExportPipeline only
# "root_element": "items"  name of the XML root element, giving <items> wrapping the items; XmlExportPipeline only
# "include_headers_line": True  whether to write the header row; CsvExportPipeline only
# "join_multivalued": ","  string used to join multi-valued fields; CsvExportPipeline only
# 'protocol': 2  pickle protocol; PickleExportPipeline only
# "dst": "items.json"  destination file
class JsonExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":JsonItemExporter,"dst":"items.json","encoding":"utf-8","indent":4,}
        super(JsonExportPipeline, self).__init__(**option)
class JsonLinesExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":JsonLinesItemExporter,"dst":"items.jl","encoding":"utf-8"}
        super(JsonLinesExportPipeline, self).__init__(**option)
class XmlExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":XmlItemExporter,"dst":"items.xml","item_element":"item","root_element":"items","encoding":'utf-8'}
        super(XmlExportPipeline, self).__init__(**option)
class CsvExportPipeline(BaseExportPipeLine):
    def __init__(self):
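        # Setting the separator via join_multivalued did not work in my tests
        # (it joins multi-valued fields rather than separating columns)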
        option={"exporter":CsvItemExporter,"dst":"items.csv","encoding":"utf-8","include_headers_line":True, "join_multivalued":","}
        super(CsvExportPipeline, self).__init__(**option)
class  PickleExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":PickleItemExporter,"dst":"items.pickle",'protocol':2}
        super(PickleExportPipeline, self).__init__(**option)
class  MarshalExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":MarshalItemExporter,"dst":"items.marsha"}
        super(MarshalExportPipeline, self).__init__(**option)
class  PprintExportPipeline(BaseExportPipeLine):
    def __init__(self):
        option={"exporter":PprintItemExporter,"dst":"items.pprint.jl"}
        super(PprintExportPipeline, self).__init__(**option)

posted on 2017-11-25 11:12 LinuxPanda


赵杰迪