How to Improve Scrapy Crawling Efficiency
The earlier spiders effectively worked through requests one at a time; the settings below let Scrapy handle many requests concurrently.
Increase concurrency:
By default Scrapy performs 16 concurrent requests (CONCURRENT_REQUESTS = 16), and this can be raised. In settings.py, set CONCURRENT_REQUESTS = 100 to allow 100 requests in flight at once.
Lower the log level:
Scrapy produces a large amount of log output while running. To reduce CPU usage, restrict logging to INFO or ERROR. In settings.py: LOG_LEVEL = 'INFO'
Disable cookies:
Unless cookies are actually needed, disable them during crawling to reduce CPU usage and speed things up. In settings.py: COOKIES_ENABLED = False
Disable retries:
Re-requesting failed HTTP responses (retrying) slows the crawl down, so retries can be turned off. In settings.py: RETRY_ENABLED = False
Reduce the download timeout:
When crawling very slow links, a shorter download timeout lets stuck requests be abandoned quickly, which improves throughput. In settings.py: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout.
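Taken together, the tips above boil down to a few lines in settings.py; a minimal sketch (the values are the ones suggested above, and the example project's settings at the end of this post use a similar set):

# settings.py -- efficiency-related tweaks described above
CONCURRENT_REQUESTS = 100   # default is 16
LOG_LEVEL = 'INFO'          # or 'ERROR' to log even less
COOKIES_ENABLED = False     # skip cookie handling entirely
RETRY_ENABLED = False       # do not re-request failed responses
DOWNLOAD_TIMEOUT = 10       # give up on a download after 10 seconds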
Example: crawling the full 校花网 site (www.521609.com)
Spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from xiaohuawang.items import XiaohuawangItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'http://www.521609.com/gaozhongxiaohua/',
        'http://www.521609.com/xiaocaoshuaige/',
        'http://www.521609.com/zhuankexiaohua/',
        'http://www.521609.com/daxuemeinv/',
        'http://www.521609.com/meinvxiaohua/',
        'http://www.521609.com/xiaoyuanmeinv/',
        'http://www.521609.com/qingchunmeinv/',
    ]

    def parse(self, response: HtmlResponse):
        # Collect the image URLs and titles on the current list page
        img_url_list = response.xpath('//div[@class="index_img list_center"]/ul/li//img/@src').extract()
        img_title_list = response.xpath('//div[@class="index_img list_center"]/ul/li//img/@alt').extract()
        # img_title_list = [i.encode('iso-8859-1').decode('gbk') for i in img_title_list]
        img_url_list = ['http://www.521609.com' + i for i in img_url_list]
        item = XiaohuawangItem()
        item['image_urls'] = img_url_list
        yield item
        # The "next page" link is the third-from-last entry in the pager
        next_url = response.xpath('//div[@class="listpage"]/ol/li[position() = last() - 2]/a/@href').extract_first()
        if next_url:
            prefix = response.url.rsplit('/', 1)[0]
            next_url = prefix + '/' + next_url
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
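The spider is normally started with scrapy crawl img from the project directory. For completeness, here is a minimal driver-script sketch; the project name xiaohuawang matches the imports above, but run.py itself and the module path xiaohuawang.spiders.img are assumptions about the project layout:

# run.py -- hypothetical helper placed next to scrapy.cfg in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from xiaohuawang.spiders.img import ImgSpider  # assumed spider module path

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl(ImgSpider)
process.start()  # blocks until the crawl finishes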
Item
import scrapy


class XiaohuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()   # set by the spider; read by ImagesPipeline as the download list
    images = scrapy.Field()       # populated by ImagesPipeline with per-image download results
    image_paths = scrapy.Field()  # filled in by item_completed() in the pipeline below
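For reference, after the pipeline below has processed an item, the three fields hold roughly the following data (the values are purely illustrative, and the paths are relative to IMAGES_STORE):

# Illustrative shape of a fully processed item (not real output)
example_item = {
    'image_urls': ['http://www.521609.com/uploads/....jpg'],                       # set by the spider
    'images': [{'url': '...', 'path': 'full/<sha1>.jpg', 'checksum': '...'}],      # added by ImagesPipeline
    'image_paths': ['full/<sha1>.jpg'],                                            # added by item_completed()
}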
Pipeline
from scrapy.pipelines.images import ImagesPipeline  # if a "No module named PIL" error is raised, run: pip install Pillow
from scrapy.exceptions import DropItem
from scrapy import Request


class XiaohuawangPipeline(object):
    def process_item(self, item, spider):
        return item


class ImgDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request per collected image URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
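By default the files are saved as full/<SHA1 of the URL>.jpg. If more readable names are wanted, ImagesPipeline also lets you override file_path(). A minimal sketch, not part of the project above, assuming the spider additionally stores the alt texts in a hypothetical image_titles field and that the titles are safe to use as file names (duplicate titles would overwrite each other):

import os
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class NamedImgDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pair every URL with its title and carry the title in request.meta
        titles = item.get('image_titles', [])  # hypothetical field, not declared in items.py above
        for image_url, title in zip(item['image_urls'], titles):
            yield Request(image_url, meta={'title': title})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the page title instead of the URL hash;
        # fall back to the URL's basename when no title is available
        title = request.meta.get('title') or os.path.basename(request.url)
        return f'full/{title}.jpg'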
Settings
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'xiaohuawang.pipelines.ImgDownloadPipeline': 1,
}

# Directory where downloaded images are stored
IMAGES_STORE = r'images'
# Expiration in days: images fetched within the last 90 days are not re-downloaded
IMAGES_EXPIRES = 90

LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 20
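Because IMAGES_STORE is a relative path here, the images land in an images/ folder under whatever directory the crawl is launched from; with the default naming they go into a full/ subfolder and are named by the SHA1 hash of the image URL, roughly like this (hash illustrative):

images/
    full/
        0a1b2c3d4e5f67890123456789abcdef01234567.jpg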