How to Improve Scrapy Crawling Efficiency
The earlier spiders effectively worked through requests one at a time; the settings below let Scrapy handle many requests concurrently.
Increase concurrency:
By default Scrapy performs 16 concurrent requests (CONCURRENT_REQUESTS = 16), and this can be raised. In settings.py, set CONCURRENT_REQUESTS = 100 to allow 100 requests in flight at once.
Lower the log level:
Scrapy produces a large amount of log output while running. To reduce CPU usage, restrict logging to INFO or ERROR. In settings.py: LOG_LEVEL = 'INFO'
Disable cookies:
Unless cookies are actually needed, disable them during crawling to reduce CPU usage and speed things up. In settings.py: COOKIES_ENABLED = False
Disable retries:
Re-requesting failed HTTP responses (retrying) slows the crawl down, so retries can be turned off. In settings.py: RETRY_ENABLED = False
Reduce the download timeout:
When crawling very slow links, a shorter download timeout lets stuck requests be abandoned quickly, which improves throughput. In settings.py: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout.
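Taken together, the tips above boil down to a few lines in settings.py; a minimal sketch (the values are the ones suggested above, and the example project's settings at the end of this post use a similar set):

# settings.py -- efficiency-related tweaks described above
CONCURRENT_REQUESTS = 100   # default is 16
LOG_LEVEL = 'INFO'          # or 'ERROR' to log even less
COOKIES_ENABLED = False     # skip cookie handling entirely
RETRY_ENABLED = False       # do not re-request failed responses
DOWNLOAD_TIMEOUT = 10       # give up on a download after 10 seconds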
Example: crawling the full 校花网 site (www.521609.com)
Spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from xiaohuawang.items import XiaohuawangItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'http://www.521609.com/gaozhongxiaohua/',
        'http://www.521609.com/xiaocaoshuaige/',
        'http://www.521609.com/zhuankexiaohua/',
        'http://www.521609.com/daxuemeinv/',
        'http://www.521609.com/meinvxiaohua/',
        'http://www.521609.com/xiaoyuanmeinv/',
        'http://www.521609.com/qingchunmeinv/',
    ]

    def parse(self, response: HtmlResponse):
        # Collect the image URLs and titles on the current list page
        img_url_list = response.xpath('//div[@class="index_img list_center"]/ul/li//img/@src').extract()
        img_title_list = response.xpath('//div[@class="index_img list_center"]/ul/li//img/@alt').extract()
        # img_title_list = [i.encode('iso-8859-1').decode('gbk') for i in img_title_list]
        img_url_list = ['http://www.521609.com' + i for i in img_url_list]
        item = XiaohuawangItem()
        item['image_urls'] = img_url_list
        yield item
        # The "next page" link is the third-from-last entry in the pager
        next_url = response.xpath('//div[@class="listpage"]/ol/li[position() = last() - 2]/a/@href').extract_first()
        if next_url:
            prefix = response.url.rsplit('/', 1)[0]
            next_url = prefix + '/' + next_url
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)
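The spider is normally started with scrapy crawl img from the project directory. For completeness, here is a minimal driver-script sketch; the project name xiaohuawang matches the imports above, but run.py itself and the module path xiaohuawang.spiders.img are assumptions about the project layout:

# run.py -- hypothetical helper placed next to scrapy.cfg in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from xiaohuawang.spiders.img import ImgSpider  # assumed spider module path

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl(ImgSpider)
process.start()  # blocks until the crawl finishes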
Item
import scrapy


class XiaohuawangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()   # set by the spider; read by ImagesPipeline as the download list
    images = scrapy.Field()       # populated by ImagesPipeline with per-image download results
    image_paths = scrapy.Field()  # filled in by item_completed() in the pipeline below
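For reference, after the pipeline below has processed an item, the three fields hold roughly the following data (the values are purely illustrative, and the paths are relative to IMAGES_STORE):

# Illustrative shape of a fully processed item (not real output)
example_item = {
    'image_urls': ['http://www.521609.com/uploads/....jpg'],                       # set by the spider
    'images': [{'url': '...', 'path': 'full/<sha1>.jpg', 'checksum': '...'}],      # added by ImagesPipeline
    'image_paths': ['full/<sha1>.jpg'],                                            # added by item_completed()
}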
Pipeline
from scrapy.pipelines.images import ImagesPipeline  # if a "No module named PIL" error is raised, run: pip install Pillow
from scrapy.exceptions import DropItem
from scrapy import Request


class XiaohuawangPipeline(object):
    def process_item(self, item, spider):
        return item


class ImgDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request per collected image URL
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the paths of the images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
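By default the files are saved as full/<SHA1 of the URL>.jpg. If more readable names are wanted, ImagesPipeline also lets you override file_path(). A minimal sketch, not part of the project above, assuming the spider additionally stores the alt texts in a hypothetical image_titles field and that the titles are safe to use as file names (duplicate titles would overwrite each other):

import os
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class NamedImgDownloadPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pair every URL with its title and carry the title in request.meta
        titles = item.get('image_titles', [])  # hypothetical field, not declared in items.py above
        for image_url, title in zip(item['image_urls'], titles):
            yield Request(image_url, meta={'title': title})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the page title instead of the URL hash;
        # fall back to the URL's basename when no title is available
        title = request.meta.get('title') or os.path.basename(request.url)
        return f'full/{title}.jpg'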
Settings
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'xiaohuawang.pipelines.ImgDownloadPipeline': 1,
}

# Directory where downloaded images are stored
IMAGES_STORE = r'images'
# Expiration in days: images fetched within the last 90 days are not re-downloaded
IMAGES_EXPIRES = 90

LOG_LEVEL = 'WARNING'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 20
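Because IMAGES_STORE is a relative path here, the images land in an images/ folder under whatever directory the crawl is launched from; with the default naming they go into a full/ subfolder and are named by the SHA1 hash of the image URL, roughly like this (hash illustrative):

images/
    full/
        0a1b2c3d4e5f67890123456789abcdef01234567.jpg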