Scrapy怎样同时运行多个爬虫?
默认情况下,当你运行 scrapy crawl 命令的时候,scrapy只能在单个进程里面运行一个爬虫。然后Scrapy运行方式除了采用命令行式的运行方式以外还可以使用API的方式来运行爬虫,而采用API的方式运行的爬虫是支持运行多个爬虫的。
下面的案例是运行多个爬虫:
import scrapy from scrapy.crawler import CrawlerProcess class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2(scrapy.Spider): # Your second spider definition ... process = CrawlerProcess() # 初始化事件循环 process.crawl(MySpider1) # 将爬虫类方式事件循环 process.crawl(MySpider2) # 将爬虫类方式事件循环 process.start() # the script will block here until all crawling jobs are finished
此外采用 CrawlerRunner 也是可行的:
import scrapy from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2(scrapy.Spider): # Your second spider definition ... configure_logging() runner = CrawlerRunner() runner.crawl(MySpider1) runner.crawl(MySpider2) d = runner.join() d.addBoth(lambda _: reactor.stop()) reactor.run() # the script will block here until all crawling jobs are finished
deferreds的方式来运行:
from twisted.internet import reactor, defer from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging class MySpider1(scrapy.Spider): # Your first spider definition ... class MySpider2(scrapy.Spider): # Your second spider definition ... configure_logging() runner = CrawlerRunner() @defer.inlineCallbacks def crawl(): yield runner.crawl(MySpider1) yield runner.crawl(MySpider2) reactor.stop() crawl() reactor.run() # the script will block here until the last crawl call is finished
更多细节参考: