Running Scrapy from a Script
Documentation: https://www.osgeo.cn/scrapy/topics/practices.html
1. scrapy.crawler.CrawlerProcess
Scrapy is built on top of the Twisted asynchronous networking framework, so it needs to run inside the Twisted reactor. You can run your spiders with the scrapy.crawler.CrawlerProcess class, which starts a Twisted reactor for you and configures logging and shutdown handlers. This is the class used by all Scrapy commands.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished
Use get_project_settings to get a Settings instance populated with your project settings.
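If the script does not live inside a Scrapy project (so there is no scrapy.cfg for get_project_settings to find), you can instead pass settings to CrawlerProcess as a plain dict and run a spider class defined in the script itself. A minimal sketch; MySpider, the quotes.toscrape.com start URL, and the output file name are placeholders, and the FEEDS setting assumes Scrapy 2.1 or later:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # A placeholder spider used only to illustrate the API.
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Yield one item per quote on the page.
        for quote in response.css('div.quote span.text::text'):
            yield {'text': quote.get()}

# Settings are passed as a dict instead of loading project settings.
# FEEDS (Scrapy >= 2.1) writes the scraped items to a JSON file.
process = CrawlerProcess(settings={
    'FEEDS': {'items.json': {'format': 'json'}},
})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished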
2. scrapy.crawler.CrawlerRunner
With this class, the reactor should be run explicitly after scheduling your spiders. Using CrawlerRunner instead of CrawlerProcess is recommended if your application is already using Twisted and you want to run Scrapy in the same reactor.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished
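CrawlerRunner also accepts a Settings instance, so inside a Scrapy project you can reuse the project settings just as with CrawlerProcess, and pass a spider name instead of a class. A brief sketch, assuming the script runs inside a project that defines a 'followall' spider as in the first example:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
# Reuse the project's settings.py instead of ad-hoc settings.
runner = CrawlerRunner(get_project_settings())

# A string is resolved through the project's spider loader,
# exactly as in the CrawlerProcess example above.
d = runner.crawl('followall', domain='scrapinghub.com')
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until the crawl is finished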
3. Running multiple spiders in the same process
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API. For example, with CrawlerProcess:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
The same example, but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
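One caveat applies to all of the scripts above: Twisted's reactor cannot be restarted, so reactor.run() or process.start() can only be called once per process. A common workaround, sketched below with a hypothetical run_spider helper, is to launch each crawl in a fresh process via the standard library's multiprocessing module:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess

def _crawl(spider_cls):
    # Runs in a child process, so it gets a fresh Twisted reactor.
    process = CrawlerProcess()
    process.crawl(spider_cls)
    process.start()

def run_spider(spider_cls):
    # Hypothetical helper: isolates each crawl in its own process
    # so the parent script can trigger crawls repeatedly.
    p = Process(target=_crawl, args=(spider_cls,))
    p.start()
    p.join()

With this helper, run_spider(MySpider1) can be called as many times as needed from a plain, non-Twisted script.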