Scrapy not only provides the scrapy crawl spider command to start a spider, it also exposes an API for starting spiders from your own script.
Scrapy is built on the Twisted asynchronous networking library, so the crawl has to run inside Twisted's reactor.
Two APIs are available for running spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.
scrapy.crawler.CrawlerProcess
This class starts the Twisted reactor internally, configures logging, and shuts the reactor down automatically when crawling finishes; it is the class used by all scrapy commands.
Example: running a single spider
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')

if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(QiushispiderSpider)  # or the spider name 'qiushiSpider'
    process.start()
The argument to process.crawl() can be either the spider name 'qiushiSpider' or the spider class QiushispiderSpider. Note that the name form is resolved through the spider loader, so it only works when the process has been given the project settings (see the sketch in the settings section below).
Started this way, the process does not use the project's settings file, as the empty "Overridden settings" entry in the log shows:
2019-05-27 14:39:57 [scrapy.crawler] INFO: Overridden settings: {}
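If the script lives outside a Scrapy project, or you only need a handful of overrides, CrawlerProcess also accepts a plain dict of settings. A minimal sketch reusing the spider from the example above; the two settings shown are illustrative, not required:

from scrapy.crawler import CrawlerProcess

# settings can be passed as a plain dict instead of a Settings object
process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',     # illustrative override
    'DOWNLOAD_DELAY': 1.0,   # illustrative override
})
process.crawl(QiushispiderSpider)
process.start()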
Loading the project settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(QiushispiderSpider)  # or 'qiushiSpider'
process.start()
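With the project settings in place, the name form also works, because the spider loader can now locate the spider class through the project's SPIDER_MODULES setting; with an empty CrawlerProcess() the lookup would fail. A minimal sketch under that assumption:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('qiushiSpider')  # resolved via the spider loader (SPIDER_MODULES)
process.start()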
Running multiple spiders
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
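crawl() also forwards any extra positional and keyword arguments to the spider's constructor, the same way scrapy crawl -a does on the command line. A sketch with a hypothetical ParamSpider and start_url parameter:

import scrapy
from scrapy.crawler import CrawlerProcess

class ParamSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate argument passing
    name = 'paramSpider'

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []

    def parse(self, response):
        self.logger.info('crawled %s', response.url)

process = CrawlerProcess()
process.crawl(ParamSpider, start_url='https://tianqi.2345.com/')
process.start()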
scrapy.crawler.CrawlerRunner
1. Gives finer control over the crawling process.
2. You start and stop the Twisted reactor explicitly yourself.
3. You have to add callbacks to the Deferred object returned by CrawlerRunner.crawl.
Example: running a single spider
import scrapy

class QiushispiderSpider(scrapy.Spider):
    name = 'qiushiSpider'
    # allowed_domains = ['qiushibaike.com']
    start_urls = ['https://tianqi.2345.com/']

    def start_requests(self):
        return [scrapy.Request(url=self.start_urls[0], callback=self.parse)]

    def parse(self, response):
        print('proxy simida')

if __name__ == '__main__':
    # test CrawlerRunner
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl(QiushispiderSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
configure_logging sets the log output format; unlike CrawlerProcess, CrawlerRunner does not set up logging for you, so the script calls it explicitly.
addBoth attaches a callback that stops the Twisted reactor once the crawl has finished, whether it succeeded or failed.
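Since runner.crawl returns an ordinary Twisted Deferred, you can also attach separate success and error handlers before the final stop. A sketch, reusing the runner and spider from the example above:

d = runner.crawl(QiushispiderSpider)
d.addCallback(lambda _: print('crawl finished cleanly'))       # success path
d.addErrback(lambda failure: print('crawl failed:', failure))  # error path
d.addBoth(lambda _: reactor.stop())  # runs either way, like a finally clause
reactor.run()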
Running multiple spiders
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
The same can also be written in an asynchronous style with inlineCallbacks; note that chaining the yields like this runs the spiders one after another rather than in parallel:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    ...

class MySpider2(scrapy.Spider):
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished