19. Starting spiders from a program through the Scrapy API

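Scrapy is built on the Twisted asynchronous networking library, so a program that starts spiders through the API has to run them inside the Twisted reactor.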

1. Approach 1: using the CrawlerProcess class

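The CrawlerProcess class (scrapy.crawler.CrawlerProcess) starts the Twisted reactor for you, configures logging, and arranges for the reactor to shut down automatically.

You can pass settings to CrawlerProcess when you create it, then call its crawl method with the spider class you want to run.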

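```
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess(
        {
            "USER_AGENT": "Mozilla/5.0 ....",
        }
    )
    process.crawl(MySpider)  # MySpider: placeholder for your spider class
    process.start()  # blocks here until the crawl finishes
```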

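Alternatively, pass the project's settings to CrawlerProcess when you create it (get_project_settings loads them) and give the crawl method the spider's name instead of its class.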

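```
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    process.crawl("spider_name")  # placeholder: the spider's name attribute
    process.start()
```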

2. Approach 2: using CrawlerRunner

With CrawlerRunner you must shut down the Twisted reactor yourself once the spider has finished. To do that, add a callback to the Deferred that CrawlerRunner.crawl returns.

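```
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

if __name__ == "__main__":
    # Configure the format of the log output
    configure_logging({"LOG_FORMAT": "%(levelname)s:%(message)s"})
    runner = CrawlerRunner()
    d = runner.crawl(MySpider)  # schedule the spider; MySpider is a placeholder for your spider class
    d.addBoth(lambda _: reactor.stop())  # stop the reactor whether the crawl succeeds or fails
    reactor.run()  # blocks until reactor.stop() is called
```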

3. Running multiple spiders in one process

3.1 Using CrawlerProcess

```
import scrapy
from scrapy.crawler import CrawlerProcess


class Myspider_1(scrapy.Spider):
    ...


class Myspider_2(scrapy.Spider):
    ...


process = CrawlerProcess()
process.crawl(Myspider_1)
process.crawl(Myspider_2)
process.start()
```

3.2 Using CrawlerRunner

3.2.1 First approach: run the spiders at the same time and wait for all of them with join()

```
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class Myspider_1(scrapy.Spider):
    ...


class Myspider_2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()
runner.crawl(Myspider_1)
runner.crawl(Myspider_2)
d = runner.join()  # fires once every scheduled crawl has finished
d.addBoth(lambda _: reactor.stop())
reactor.run()
```

3.2.2 Second approach: run the spiders one after another by chaining Deferreds

```
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class Myspider_1(scrapy.Spider):
    ...


class Myspider_2(scrapy.Spider):
    ...


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Myspider_1)  # wait for the first spider to finish
    yield runner.crawl(Myspider_2)  # then run the second one
    reactor.stop()


crawl()
reactor.run()
```

posted @ 2020-06-17 11:29 Norni
