An Open-Source Multithreaded Crawler Framework
It blends Scrapy's architectural ideas with Twisted's approach to task scheduling; for now it is only a condensed version. The source lives at
https://github.com/aware-why/multithreaded_crawler/ and ships with a demo.
Anyone interested is welcome to join in, and anyone with better ideas can open a pull request directly. Contributions are warmly welcome.
multithreaded_crawler
A condensed crawler framework built on the multithreaded model
Dependencies
At present, the framework depends on nothing beyond the modules in the Python standard library.
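To give a feel for what "nothing beyond the standard library" means in practice, here is a minimal, self-contained sketch of the multithreaded model the framework is named after: worker threads pull URLs from a shared queue, fetch them, and enqueue newly discovered links until a depth limit is reached. All names here (crawl, worker, fetch) are illustrative only, not the framework's actual API; see run.py for the real entry point.

```python
import re
import threading

try:                      # Python 3
    from queue import Queue
    from urllib.request import urlopen
except ImportError:       # Python 2.7, which the demo was tested on
    from Queue import Queue
    from urllib2 import urlopen

# Crude link extraction; a real spider would use a proper HTML parser.
LINK_RE = re.compile(br'href="(http[^"]+)"')


def fetch(url):
    """Download a page body, swallowing network errors for brevity."""
    try:
        return urlopen(url, timeout=10).read()
    except Exception:
        return b""


def worker(queue, seen, lock, max_depth):
    while True:
        url, depth = queue.get()
        try:
            body = fetch(url)
            print("depth=%d fetched %s (%d bytes)" % (depth, url, len(body)))
            if depth < max_depth:
                for link in LINK_RE.findall(body):
                    link = link.decode("ascii", "ignore")
                    with lock:            # de-duplicate across threads
                        if link in seen:
                            continue
                        seen.add(link)
                    queue.put((link, depth + 1))
        finally:
            queue.task_done()


def crawl(start_url, n_threads=5, max_depth=2):
    queue, seen, lock = Queue(), set([start_url]), threading.Lock()
    for _ in range(n_threads):
        t = threading.Thread(target=worker, args=(queue, seen, lock, max_depth))
        t.daemon = True       # workers die with the main thread
        t.start()
    queue.put((start_url, 0))
    queue.join()              # returns once every queued URL is processed


if __name__ == "__main__":
    crawl("http://sina.com.cn", n_threads=5, max_depth=2)
```

The Queue's join()/task_done() pairing lets the main thread block until the crawl drains without polling; the framework proper layers its Scrapy- and Twisted-inspired components on top of primitives like these.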
Usage
cd threaded_spider
python run.py --help
Running python run.py prints demo output: it crawls sina.com.cn with five threads and limits the crawl depth to 2 by default (tested on Python 2.7).
The threaded_spider directory also contains log files named like "spider.*.log", each produced by a corresponding python run.py --thread=* run.
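For example, assuming the log file name tracks the thread count as the pattern above suggests, a five-thread run would look like:

cd threaded_spider
python run.py --thread=5

with the demo output on the console and the run's log presumably written to spider.5.log.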
Community
QQ Group: 4704309
Your contributions are welcome.
Heaven helps those who help themselves; blessed by Heaven, there is good fortune and nothing that is not favorable.