MiniCrawler
GitHub Path:
https://github.com/LixinZhang/miniCrowler
Introduction:
- MiniCrawler is a simple web crawler implemented in Python.
- A thread pool is used to speed up page fetching.
- You can configure the crawler by modifying `config.py`, then start the crawling job with `python run.py`.
- Fetched web pages are stored in the `pages` folder.
- `check_status.py` helps you check the job's status, as follows:

    Rank  Hostname          Times
    ----------------------------------------
    1     buaa.edu.cn       40
    2     baixing.com       32
    3     cnblogs.com       29
    4     hao123.com        5
    5     xinhuanet.com     2
    6     visionplaza.cn    2
    7     people.com.cn     2
    8     org.cn            2
    9     news.cn           2
    10    most.gov.cn       2

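A ranking like the one above could be produced by counting how often each hostname appears among the fetched URLs. The following is a minimal sketch, not the actual contents of `check_status.py`: the function names `rank_hosts` and `format_report` are hypothetical, the input is assumed to be a plain list of URLs, and the code targets Python 3.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_hosts(urls, top=10):
    """Return (hostname, count) pairs for the most frequent hosts.

    Hypothetical helper: assumes `urls` is an iterable of absolute URLs.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the URL has no scheme/host
        if host:
            counts[host] += 1
    return counts.most_common(top)


def format_report(ranking):
    """Render the ranking as a plain-text table with three columns."""
    row = "{:<6}{:<26}{}"
    lines = [row.format("Rank", "Hostname", "Times"), "-" * 40]
    for rank, (host, times) in enumerate(ranking, start=1):
        lines.append(row.format(rank, host, times))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        "http://www.cnblogs.com/a",
        "http://www.cnblogs.com/b",
        "http://www.hao123.com/",
    ]
    print(format_report(rank_hosts(sample)))
```

Note that this sketch counts full hostnames, so `www.cnblogs.com` and `cnblogs.com` would be tallied separately. The table above appears to group by a shortened domain; how the project derives that is not shown here.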
More Detail
You can find more details in my Chinese blog post, "Python 多线程抓取网页" (Fetching Web Pages with Multiple Threads in Python).
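For readers who do not read Chinese, the general idea of fetching pages with a thread pool can be sketched as follows. This is an illustration under stated assumptions, not the project's own code: it uses Python 3's standard `concurrent.futures` module rather than whatever pool MiniCrawler implements, and the names `fetch`, `fetch_all`, `fetch_one`, and `max_workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads.

    Returns a dict mapping each URL to its page bytes, or to the
    exception raised while fetching it.
    """
    urls = list(urls)  # allow generators; we iterate twice below

    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception as exc:  # one bad URL should not stop the job
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Because page fetching is dominated by network waiting, the worker threads spend most of their time blocked on I/O, which is why a pool of threads can shorten the total crawl time even in Python. Passing a different `fetch_one` makes it easy to try the function without network access.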