MiniCrawler

GitHub repository:

https://github.com/LixinZhang/miniCrowler

Introduction:

  • MiniCrawler is a simple web crawler written in Python.
  • It uses a thread pool to speed up page fetching.
  • To configure the crawler, edit config.py; to start a crawl, run python run.py.
  • Fetched web pages are stored in the pages folder.
  • check_status.py reports the job's status, for example:
Rank            Hostname        Times   
----------------------------------------
   1             buaa.edu.cn        40  
   2             baixing.com        32  
   3             cnblogs.com        29  
   4              hao123.com         5  
   5           xinhuanet.com         2  
   6          visionplaza.cn         2  
   7           people.com.cn         2  
   8                  org.cn         2  
   9                 news.cn         2  
  10             most.gov.cn         2
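
A report like the one above can be produced by tallying the hostnames of the fetched URLs. The following is only a minimal sketch of that idea, not the actual check_status.py: the function names rank_hosts and format_report are hypothetical, the input is assumed to be a plain list of URLs, and it counts full hostnames rather than reproducing whatever domain grouping the real script applies.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_hosts(urls, top=10):
    """Return the `top` most frequent hostnames as (hostname, count) pairs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:  # skip strings that have no network location
            counts[host] += 1
    return counts.most_common(top)


def format_report(ranking):
    """Render the ranking as a plain-text table (hypothetical layout)."""
    row = "{:<8}{:>20}{:>8}"
    lines = [row.format("Rank", "Hostname", "Times"), "-" * 36]
    for rank, (host, times) in enumerate(ranking, start=1):
        lines.append(row.format(rank, host, times))
    return "\n".join(lines)
```

Calling print(format_report(rank_hosts(urls))) on a list of crawled URLs prints one row per hostname, most frequent first.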

More Details

You can find more details in my Chinese-language blog post, "Python 多线程抓取网页" (Fetching Web Pages with Multiple Threads in Python).
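
For readers who only want the gist of fetching pages with a thread pool, here is a rough sketch using Python 3's standard library. It is an assumption-laden illustration rather than MiniCrawler's own code: the names fetch and fetch_all and the default of 8 workers are made up for this example, and the project may well implement its pool differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads.

    Returns a dict mapping each URL to its body, or to the exception
    raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front; the pool limits concurrency.
        future_to_url = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the rest
                results[url] = exc
    return results
```

Because downloading is I/O-bound, the threads spend most of their time waiting on the network, which is why a pool speeds up fetching even under Python's global interpreter lock.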

Posted 2014-01-04 22:02 by 糖拌咸鱼