
Multiprocess Crawlers

A Brief Introduction to Multiprocessing

A process is essentially a program in execution. Normally one script file runs as a single process; with multiprocessing, that one script can spawn several processes, effectively running several programs at once.
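To see this concretely, here is a minimal sketch (the report helper is illustrative, not from the post) that spawns three processes and prints their process IDs; each child is an independent running program with its own PID.

import os
from multiprocessing import Process

def report():
    # Each worker runs in its own process, so it reports its own PID.
    print('worker pid:', os.getpid(), 'parent pid:', os.getppid())

if __name__ == '__main__':
    print('main pid:', os.getpid())
    workers = [Process(target=report) for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()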


Why Learn Multiprocessing

It improves crawler efficiency. Crawling is mostly I/O-bound (waiting on network responses), so several processes fetching in parallel can cut the total running time substantially.


The Difference Between Multiprocessing and Multithreading

Factory ==> Workshop ==> Worker

Think of the computer as a factory: each process is a workshop with its own independent memory, and each thread is a worker inside one workshop, sharing that workshop's memory with its co-workers.
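A quick sketch of the practical consequence of the analogy: a thread (worker in the same workshop) mutates shared memory, while a child process (a separate workshop) only changes its own copy.

from multiprocessing import Process
from threading import Thread

counter = {'value': 0}

def increment():
    counter['value'] += 1

if __name__ == '__main__':
    # A thread shares the main process's memory, so the change is visible here.
    t = Thread(target=increment)
    t.start()
    t.join()
    print('after thread:', counter['value'])    # 1

    # A child process increments its own copy; the parent's copy is untouched.
    p = Process(target=increment)
    p.start()
    p.join()
    print('after process:', counter['value'])   # still 1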


How to Use Multiprocessing

from multiprocessing import Pool   # process pool from the standard library

pool = Pool(processes=4)           # create a pool of 4 worker processes
pool.map(func, iterable)           # apply func to each item of iterable in parallel
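As a minimal runnable version of those three lines, with a toy square function standing in for a real task (note the __main__ guard, which is required on Windows):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(processes=4) as pool:                # a pool of 4 worker processes
        results = pool.map(square, range(10))      # distribute items across workers
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]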
 

Performance Comparison

Target URL: https://www.qiushibaike.com/8hr/page/1/

import re
import time
from multiprocessing import Pool

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}

def re_scraper(url):
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # The original code used the identical pattern for both counts, which made
    # laughs and comments the same list. Assuming the page interleaves the two
    # counts in the same <i class="number"> tag, split the matches alternately.
    numbers = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
    laughs, comments = numbers[::2], numbers[1::2]
    infos = list()
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(i) for i in range(1, 36)]

    # Serial baseline: fetch the 35 pages one after another.
    start_1 = time.time()
    for url in urls:
        re_scraper(url)
    end_1 = time.time()
    print('Serial crawler time:', end_1 - start_1)

    # The same job spread across 2 worker processes.
    start_2 = time.time()
    with Pool(processes=2) as pool:
        pool.map(re_scraper, urls)
    end_2 = time.time()
    print('2-process crawler time:', end_2 - start_2)

    # And across 4 worker processes.
    start_3 = time.time()
    with Pool(processes=4) as pool:
        pool.map(re_scraper, urls)
    end_3 = time.time()
    print('4-process crawler time:', end_3 - start_3)

 


Output:

[Running] python "f:\WWW\test_py\compare_test.py"
Serial crawler time: 14.95523715019226
2-process crawler time: 11.39123272895813
4-process crawler time: 4.0303635597229

[Done] exited with code=0 in 32.827 seconds

 

posted on 2018-12-24 01:14  XuCodeX