Python multiprocessing: multiprocessing.Pool()

Posted on 2017-09-14 15:26 by 蓝空

1. The multiprocessing.Pool() class

class multiprocessing.pool.Pool([processes[, initializer[, initargs[, maxtasksperchild[, context]]]]])
Purpose: A process pool object which controls a pool of worker processes to which jobs can be submitted. It supports asynchronous results with timeouts and callbacks and has a parallel map implementation.
Parameters:
processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.

If initializer is not None then each worker process will call initializer(*initargs) when it starts.

maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool.

context can be used to specify the context used for starting the worker processes. Usually a pool is created using the function multiprocessing.Pool() or the Pool() method of a context object. In both cases context is set appropriately.

Note that the methods of the pool object should only be called by the process which created the pool.
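
A minimal sketch of how these parameters fit together (init_worker, square and the numbers below are made up for illustration):

from multiprocessing import Pool

def init_worker(tag):
    # initializer: runs once in every worker process when it starts, with initargs as its arguments
    print('worker started for', tag)

def square(x):
    return x * x

if __name__ == '__main__':
    # 4 worker processes, each replaced by a fresh one after completing 10 tasks (maxtasksperchild)
    pool = Pool(processes=4,
                initializer=init_worker,
                initargs=('demo',),
                maxtasksperchild=10)
    print(pool.map(square, range(8)))
    pool.close()
    pool.join()
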
For a Chinese translation of the Pool() documentation, see: http://www.cnblogs.com/congbo/archive/2012/08/23/2652490.html

About multiprocessing:
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.
Note in particular that this spawns multiple processes, not threads. Because of that, if the first argument to Pool() is larger than the number of CPU cores, performance may actually get worse; it is worth measuring this yourself.
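
One way to measure this is to time the same CPU-bound job with different pool sizes; a rough sketch, where burn and the numbers are made up for illustration:

import os
import time
from multiprocessing import Pool

def burn(n):
    # CPU-bound busy loop, so workers beyond the number of cores cannot speed it up
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    cores = os.cpu_count()
    print('CPU cores:', cores)
    for workers in (2, cores, cores * 4):
        start = time.time()
        pool = Pool(workers)
        pool.map(burn, [2000000] * 16)
        pool.close()
        pool.join()
        print(workers, 'workers:', round(time.time() - start, 2), 'seconds')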

For more information about multiprocessing, please check the Python API documentation.

2. Examples and usage

This section mainly covers the map function, which handles iterating over a sequence, passing the arguments, and collecting the results all in one call.
First, import the library:

from multiprocessing.dummy import Pool
pool = Pool(4)
results = pool.map(crawl_function, url_list)   # crawl_function and url_list are your own crawler function and list of URLs
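
Note that the import here is multiprocessing.dummy, which provides a thread-based Pool with the same interface as the process-based multiprocessing.Pool; for I/O-bound crawling either one works, and switching is just a matter of changing the import. A small sketch, with fetch_page standing in for your own crawler function:

import requests
from multiprocessing.dummy import Pool as ThreadPool   # thread pool: good for I/O-bound downloads
# from multiprocessing import Pool as ProcessPool      # process pool: good for CPU-bound work

def fetch_page(url):
    return requests.get(url).status_code

if __name__ == '__main__':
    urls = ['http://tieba.baidu.com/p/3522395718?pn=' + str(i) for i in range(1, 5)]
    pool = ThreadPool(4)
    print(pool.map(fetch_page, urls))
    pool.close()
    pool.join()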

This article uses a simple example to show how to use the map function and how this approach compares with the ordinary sequential approach.

import time
import requests
from multiprocessing.dummy import Pool

def getsource(url):
    html = requests.get(url)   # download the page; the response object is not used further here

urls=[]
for i in range(1,21):
    newpage='http://tieba.baidu.com/p/3522395718?pn='+str(i)
    urls.append(newpage)

timex=time.time()  # test 1: plain sequential loop
for i in urls:
    getsource(i)
print (time.time()-timex)

# Output of test 1:
#10.2820000648 


time1=time.time()  # test 2: using pool.map
pool=Pool(4)
results=pool.map(getsource,urls)
pool.close()
pool.join()
print (time.time()-time1)

# Output of test 2:
#3.23600006104

Comparing the two approaches, test 2 is clearly much faster than test 1.

A few notes on the program:
In test 1:
for i in urls:
    getsource(i)   # loop over every URL in the urls list and call getsource on each one in turn

In test 2:
pool=Pool(4)   # create a pool of 4 workers; choose this number according to the number of CPU cores on your machine
results=pool.map(getsource,urls)   # map calls the user-defined function on every element of the list and collects the results
pool.close()   # close the pool so that no more tasks can be submitted to it
pool.join()    # wait until all of the workers (4 of them) have finished
print (time.time()-time1)   # print the elapsed time
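
The close()/join() pattern above can also be written with a context manager, which terminates the pool automatically when the block ends. A small sketch, reusing Pool, getsource and urls from the example above:

import time

time2 = time.time()
with Pool(4) as pool:                      # leaving the with-block calls pool.terminate() automatically
    results = pool.map(getsource, urls)    # map blocks until every URL has been processed
print(time.time() - time2)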

Other commonly used Pool methods:

from multiprocessing import Pool
import time

def f(x):                                 # a user-defined function f
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes

    result = pool.apply_async(f, (10,))   # evaluate "f(10)" asynchronously
    print(result.get(timeout=1))          # prints "100"; get() waits at most 1 second for the result

    print(pool.map(f, range(10)))         # prints "[0, 1, 4,..., 81]"

    it = pool.imap(f, range(10))          # imap applies f lazily and returns an iterator
    print(next(it))                       # prints "0"; next() fetches the results one at a time
    print(next(it))                       # prints "1"
    print(it.next(timeout=1))             # prints "4" unless your computer is *very* slow

    result = pool.apply_async(time.sleep, (10,))
    print(result.get(timeout=1))          # raises multiprocessing.TimeoutError
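
The asynchronous methods also accept callbacks, which the pool runs in the parent process as each result becomes ready. A minimal sketch, where the callback names collect and log_error are made up for illustration:

from multiprocessing import Pool

def f(x):
    return x * x

def collect(value):
    # called in the parent process with each finished result
    print('got result:', value)

def log_error(exc):
    # called instead of collect if the task raises an exception
    print('task failed:', exc)

if __name__ == '__main__':
    pool = Pool(processes=4)
    for x in range(5):
        pool.apply_async(f, (x,), callback=collect, error_callback=log_error)
    pool.close()   # no more tasks will be submitted
    pool.join()    # wait until every task has finished and every callback has run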

Example reference: http://blog.csdn.net/winterto1990/article/details/47976105