8-[多线程] 进程池线程池
1、为甚需要进程池,线程池
介绍 官网:https://docs.python.org/dev/library/concurrent.futures.html concurrent.futures模块提供了高度封装的异步调用接口 ThreadPoolExecutor:线程池,提供异步调用 ProcessPoolExecutor: 进程池,提供异步调用 Both implement the same interface, which is defined by the abstract Executor class.
2、基本方法
1、submit(fn, *args, **kwargs) 异步提交任务 2、map(func, *iterables, timeout=None, chunksize=1) 取代for循环submit的操作 3、shutdown(wait=True) 相当于进程池的pool.close()+pool.join()操作 wait=True,等待池内所有任务执行完毕回收完资源后才继续 wait=False,立即返回,并不会等待池内的任务执行完毕 但不管wait参数为何值,整个程序都会等到所有任务执行完毕 submit和map必须在shutdown之前 4、result(timeout=None) 取得结果 5、add_done_callback(fn) 回调函数
3、进程池
The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously.
ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned. class concurrent.futures.ProcessPoolExecutor(max_workers=None, mp_context=None) An Executor subclass that executes calls asynchronously using a pool of at most max_workers processes. If max_workers is None or not given, it will default to the number of processors on the machine. If max_workers is lower or equal to 0, then a ValueError will be raised.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor import os import time def task(name): print('%s is running 《pid: %s》' % (name, os.getpid())) time.sleep(2) if __name__ == '__main__': # p = Process(target=task, args=('子',)) # p.start pool = ProcessPoolExecutor(4) # 进程池max_workers:4个 for i in range(10): # 总共执行10次,每次4个进程的执行 pool.submit(task, '子进程%s' % i) print('主')
4、线程池
ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='') An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously. Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor. New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading. Thread names for worker threads created by the pool for easier debugging.
5、map函数:取代了for+submit
from concurrent.futures import ThreadPoolExecutor,ProcessPoolExecutor import os,time,random def task(n): print('%s is runing' %os.getpid()) time.sleep(random.randint(1,3)) return n**2 if __name__ == '__main__': executor=ThreadPoolExecutor(max_workers=3) # for i in range(11): # future=executor.submit(task,i) executor.map(task,range(1,12)) #map取代了for+submit
6、异步调用与回调机制
(1)提交任务的两种方式
# 提交任务的两种方式 # 1、同步调用 提交完任务后,拿到结果,再执行下一行代码,导致程序是串行执行 # 2、异步调用 提交完任务后,不用等待任务执行完毕
(2)同步调用
from concurrent.futures import ThreadPoolExecutor import time import random # 吃饭 def eat(name): print('%s is eat' % name) time.sleep(random.randint(1,5)) ret = random.randint(7, 13) * '#' return {'name': name, 'ret': ret} # 称重 def weight(body): name = body['name'] size = len(body['ret']) print('%s 现在的体重是%s' %(name, size)) if __name__ == '__main__': pool = ThreadPoolExecutor(15) rice1 = pool.submit(eat, 'alex').result() # 取得结果 # 执行函数eat weight(rice1) # 执行函数weight rice2 = pool.submit(eat, 'jack').result() weight(rice2) rice3 = pool.submit(eat, 'tom').result() weight(rice3)
(2)同步调用2
(3)回调函数
(4)是钩子函数?
钩子函数是Windows消息处理机制的一部分,通过设置“钩子”,应用程序可以在系统级对所有消息、事件进行过滤,访问在正常情况下无法访问的消息。钩子的本质是一段用以处理系统消息的程序,通过系统调用,把它挂入系统 --- 百度百科的定义
对于前端来说,钩子函数就是指再所有函数执行前,我先执行了的函数,即 钩住 我感兴趣的函数,只要它执行,我就先执行。此概念(或者说现象)跟AOP(面向切面编程)很像
7.线程池爬虫应用
(1)requests模块
import requests # 输入网址,得到网址的源代码 response = requests.get('http://www.cnblogs.com/venicid/p/8923096.html') print(response) # 输出<Response [200]> print(response.text) # 以文本格式输出
(2)线程池爬虫
import requests import time from concurrent.futures import ThreadPoolExecutor # 输入网址,得到网址的源代码 def get_code(url): print('GET ', url) response = requests.get(url) time.sleep(3) code = response.text return {'url': url, 'code': code} # 打印源代码的长度 def print_len(ret): ret = ret.result() url = ret['url'] code_len = len(ret['code']) print('%s length is %s' % (url, code_len)) if __name__ == '__main__': url_list = [ 'http://www.cnblogs.com/venicid/default.html?page=2', 'http://www.cnblogs.com/venicid/p/8747383.html', 'http://www.cnblogs.com/venicid/p/8923096.html', ] pool = ThreadPoolExecutor(2) for i in url_list: pool.submit(get_code, i).add_done_callback(print_len) pool.map(get_code, url_list)