Advanced Python: Thread Pools

1. Overview

The base class for thread pools is Executor in the concurrent.futures module. Executor has two subclasses, ThreadPoolExecutor and ProcessPoolExecutor: ThreadPoolExecutor creates a thread pool, and ProcessPoolExecutor creates a process pool.

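2. Usage

Executor provides the following commonly used methods:

  • submit(fn, *args, **kwargs): submits the function fn to the thread pool. *args are the positional arguments passed to fn, and **kwargs are the keyword arguments passed to fn.
  • map(func, *iterables, timeout=None, chunksize=1): similar to the built-in map(func, *iterables), except that the calls are submitted to the pool right away and run asynchronously on multiple threads.
  • shutdown(wait=True): shuts down the thread pool. With wait=True, the call blocks until all submitted tasks have finished.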
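
The three methods above can be sketched as follows. This is a minimal illustration, not code from the original examples: the function fetch and its retries parameter are hypothetical names made up for the sketch; only submit, map, and shutdown come from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(page, retries=1):
    # Hypothetical task: stands in for real work such as a network request.
    return f'page {page} (retries={retries})'


pool = ThreadPoolExecutor(max_workers=2)

# submit: positional and keyword arguments are forwarded to fetch.
future = pool.submit(fetch, 1, retries=3)
single = future.result()

# map: fetch is called once per item of the iterable.
mapped = list(pool.map(fetch, [1, 2, 3]))

# shutdown: wait=True blocks until all submitted tasks have finished.
pool.shutdown(wait=True)

print(single)
print(mapped)
```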

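submit returns a Future object. The Future class is mainly used to obtain the return value of the task function. Because the task runs asynchronously on another thread, the function is effectively a task that will "complete in the future", which is why Python represents it with a Future.

Future provides the following methods:

  • cancel(): attempts to cancel the task represented by the Future. If the task is currently running or has already finished, it cannot be cancelled and the method returns False; otherwise the task is cancelled and the method returns True.
  • cancelled(): returns True if the task was successfully cancelled.
  • running(): returns True if the task is currently running and cannot be cancelled.
  • done(): returns True if the task was successfully cancelled or has finished running.
  • result(timeout=None): returns the value returned by the task. If the task has not finished yet, the call blocks the current thread; timeout is the maximum number of seconds to block, and a TimeoutError is raised if the task has not finished by then.
  • exception(timeout=None): returns the exception raised by the task. If the task completed without raising, the method returns None.
  • add_done_callback(fn): registers a callback on the Future. fn is called with the Future as its only argument when the task finishes or is cancelled.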
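
To make the Future methods above concrete, here is a small sketch. The task functions slow_ok and fails are hypothetical, and the pool is limited to one worker on the assumption that the third task will still be waiting in the queue when cancel() is called.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_ok():
    # Hypothetical task that succeeds after a short delay.
    time.sleep(0.5)
    return 'ok'


def fails():
    # Hypothetical task that raises.
    raise ValueError('boom')


# A single worker, so the second and third tasks wait in the queue.
pool = ThreadPoolExecutor(max_workers=1)
f1 = pool.submit(slow_ok)
f2 = pool.submit(fails)
f3 = pool.submit(slow_ok)

# f3 should still be queued behind f1 and f2, so it can be cancelled.
cancelled = f3.cancel()

value = f1.result(timeout=5)      # blocks until f1 finishes
error = f2.exception(timeout=5)   # the ValueError raised inside fails()
pool.shutdown()

print(cancelled, f3.cancelled(), f3.done())
print(value, f1.done(), f1.running())
print(repr(error))
```

Note that calling f2.result() instead of f2.exception() would re-raise the ValueError in the calling thread.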

3. Examples

Synchronous code first: the crawler function simulates a web crawler, and a time delay simulates network I/O.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import time


def crawler():
    print('crawl page...')
    time.sleep(2)


def main():
    start = time.time()
    for _ in range(4):
        crawler()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

This takes about 8 seconds:

$ python3 demo00.py
crawl page...
crawl page...
crawl page...
crawl page...
take 8.007 second

3.1 Example 1: Creating a thread pool

Now the same task with a thread pool:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler():
    print(f'{threading.current_thread().name} crawl page...')
    time.sleep(2)


def main():
    start = time.time()
    # Create a thread pool with at most 4 threads; the thread name prefix 'crawler' is optional
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for _ in range(4):
        pool.submit(crawler)
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

$ python3 demo01.py
crawler_0 crawl page...
crawler_1 crawl page...
crawler_2 crawl page...
crawler_3 crawl page...
take 2.003 second
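
ThreadPoolExecutor can also be used as a context manager, which calls shutdown(wait=True) on exit so the explicit pool.shutdown() is not needed. The sketch below is a variation on the example above, with the delay shortened to 0.2 seconds as an assumption to keep it quick to run.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler():
    print(f'{threading.current_thread().name} crawl page...')
    time.sleep(0.2)   # shortened delay for this sketch


start = time.time()
# Leaving the with block calls pool.shutdown(wait=True) automatically.
with ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler') as pool:
    for _ in range(4):
        pool.submit(crawler)
elapsed = time.time() - start
print(f'take {elapsed:2.3f} second')
```

Once the with block has exited, the pool is shut down, and a further pool.submit(...) raises RuntimeError.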

3.2 Example 2: Passing arguments

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        # Multiple arguments can be passed
        pool.submit(crawler, page_num, page_num - 1)
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

$ python3 demo02.py
crawler_0 crawl page 1, index: 0crawler_1 crawl page 2, index: 1

crawler_2 crawl page 3, index: 2
crawler_3 crawl page 4, index: 3
take 2.003 second

3.3 Example 3: Getting return values

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        future = pool.submit(crawler, page_num, page_num - 1)
        # Print the return value
        print(future.result())
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

Running it shows that, surprisingly, the code has become synchronous:

$ python3 demo03.py
crawl page 1, index: 0
page 1 finished
crawl page 2, index: 1
page 2 finished
crawl page 3, index: 2
page 3 finished
crawl page 4, index: 3
page 4 finished
take 8.009 second

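The reason is that future.result() blocks the main thread: the loop cannot move on to the next submit until the current task has finished.

So how do we get the return values without blocking?

Recall the add_done_callback(fn) method from earlier. The function fn is registered as a callback on the Future and is called automatically when the task finishes, so the main thread is not blocked.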

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def get_result(future):
    print(future.result())


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        future = pool.submit(crawler, page_num, page_num - 1)
        # print(future.result())
        # Register a callback that prints the return value when the task finishes
        future.add_done_callback(get_result)
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

$ python3 demo03.py
crawler_0 crawl page 1, index: 0
crawler_1 crawl page 2, index: 1
crawler_2 crawl page 3, index: 2
crawler_3 crawl page 4, index: 3
page 1 finished
page 4 finishedpage 2 finished
page 3 finished

take 2.003 second
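
Another option, sketched here as an alternative rather than as part of the original example, is to submit every task first and only then collect the results in the main thread. The sketch uses as_completed from concurrent.futures, which yields each Future as it finishes; the delay is shortened to 0.2 seconds as an assumption to keep the run quick.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawler(page_num, index):
    time.sleep(0.2)   # shortened delay for this sketch
    return f'page {page_num} finished'


start = time.time()
with ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler') as pool:
    # Submit everything first so that all four tasks can start running...
    futures = [pool.submit(crawler, n, n - 1) for n in range(1, 5)]
    # ...then collect each result as its task completes (completion order).
    results = [f.result() for f in as_completed(futures)]
elapsed = time.time() - start

print(sorted(results))
print(f'take {elapsed:2.3f} second')
```

Here result() no longer serializes the work, because all tasks are already running by the time the first result() call blocks.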

3.4 Example 4: Quickly creating threads with map

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor

l1 = [1, 2, 3, 4]
l2 = ['a', 'b', 'c', 'd']


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    # Use map to quickly create threads, similar to Python's built-in map
    results = pool.map(crawler, l1, l2)
    pool.shutdown()
    # The return value is a generator that yields the tasks' return values
    print(type(results))
    print(list(results))
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()

$ python3 demo04.py
crawler_0 crawl page 1, index: a
crawler_1 crawl page 2, index: b
crawler_2 crawl page 3, index: c
crawler_3 crawl page 4, index: d
<class 'generator'>
['page 1 finished', 'page 2 finished', 'page 3 finished', 'page 4 finished']
take 2.003 second
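
The generator returned by map yields results in the order of the inputs, not in the order the tasks finish. The sketch below illustrates this with a hypothetical delay argument that makes the last page finish first; the delays are assumptions chosen only for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, delay):
    # The delay argument is hypothetical; it only controls finishing order.
    time.sleep(delay)
    return f'page {page_num} finished'


pages = [1, 2, 3]
delays = [0.3, 0.2, 0.1]   # page 3 finishes first, page 1 last

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(crawler, pages, delays))

print(results)
```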

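Note: "creating threads" is not strictly accurate. map submits tasks to the pool; the pool's worker threads are started on demand as tasks are submitted (up to max_workers) and then wait to be reused for later tasks.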
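
One way to check when the worker threads actually appear is to count live threads with threading.active_count() around pool creation and the first submit. This is a rough sketch of behavior observed on CPython, where the worker is started inside submit; it is an implementation detail rather than a documented guarantee.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

before = threading.active_count()

pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
after_create = threading.active_count()   # no worker thread started yet

future = pool.submit(time.sleep, 0.2)
after_submit = threading.active_count()   # one worker started for the task

pool.shutdown()
print(before, after_create, after_submit)
```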

posted @ 2022-04-07 11:21  王舰