Python - Concurrent Executors

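Concurrent Network Downloads
Launching Processes with concurrent.futures
Exploring Executor.map
Downloads with Progress Display and Error Handling
- Error handling in the flags2 examples
- Using futures.as_completed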

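Concurrent Network Downloads

Comparing the three versions, flags.py, flags_threadpool.py and flags_asyncio.py, leads to these conclusions:

For network I/O, whichever concurrency construct you use (threads or coroutines), well-written code achieves much higher throughput than code that runs the requests one after another.
For HTTP clients that can control how many requests they issue, there is no significant performance difference between threads and coroutines.

For servers that handle many clients at once, there is a difference: coroutines scale better because they use much less memory than threads and avoid the overhead of thread context switches.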
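
The throughput claim can be tried without a network. The following is a rough sketch, not one of the flags scripts: fake_fetch is a hypothetical stand-in that sleeps to simulate I/O latency, and the delay and the number of tasks are arbitrary assumptions.

```python
import time
from concurrent import futures

DELAY = 0.1  # assumed latency of one simulated request, in seconds


def fake_fetch(n: int) -> int:
    """Hypothetical stand-in for a network call: sleep, then return n."""
    time.sleep(DELAY)
    return n


def run_sequential(count: int) -> float:
    """Run the calls one after another and return the elapsed time."""
    t0 = time.perf_counter()
    for n in range(count):
        fake_fetch(n)
    return time.perf_counter() - t0


def run_threaded(count: int) -> float:
    """Run the calls in a thread pool and return the elapsed time."""
    t0 = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=count) as executor:
        list(executor.map(fake_fetch, range(count)))  # consume all results
    return time.perf_counter() - t0


if __name__ == '__main__':
    print(f'sequential: {run_sequential(5):.2f}s')
    print(f'threaded:   {run_threaded(5):.2f}s')
```

With five simulated requests, the sequential run has to wait for every delay in turn, while the threaded run overlaps them.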

A Sequential Download Script

# flags.py

import time
from pathlib import Path
from typing import Callable

import httpx


POP20_CC = ('CH IN US ID BR PK NG BD RU JP MX PH VN ET EG DE IR TR CD FR').split()
BASE_URL = 'http://mp.ituring.com.cn/files/flags'
DEST_DIR = Path('downloaded')


def save_flag(img: bytes, filenames: str) -> None:  # 1
    (DEST_DIR / filenames).write_bytes(img) 


def get_flag(cc: str) -> bytes:  # 2
    url = f'{BASE_URL}/{cc}/{cc}.gif'.lower()
    resp = httpx.get(url, timeout=6.1,  # 3
                     follow_redirects=True)  # 4

    resp.raise_for_status() # 5
    return resp.content


def down_many(cc_list: list[str]) -> int: # 6
    for cc in sorted(cc_list): # 7
        image = get_flag(cc)
        save_flag(image, f'{cc}.gif')
        print(cc, end=' ', flush=True) # 8
    return len(cc_list)


def main(downloader: Callable[[list[str]], int]) -> None:  # 9
    DEST_DIR.mkdir(exist_ok=True) # 10
    t0 = time.perf_counter()  # 11
    count = downloader(POP20_CC)
    elapsed = time.perf_counter() - t0
    print(f'\n{count} downloads in {elapsed:.2f} s')


if __name__ == '__main__':
    main(down_many)

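1. Save the img bytes to a file named filenames in the DEST_DIR directory.
2. Given a country code, build the URL, download the image, and return the binary content of the response.
3. It is good practice to give network operations a reasonable timeout, to avoid blocking for several minutes for no apparent reason.
4. By default, HTTPX does not follow redirects. follow_redirects=True is not needed here; it is set to highlight this difference between httpx and requests. Setting it also leaves room for future changes, in case the images are moved elsewhere.
5. This script does no error handling, but raise_for_status raises an exception when the HTTP status code is not 2XX. Calling it is strongly recommended, so that failures are not silent.
6. down_many is the key function to compare with the concurrent implementations.
7. Iterate over the country codes in alphabetical order, to make it clear that the output order matches the input order. Return the number of flags downloaded.
8. Display one country code at a time on the same line to show progress. end=' ' replaces the usual newline with a space, so the codes appear one after another on one line. flush=True is needed because Python's output is line-buffered by default, meaning printable characters are displayed only after a line break.
9. main must be called with the function that does the downloading. That way main can be used as a library function, and the threadpool and asyncio examples can pass in other implementations of down_many.
10. Create the DEST_DIR directory if needed; do not raise an error if it already exists.
11. Record and report the time taken to run the downloader function.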

Downloading with concurrent.futures

# Example 20-3. flags_threadpool.py: threaded download script using futures.ThreadPoolExecutor

from concurrent import futures

from flags import save_flag, get_flag, main # 1


def download_one(cc: str): # 2
    image = get_flag(cc)
    save_flag(image, f'{cc}.gif')
    print(cc, end=' ', flush=True)
    return cc


def download_many(cc_list: list[str]) -> int:
    with futures.ThreadPoolExecutor() as executor:  # 3
        res = executor.map(download_one, sorted(cc_list))  # 4

    return len(list(res))  # 5


if __name__ == '__main__':
    main(download_many) # 6

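1. Reuse some functions from the flags module.
2. Function to download a single image; this is what each worker will execute.
3. Instantiate the ThreadPoolExecutor as a context manager. The executor.__exit__ method calls executor.shutdown(wait=True), which blocks until all threads are done.
4. The map method is similar to the built-in map, except that download_one is called concurrently from multiple threads. It returns a generator that can be iterated to retrieve the value returned by each call; here, each call to download_one returns a country code.
5. Return the number of results obtained. If any of the threaded calls raised an exception, it is raised here, when the list constructor tries to retrieve the corresponding return value from the iterator returned by executor.map.
6. Call main from the flags module, passing the concurrent version of download_many.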
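
Callout 5 is easy to observe in isolation. Here is a minimal sketch, assuming a made-up risky function that fails for one input; it is not part of the flags scripts. The call to executor.map itself raises nothing, and the exception only surfaces when the corresponding result is retrieved.

```python
from concurrent import futures


def risky(n: int) -> int:
    """Hypothetical task: return n * 10, but fail for n == 2."""
    if n == 2:
        raise ValueError(f'cannot process {n}')
    return n * 10


def collect(values: list[int]) -> tuple[list[int], str]:
    """Consume executor.map results until one of the calls raises."""
    collected: list[int] = []
    error = ''
    with futures.ThreadPoolExecutor(max_workers=3) as executor:
        results = executor.map(risky, values)  # nothing is raised here
        try:
            for res in results:  # the exception surfaces during iteration
                collected.append(res)
        except ValueError as exc:
            error = str(exc)
    return collected, error


if __name__ == '__main__':
    print(collect([0, 1, 2, 3]))
```

The results before the failing call are still delivered in order; the result of the call after it is never retrieved, because iteration stops at the exception.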

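The ThreadPoolExecutor constructor takes several arguments not used here. The first and most important is max_workers, which sets the maximum number of worker threads to run. When max_workers is None (the default), ThreadPoolExecutor has, since Python 3.8, decided the number of threads with the following expression.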

max_workers = min(32, os.cpu_count() + 4)

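The ThreadPoolExecutor documentation explains the rationale:

This default value preserves at least 5 workers for I/O-bound tasks. It utilizes at most 32 CPU cores for CPU-bound tasks that release the GIL. And it avoids implicitly using very large resources on many-core machines.
ThreadPoolExecutor now also reuses idle worker threads before starting max_workers worker threads.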
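
The formula can be checked for a few CPU counts. This is only a sketch: default_thread_workers is a hypothetical helper written for these notes, and treating an unknown CPU count as 1 is an assumption about how a missing value is handled. Newer Python versions may count CPUs differently, so the value reported by a real executor can differ.

```python
import os
from concurrent import futures
from typing import Optional


def default_thread_workers(cpu_count: Optional[int]) -> int:
    """Apply min(32, cpus + 4); an unknown CPU count is treated as 1."""
    return min(32, (cpu_count or 1) + 4)


if __name__ == '__main__':
    print('formula: ', default_thread_workers(os.cpu_count()))
    with futures.ThreadPoolExecutor() as executor:
        # _max_workers is undocumented; it is read here only for inspection
        print('executor:', executor._max_workers)
```

A single-CPU machine still gets 5 threads, and a machine with 28 or more CPUs is capped at 32.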

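Where Are the Futures?

Futures are core components of the concurrent.futures module and of the asyncio package, yet as users of these libraries we sometimes never see them. Example 20-3 relies on futures behind the scenes, but the code I wrote never touches them directly.

Since Python 3.4, there are two classes named Future in the standard library: concurrent.futures.Future and asyncio.Future. They serve the same purpose: an instance of either Future class represents a deferred computation that may or may not have completed. This is similar to the Deferred class in Twisted, the Future class in Tornado, and Promise in modern JavaScript.

Futures encapsulate pending operations so that they can be put in queues, their state of completion can be queried, and their results (or exceptions) can be retrieved when they become available.

An important thing to remember is that you should not create futures yourself: they are meant to be instantiated only by the concurrency framework (concurrent.futures or asyncio). The reason is simple: a future represents something that will eventually run, so it must be scheduled to run, and that is the framework's job. In particular, concurrent.futures.Future instances are created only when a callable is submitted for execution to a concurrent.futures.Executor subclass. For example, Executor.submit() takes a callable, schedules it to run, and returns a Future.

Application code is not supposed to change the state of a future: the concurrency framework changes it when the computation the future represents is done, and we cannot control when that happens.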

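Both kinds of future have a .done() method that does not block and returns a Boolean telling whether the callable wrapped by the future has finished executing. Client code usually does not ask whether a future is done; it waits to be notified. That is why both Future classes have an .add_done_callback() method: you give it a callable, and the callable is invoked with the future as its only argument when the future is done. Note that with ThreadPoolExecutor the callback normally runs in the worker thread that ran the wrapped function, not in the thread that submitted it; according to the concurrent.futures documentation, callbacks always run in a thread of the process that added them.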

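There is also a .result() method, which works the same in both classes when the future is done: it returns the result of the callable, or re-raises whatever exception was raised when the callable ran. When the future is not done, however, the behavior of result differs sharply between the two classes. On a concurrent.futures.Future instance, f.result() blocks the caller's thread until the result is ready. An optional timeout argument can be passed, and if the future is not done within the specified time, result raises TimeoutError. The asyncio.Future.result method does not support a timeout; in asyncio, await is the preferred way to get the result of a future. However, await does not work with concurrent.futures.Future instances.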
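
The methods described above can be exercised on a concurrent.futures.Future with a small sketch. The names inspect_future and wait_then_answer are invented for illustration, and a threading.Event is used as a gate so that the task is guaranteed to be unfinished while done() and result(timeout=...) are called.

```python
import threading
from concurrent import futures


def inspect_future() -> tuple[bool, bool, int, list[int]]:
    """Exercise done(), result(timeout), add_done_callback() and result()."""
    gate = threading.Event()
    seen: list[int] = []  # filled by the done callback

    def wait_then_answer() -> int:
        gate.wait()  # stay unfinished until the gate is opened
        return 42

    with futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(wait_then_answer)
        future.add_done_callback(lambda f: seen.append(f.result()))
        done_early = future.done()  # does not block
        try:
            future.result(timeout=0.05)  # blocks for at most 0.05 s
            timed_out = False
        except futures.TimeoutError:
            timed_out = True
        finally:
            gate.set()  # let the task finish
        value = future.result()  # blocks until the result is ready
    # leaving the with block waits for the worker, so the callback has run
    return done_early, timed_out, value, seen


if __name__ == '__main__':
    print(inspect_future())
```

The future is created by executor.submit, never by hand, which matches the rule that only the framework instantiates futures.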

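Several functions in both libraries return futures; others use them in their implementation in a way that is transparent to the user. Executor.map in Example 20-3 belongs to the latter group: it returns an iterator whose __next__ method calls the result method of each future, so we get the results of the futures and not the futures themselves.

To get a practical look at futures, we can rewrite Example 20-3 with the concurrent.futures.as_completed function, which takes an iterable of futures and returns an iterator that yields futures as they are done.

Using futures.as_completed requires changes only to the download_many function: the higher-level executor.map call is replaced by two for loops, one to create and schedule the futures, the other to retrieve their results. While we are at it, we add a few print calls to display each future before and after it is done. Example 20-4 shows the new download_many function, whose code grew from 5 to 17 lines, but now we get to inspect the mysterious futures. The remaining functions are the same as in Example 20-3.

# Example 20-4. flags_threadpool_futures.py: replacing executor.map with executor.submit and futures.as_completed in the download_many function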

def download_many(cc_list: list[str]) -> int:

    with futures.ThreadPoolExecutor(max_workers=3) as executor:  # 1
        to_do: list[futures.Future] = []
        for cc in sorted(cc_list):  # 2
            future = executor.submit(download_one, cc)  # 3
            to_do.append(future)  # 4
            print(f'Schedule for {cc}: {future}')  # 5

        for count, future in enumerate(futures.as_completed(to_do), 1):  # 6
            res: str = future.result()  # 7
            print(f'{future} result: {res!r}')  # 8

    return count

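1. Set max_workers to 3 so that pending futures can be observed in the output.
2. Iterate over the country codes alphabetically, to make it clear that results arrive out of order.
3. executor.submit() schedules the callable to be executed and returns a future representing the pending operation.
4. Store each future so that it can later be passed to as_completed.
5. Display a message with the country code and the corresponding future.
6. as_completed yields futures as they are completed.
7. Get the result of the future.
8. Display the future and its result.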

Note that the call to future.result() never blocks in this example, because the future comes out of as_completed. Example 20-5 shows the output of one run of Example 20-4.

# Example 20-5
Schedule for BR: <Future at 0x27f46009000 state=running> # 1
Schedule for CH: <Future at 0x27f460099f0 state=running>
Schedule for ID: <Future at 0x27f4600a260 state=running>
Schedule for IN: <Future at 0x27f4600aad0 state=pending> # 2
Schedule for US: <Future at 0x27f4600ab90 state=pending>
IDCH  <Future at 0x27f4600a260 state=finished returned str> result: 'ID'
<Future at 0x27f460099f0 state=finished returned str> result: 'CH'
BR <Future at 0x27f46009000 state=finished returned str> result: 'BR' # 3
IN <Future at 0x27f4600aad0 state=finished returned str> result: 'IN'
US <Future at 0x27f4600ab90 state=finished returned str> result: 'US'

5 downloads in 0.39 s

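1. The futures are scheduled in alphabetical order; the repr() of a future shows its state. The first three are running because there are three worker threads.
2. The last two futures are pending, waiting for a worker thread to become available.
3. The first BR on this line is output by download_one running in a worker thread; the rest of the line is output by download_many.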
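
The out-of-order arrival seen in the output can be reproduced without downloads. In this sketch nap is a hypothetical task that sleeps for a given time, and the delays are arbitrary values spaced far enough apart to make the completion order predictable.

```python
import time
from concurrent import futures


def nap(seconds: float) -> float:
    """Hypothetical task: sleep for the given time and return it."""
    time.sleep(seconds)
    return seconds


def completion_order(delays: list[float]) -> list[float]:
    """Submit one nap per delay; collect results as the futures finish."""
    with futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        to_do = [executor.submit(nap, d) for d in delays]
        # as_completed yields each future once it is done, so result()
        # returns immediately here
        return [future.result() for future in futures.as_completed(to_do)]


if __name__ == '__main__':
    print(completion_order([0.6, 0.2, 0.4]))
```

The futures are submitted in the order 0.6, 0.2, 0.4, but they come out of as_completed shortest first.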

The next section briefly shows how concurrent.futures makes it easy to work around the GIL for CPU-intensive jobs.

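Launching Processes with concurrent.futures

The concurrent.futures documentation page is subtitled "Launching parallel tasks". The package enables truly parallel computation because its ProcessPoolExecutor class distributes work across multiple Python processes.

Both ProcessPoolExecutor and ThreadPoolExecutor implement the Executor interface, so with concurrent.futures it is very easy to switch from a thread-based solution to a process-based one.

There is no advantage in using ProcessPoolExecutor for the flag download example or for any other I/O-bound job. This is easy to verify by changing these lines in Example 20-3: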

    def download_many(cc_list: list[str]) -> int:
        with futures.ThreadPoolExecutor() as executor:

to this:

    def download_many(cc_list: list[str]) -> int:
        with futures.ProcessPoolExecutor() as executor:

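The ProcessPoolExecutor constructor also has a max_workers parameter, which defaults to None. In that case, the executor limits the number of workers to the number returned by os.cpu_count().

Compared with threads, processes use more memory and take longer to start, so the real value of ProcessPoolExecutor shows in CPU-intensive jobs. Let's go back to the prime-checking example from Section 19.6 and rewrite it with concurrent.futures.

Differences between ThreadPoolExecutor and ProcessPoolExecutor:

In old Python versions (before 3.5), ThreadPoolExecutor.__init__ required the max_workers argument, which sets the number of threads in the pool; it is now optional, with the default described above. In ProcessPoolExecutor the argument is optional as well and is usually left out, the default being the number of CPUs returned by os.cpu_count(). This makes sense: for CPU-bound processing there is no point in asking for more workers than CPUs, whereas for I/O-bound processing a single ThreadPoolExecutor may use 10, 100 or 1,000 threads. The best number depends on what the tasks do and on how much memory is available, so finding it requires careful testing.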
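
Because both classes implement the Executor interface, the executor class can be passed in as a parameter. The sketch below assumes a naive is_prime written for these notes (it is not the one from the primes module used later) and a hypothetical count_primes helper.

```python
import math
from concurrent import futures


def is_prime(n: int) -> bool:
    """Naive trial division; a stand-in for a CPU-bound task."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True


def count_primes(numbers, executor_cls=futures.ThreadPoolExecutor,
                 max_workers=None) -> int:
    """Count the primes in numbers with whichever executor class is given."""
    # Both executor classes accept max_workers as their first argument.
    with executor_cls(max_workers) as executor:
        return sum(executor.map(is_prime, numbers))


if __name__ == '__main__':
    print(count_primes(range(100)))
```

Passing futures.ProcessPoolExecutor as executor_cls, from code that runs under an if __name__ == '__main__': guard, would run the same checks in separate processes; only the class changes, not the calling code.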

Rewriting the Multicore Prime Checker

# Example 20-6. proc_pool.py: procs.py rewritten with ProcessPoolExecutor
import sys
from concurrent import futures  # 1
from time import perf_counter
from typing import NamedTuple

from primes import is_prime, NUMBERS


class PrimeResult(NamedTuple): # 2
    n: int
    flag: bool
    elapsed: float


def check(n: int) -> PrimeResult:
    t0 = perf_counter()
    res = is_prime(n)
    return PrimeResult(n, res, perf_counter() - t0)


def main() -> None:
    if len(sys.argv) < 2:
        workers = None  # 3
    else:
        workers = int(sys.argv[1])
    executor = futures.ProcessPoolExecutor(workers)  # 4
    actual_workers = executor._max_workers  # type: ignore  # 5

    print(f'Checking {len(NUMBERS)} numbers with {actual_workers} processes:')
    t0 = perf_counter()

    numbers = sorted(NUMBERS, reverse=True)  # 6
    with executor:  # 7
        for n, prime, elapsed in executor.map(check, numbers):  # 8
            label = 'P' if prime else ' '
            print(f'{n:16} {label} {elapsed:9.6f}s')

    time = perf_counter() - t0
    print(f'Total time:{time:.2f}s')


if __name__ == '__main__':
    main()

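1. There is no need to import multiprocessing, SimpleQueue and so on: concurrent.futures hides all of that.
2. The PrimeResult tuple and the check function are the same as in procs.py, but the queues and the worker function are no longer needed.
3. When no command-line argument is given, we do not decide the number of workers ourselves; we set it to None and let ProcessPoolExecutor decide.
4. The ProcessPoolExecutor is built before the with block in (7) so that the actual number of workers can be displayed on the next line.
5. _max_workers is an undocumented instance attribute of ProcessPoolExecutor. I decided to use it to show how many workers there are when the workers variable is None. As expected, Mypy complains about it, so I added the type: ignore comment to silence the error.
6. Sort the numbers to be checked in descending order. This exposes a difference in behavior between proc_pool.py and procs.py, explained after this example.
7. Use the executor as a context manager.
8. The executor.map call returns the PrimeResult instances returned by check, in the same order as the numbers argument.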

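If you run Example 20-6, the results appear in strictly descending order, as shown in Example 20-7. In contrast, the order of the output of procs.py (Example 19-13) depends on how hard it is to check whether each number is prime. For example, procs.py shows the result for 7777777777777777 near the top, because it has a small divisor, 7, so is_prime quickly determines that it is not prime.

In contrast, is_prime takes a long time to determine that 7777777536340681 is composite, and even longer to determine that 7777777777777753 is prime, so both numbers appear near the end of the output of procs.py.

When you run proc_pool.py, you will notice not only that the results come in descending order, but also that the program appears to hang after showing the result for 9999999999999999.

# Example 20-7. Output of proc_pool.py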
Checking 20 numbers with 16 processes:
9999999999999999    0.000005s  # 1
9999999999999917 P  4.352024s  # 2
7777777777777777    0.000007s  # 3
7777777777777753 P  4.133456s 
7777777536340681    4.024149s
6666667141414921    3.915010s
6666666666666719 P  3.866880s
6666666666666666    0.000003s
5555555555555555    0.000006s
5555555555555503 P  3.639214s
5555553133149889    3.640252s
4444444488888889    3.602341s
4444444444444444    0.000002s
4444444444444423 P  3.584374s
3333335652092209    3.098235s
3333333333333333    0.000006s
3333333333333301 P  3.174799s
 299593572317531 P  0.940279s
 142702110479723 P  0.666768s
               2 P  0.000002s
Total time:4.49s

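1. This line appears very quickly.
2. This line takes more than 4.3 s to appear.
3. All the remaining lines appear almost immediately.

Why does proc_pool.py behave this way? Here is the explanation:

As mentioned before, executor.map(check, numbers) returns the results in the same order as the numbers are given.
By default, proc_pool.py uses as many workers as there are CPUs, which is what ProcessPoolExecutor does when max_workers is None. On my laptop, that is 16 processes.
Because numbers is submitted in descending order, 9999999999999999 is checked first. It is divisible by 9, so its result comes back quickly.
The second number is 9999999999999917, the largest prime in the sample, which takes longer to check than any other number.
Meanwhile, the remaining 15 processes check the other numbers, which are primes, composites with large factors, or composites with very small factors.
When the worker in charge of 9999999999999917 finally determines that it is prime, all the other processes have already finished their work, so their results appear immediately.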
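
The same effect, results held back behind a slow early item, can be shown with threads and sleeps instead of processes and primes. This is an illustrative sketch only: ordered_vs_finished and nap are made-up names, and the delays are arbitrary.

```python
import threading
import time
from concurrent import futures


def ordered_vs_finished(delays: list[float]) -> tuple[list[float], list[float]]:
    """Return (order delivered by map, order in which tasks really finished)."""
    finished: list[float] = []
    lock = threading.Lock()

    def nap(seconds: float) -> float:
        time.sleep(seconds)
        with lock:
            finished.append(seconds)  # record the real completion order
        return seconds

    with futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        # map yields results in submission order, so the first (slowest)
        # item holds back the others even though they finish earlier
        ordered = list(executor.map(nap, delays))
    return ordered, finished


if __name__ == '__main__':
    print(ordered_vs_finished([0.6, 0.2, 0.4]))
```

The first list follows the input order, while the second shows that the shorter tasks had already finished while map was still waiting for the first one.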

One more note: if ProcessPoolExecutor in Example 20-6 is replaced with ThreadPoolExecutor, the total time rises to about 20 s, further evidence that CPU-bound tasks are better served by ProcessPoolExecutor.

Exploring Executor.map

This section takes a closer look at Executor.map:


from concurrent import futures
from time import sleep, strftime


def display(*args): # 1
    print(strftime('[%H:%M:%S]'), end=' ')
    print(*args)


def loiter(n): # 2
    msg = '{}loiter({}): doing nothing for {}s...'
    display(msg.format('\t' * n, n, n))
    sleep(n)
    msg = '{}loiter({}): done'
    display(msg.format('\t' * n, n))
    return n * 10 # 3


def main():
    print('Script starting')
    with futures.ThreadPoolExecutor(max_workers=3) as executor:  # 4
        results =  executor.map(loiter, range(5))  # 5
        display('results:', results) # 6
        for i, result in enumerate(results):  # 7
            display('result {}:{}'.format(i, result)) 

if __name__ == '__main__':
    main()

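1. This function simply prints whatever arguments it gets, preceded by a timestamp in the format [HH:MM:SS].
2. loiter does even less: it displays a message when it starts, sleeps for n seconds, then displays a message when it ends. The messages are indented with tabs according to the value of n.
3. loiter returns n * 10 so that we can see how results are collected.
4. Create a ThreadPoolExecutor with three threads.
5. Submit five tasks to the executor. Since there are only three threads, only three tasks start right away: loiter(0), loiter(1) and loiter(2). This is a non-blocking call.
6. Immediately display the result of calling executor.map: it is a generator, as the output in Example 20-9 shows.
7. The enumerate call in the for loop implicitly invokes next(results), which in turn invokes _f.result() on the (internal) _f future representing the first call, loiter(0). The result method blocks until the future is done, so each iteration of the loop has to wait for the next result to be ready.

Example 20-9 shows the output of one run of Example 20-8.

# Example 20-9. Output of one run of demo_executor_map.py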
Script starting
[23:10:39] loiter(0): doing nothing for 0s... # 1
[23:10:39] loiter(0): done
[23:10:39] 	loiter(1): doing nothing for 1s... # 2
[23:10:39] 		loiter(2): doing nothing for 2s...
[23:10:39] 			loiter(3): doing nothing for 3s...[23:10:39]
 results: <generator object Executor.map.<locals>.result_iterator at 0x000001BC1D181A10> # 3
[23:10:39] result 0:0 # 4
[23:10:40] 	loiter(1): done # 5
[23:10:40] 				loiter(4): doing nothing for 4s...
[23:10:40] result 1:10 # 6
[23:10:41] 		loiter(2): done # 7
[23:10:41] result 2:20
[23:10:42] 			loiter(3): done
[23:10:42] result 3:30
[23:10:44] 				loiter(4): done # 8
[23:10:44] result 4:40

Process finished with exit code 0

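1. The first thread runs loiter(0), so it sleeps for 0 s and returns even before the second thread has a chance to start, but your mileage may vary.
2. loiter(1) and loiter(2) start immediately (the thread pool has three workers, so it can run three functions concurrently). In this run loiter(3) also starts at once, in the thread freed by loiter(0).
3. This line shows that the result of executor.map (results) is a generator. Nothing so far blocks, regardless of the number of tasks and of the value of max_workers.
4. Execution may block here, depending on the arguments passed to loiter: the __next__ method of the results generator must wait until the first future is done. In this case it does not block, because loiter(0) finished before the loop started. Note that everything up to this point happened within the same second: 23:10:39.
5. One second later, at 23:10:40, loiter(1) is done. Its thread is free to start loiter(4).
6. The result of loiter(1) is displayed: 10. Now the for loop blocks waiting for the result of loiter(2).
7. The pattern repeats: loiter(2) is done and its result is displayed; the same goes for loiter(3).
8. There is a 2 s delay until loiter(4) is done, because it started at 23:10:40 and did nothing for 4 s.

Executor.map is easy to use, but it is often preferable to get results as they become ready, regardless of the order of submission. To do that, combine Executor.submit with futures.as_completed, as in Example 20-4.

The combination of executor.submit and futures.as_completed is more flexible than executor.map, because submit can handle different callables and arguments, while executor.map is designed to run the same callable on different arguments. In addition, the set of futures passed to futures.as_completed may come from more than one executor, for example some created by a ThreadPoolExecutor instance and others by a ProcessPoolExecutor.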
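
A short sketch of that flexibility follows. The callables shout and add are invented for illustration, and both pools are thread pools here to keep the example simple; that is an assumption of the sketch, not a requirement of as_completed.

```python
from concurrent import futures


def shout(text: str) -> str:
    """Hypothetical task taking one string argument."""
    return text.upper()


def add(a: int, b: int) -> int:
    """Hypothetical task taking two integer arguments."""
    return a + b


def gather_mixed() -> set:
    """Submit different callables to two executors and merge the results."""
    with futures.ThreadPoolExecutor(max_workers=2) as pool_a, \
            futures.ThreadPoolExecutor(max_workers=2) as pool_b:
        to_do = [
            pool_a.submit(shout, 'flags'),  # one callable, one argument
            pool_a.submit(add, 2, 3),       # another callable, two arguments
            pool_b.submit(pow, 2, 10),      # a built-in, on the second pool
        ]
        # a set is used because the completion order is not predictable
        return {future.result() for future in futures.as_completed(to_do)}


if __name__ == '__main__':
    print(gather_mixed())
```

The second pool could be a ProcessPoolExecutor instead, as long as the submitted callables are module-level functions that can be pickled.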

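# ThreadPoolExecutor, second approach: submit and as_completed
from concurrent import futures
from time import sleep, strftime

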
def display(*args):
    print(strftime('[%H:%M:%S]'), end=' ')
    print(*args)


def loiter(n):
    msg = '{}loiter({}): doing nothing for {}s...'
    display(msg.format('\t' * n, n, n))
    sleep(n)
    msg = '{}loiter({}): done'
    display(msg.format('\t' * n, n))
    return n * 10


def main():
    print('Script starting')
    to_do = []
    with futures.ThreadPoolExecutor(max_workers=3) as executor:  # max_workers=3: number of threads in the pool
        for n in range(5):
            future = executor.submit(loiter, n)
            to_do.append(future)

        for i, future in enumerate(futures.as_completed(to_do)):
            display('result {}:{}'.format(i, future.result()))


if __name__ == '__main__':
    main()

Multiprocess version:

def display(*args):
    print(strftime('[%H:%M:%S]'), end=' ')
    print(*args)


def loiter(n):
    msg = '{}loiter({}): doing nothing for {}s...'
    display(msg.format('\t' * n, n, n))
    sleep(n)
    msg = '{}loiter({}): done'
    display(msg.format('\t' * n, n))
    return n * 10


def main():
    print('Script starting')
    with futures.ProcessPoolExecutor() as executor:  # max_workers defaults to the number of CPUs
        results = executor.map(loiter, range(5))  # pass 0 through 4 to loiter; returns an iterator
        display('results:', results)
        for i, result in enumerate(results):  # retrieve each value returned by loiter
            display('result {}:{}'.format(i, result))

if __name__ == '__main__':
    main()


# out:
'''
Script starting
[14:05:18] results: <generator object _chain_from_iterable_of_lists at 0x000001D7C5266B90>
[14:05:18] loiter(0): doing nothing for 0s...
[14:05:18] loiter(0): done
[14:05:18] 	loiter(1): doing nothing for 1s...
[14:05:18] result 0:0
[14:05:18] 		loiter(2): doing nothing for 2s...
[14:05:18] 			loiter(3): doing nothing for 3s...
[14:05:18] 				loiter(4): doing nothing for 4s...
[14:05:19] 	loiter(1): done
[14:05:19] result 1:10
[14:05:20] 		loiter(2): done
[14:05:20] result 2:20
[14:05:21] 			loiter(3): done
[14:05:21] result 3:30
[14:05:22] 				loiter(4): done
[14:05:22] result 4:40

Process finished with exit code 0
'''

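The main() call must be placed under if __name__ == '__main__':. Otherwise, on platforms where worker processes are started by re-importing the main module (such as Windows), each worker calls main() again and the pool fails with an error like this: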

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Reference: https://www.likecs.com/show-305534761.html#sc=1800

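Downloads with Progress Display and Error Handling

The scripts in the previous sections have no error handling, to make them easier to read and to compare the structure of the three approaches (sequential, threaded and asynchronous).

The flags2 series of examples was created to handle the various errors that can occur:

flags2_common.py
This module contains functions and settings shared by all the flags2 examples, including a main function that takes care of command-line parsing, timing and reporting results. The code in this script is really support code.
flags2_sequential.py
A sequential HTTP client with proper error handling and a progress bar. Its download_one function is also used by flags2_threadpool.py.
flags2_threadpool.py
A concurrent HTTP client based on futures.ThreadPoolExecutor, showing how to handle errors and integrate the progress bar.
flags2_asyncio.py
Same functionality as the previous script, but implemented with asyncio and httpx.

Error Handling in the flags2 Examples

The three examples use the same strategy to handle HTTP 404 errors in the function in charge of downloading a single file (download_one). Any other exception propagates, to be handled by the download_many function or by the supervisor coroutine.

Again, we start with the sequential code, which is easier to follow and is mostly reused by the thread pool script. Example 20-14 shows the functions that do the actual downloads in the flags2_sequential.py and flags2_threadpool.py scripts.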

from collections import Counter
from http import HTTPStatus

import httpx
import tqdm  # type: ignore # 1
from flags2_common import main, save_flag, DownloadStatus  # 2

DEFAULT_CONCUR_REQ = 1
MAX_CONCUR_REQ = 1


def get_flag(base_url: str, cc: str) -> bytes:
    url = f'{base_url}/{cc}/{cc}.gif'.lower()
    resp = httpx.get(url, timeout=3.1, follow_redirects=True)
    resp.raise_for_status()  # 3
    return resp.content


def download_one(cc: str, base_url: str, verbose: bool = False) -> DownloadStatus:
    try:
        image = get_flag(base_url, cc)
    except httpx.HTTPStatusError as exc: # 4
        res = exc.response
        if res.status_code == HTTPStatus.NOT_FOUND:  # 5
            status = DownloadStatus.NOT_FOUND
            msg = f'not found: {res.url}'
        else:
            raise  # 6
    else:
        save_flag(image, f'{cc}.gif')
        status = DownloadStatus.OK
        msg = 'OK'
    if verbose:  # 7
        print(cc, msg)

    return status

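1. Import the tqdm progress-bar library, and tell Mypy to skip checking it.
2. Import two functions and an Enum from the flags2_common module.
3. Raises HTTPStatusError if the HTTP status code is not in range(200, 300).
4. download_one catches HTTPStatusError to handle HTTP 404 specifically.
5. Set the local status variable to DownloadStatus.NOT_FOUND; DownloadStatus is an Enum imported from flags2_common.py.
6. Any other HTTPStatusError is re-raised and propagates to the caller.
7. If the -v/--verbose command-line option is set, display the country code and the status message; this is how progress is shown in verbose mode.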
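
The strategy in download_one (handle 404 locally, let every other status error propagate) can be tried without httpx. Everything in the sketch below is a stand-in invented for these notes: FakeStatusError plays the role of httpx.HTTPStatusError, FAKE_SERVER holds made-up data, and the DownloadStatus enum only mirrors the three status names mentioned in the text, since flags2_common.py is not shown here.

```python
from enum import Enum

# Assumption: mirrors the status names used in the text, not the real module.
DownloadStatus = Enum('DownloadStatus', 'OK NOT_FOUND ERROR')


class FakeStatusError(Exception):
    """Hypothetical stand-in for httpx.HTTPStatusError."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f'HTTP {status_code}')
        self.status_code = status_code


FAKE_SERVER = {'br': b'GIF-br', 'in': b'GIF-in'}  # invented data


def fake_get_flag(cc: str) -> bytes:
    """Return fake image bytes, or raise a fake 404 or 500 error."""
    cc = cc.lower()
    if cc == 'xx':
        raise FakeStatusError(500)  # simulated server failure
    if cc not in FAKE_SERVER:
        raise FakeStatusError(404)
    return FAKE_SERVER[cc]


def download_one(cc: str) -> DownloadStatus:
    try:
        fake_get_flag(cc)
    except FakeStatusError as exc:
        if exc.status_code == 404:  # handled here
            return DownloadStatus.NOT_FOUND
        raise  # anything else is left to the caller
    return DownloadStatus.OK


if __name__ == '__main__':
    print(download_one('BR'), download_one('ZZ'))
```

A missing flag becomes an ordinary return value, while the simulated 500 error escapes download_one and must be handled by whoever called it.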

def download_many(cc_list: list[str],
                  base_url: str,
                  verbose: bool,
                  _unused_concur_req: int) -> Counter[DownloadStatus]:
    counter: Counter[DownloadStatus] = Counter() # 1
    cc_iter = sorted(cc_list) # 2
    if not verbose:
        cc_iter = tqdm.tqdm(cc_iter) # 3
    for cc in cc_iter:
        try:
            status = download_one(cc, base_url, verbose)  # 4
        except httpx.HTTPStatusError as exc: # 5
            error_msg = 'HTTP error {resp.status_code} - {resp.reason_phrase}'
            error_msg = error_msg.format(resp=exc.response)
        except httpx.RequestError as exc: # 6
            error_msg = f'{exc} {type(exc)}'.strip()
        except KeyboardInterrupt: # 7
            break
        else: # 8
            error_msg = ''
        if error_msg:
            status = DownloadStatus.ERROR # 9
        counter[status]  += 1 # 10
        if verbose and error_msg: # 11
            print(f'{cc} error:{error_msg}')
    return counter # 12

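1. This Counter tallies the different download outcomes: DownloadStatus.OK, DownloadStatus.NOT_FOUND or DownloadStatus.ERROR.
2. cc_iter holds the list of country codes received as an argument, sorted alphabetically.
3. If not running in verbose mode, cc_iter is passed to tqdm, which returns an iterator that yields the items of cc_iter while animating the progress bar.
4. Make successive calls to download_one.
5. HTTP status code exceptions raised by get_flag and not handled by download_one are handled here.
6. Other network-related exceptions are handled here. Any other exception aborts the script, because the flags2_common.main function, which calls download_many, has no try/except.
7. Exit the loop if the user presses Ctrl-C.
8. If no exception escaped from download_one, clear the error message.
9. If there was an error, set the local status variable accordingly.
10. Increment the counter for that status.
11. In verbose mode, display the error message, if any, for the current country code.
12. Return counter so that main can display the numbers in the final report.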

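Using futures.as_completed

To integrate the tqdm progress bar and handle errors on each request, the flags2_threadpool.py script uses futures.ThreadPoolExecutor with the futures.as_completed function we have already seen. Example 20-16 is the full listing of flags2_threadpool.py. Only the download_many function is implemented there; the other functions are reused from flags2_common.py and flags2_sequential.py.

# Example 20-16. flags2_threadpool.py: full listing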
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

import httpx
import tqdm

from flags2_common import main, DownloadStatus
from flags2_sequential import download_one  # 1

DEFAULT_CONCUR_REQ = 30  # 2
MAX_CONCUR_REQ = 1000  # 3


def download_many(cc_list: list[str],
                  base_url: str,
                  verbose: bool,
                  concur_req: int) -> Counter[DownloadStatus]:
    counter: Counter[DownloadStatus] = Counter()
    with ThreadPoolExecutor(max_workers=concur_req) as executor:  # 4
        to_do_map = {}  # 5
        for cc in sorted(cc_list):  # 6
            future = executor.submit(download_one, cc,
                                     base_url, verbose)  # 7
            to_do_map[future] = cc # 8

        done_iter = as_completed(to_do_map) # 9
        if not verbose:
            done_iter = tqdm.tqdm(done_iter, total=len(cc_list))  # 10
  
        for future in done_iter: # 11
            try:
                status = future.result() # 12
            except httpx.HTTPStatusError as exc: # 13
                error_msg = 'HTTP error {resp.status_code} - {resp.reason_phrase}'
                error_msg = error_msg.format(resp=exc.response)
            except httpx.RequestError as exc:
                error_msg = f'{exc} {type(exc)}'.strip()
            except KeyboardInterrupt:
                break
            else:
                error_msg = ''

            if error_msg:
                status = DownloadStatus.ERROR
            counter[status] += 1
            if verbose and error_msg:
                cc = to_do_map[future]  # 14
                print(f'{cc} error: {error_msg}')
    return counter


if __name__ == '__main__':
    main(download_many, DEFAULT_CONCUR_REQ, MAX_CONCUR_REQ)

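1. Reuse download_one from flags2_sequential.
2. If the -m/--max_req command-line option is not given, this is the maximum number of concurrent requests, implemented as the size of the thread pool; the actual number may be smaller if fewer flags are being downloaded.
3. MAX_CONCUR_REQ caps the maximum number of concurrent requests regardless of the number of flags to download or of the -m/--max_req command-line option. It is a safety precaution against starting too many threads and using too much memory.
4. Create the executor with max_workers set to concur_req, computed by main as the smallest of MAX_CONCUR_REQ, the length of cc_list, and the value of the -m/--max_req command-line option. This avoids creating more threads than necessary.
5. This dict maps each Future instance, representing one download, to the corresponding country code, for error reporting.
6. Iterate over the list of country codes in alphabetical order. The order of the results depends mostly on the timing of the HTTP responses, but if the size of the thread pool (given by concur_req) is much smaller than len(cc_list), you may notice the downloads batched alphabetically.
7. Each call to executor.submit schedules the execution of one callable and returns a Future instance. The first argument is the callable; the rest are the arguments it will receive.
8. Store the future and the country code in the dict.
9. futures.as_completed returns an iterator that yields futures as each task is done.
10. If not in verbose mode, wrap the result of as_completed with the tqdm function to display the progress bar. Because done_iter has no len, we must tell tqdm the expected number of items with the total= argument, so that tqdm can estimate the work remaining.
11. Iterate over the futures as they are completed.
12. Calling the result method on a future either returns the value returned by the callable or raises whatever exception was raised when the callable ran. This method may block waiting for a result, but not in this example, because as_completed only returns futures that are done.
13. Handle the possible exceptions. The rest of this function is the same as the sequential download_many, except for the next callout.
14. To provide context for the error message, retrieve the country code from to_do_map, using the current future as the key. This was not necessary in the sequential version, which iterated over the list of country codes and so knew the current cc; here we iterate over futures.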
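
The future-to-country-code mapping can be isolated in a short sketch. Here fetch is a hypothetical task that fails for lowercase codes, plain strings stand in for the DownloadStatus members, and tqdm is left out; none of this is the actual flags2 code.

```python
from collections import Counter
from concurrent import futures


def fetch(cc: str) -> str:
    """Hypothetical task: fail for lowercase codes, succeed otherwise."""
    if not cc.isupper():
        raise ValueError('bad country code')
    return 'OK'


def tally(cc_list: list[str]) -> tuple[Counter, list[str]]:
    """Count outcomes and report which country codes failed."""
    counter: Counter = Counter()
    failed: list[str] = []
    with futures.ThreadPoolExecutor(max_workers=3) as executor:
        # map each future to its country code, for error reporting
        to_do_map = {executor.submit(fetch, cc): cc for cc in sorted(cc_list)}
        for future in futures.as_completed(to_do_map):
            try:
                status = future.result()
            except ValueError:
                status = 'ERROR'
                failed.append(to_do_map[future])  # recover the code
            counter[status] += 1
    return counter, sorted(failed)


if __name__ == '__main__':
    print(tally(['BR', 'cn', 'IN', 'us']))
```

Iterating over a dict yields its keys, so to_do_map can be passed straight to as_completed, and the same dict then tells us which country code a failed future belonged to.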

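Python threads are well suited to I/O-intensive applications, and the concurrent.futures package makes them much simpler to use in certain scenarios. With ProcessPoolExecutor, multiple cores can also be put to work on CPU-intensive problems, provided the computation is "embarrassingly parallel".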

Posted 2022-12-02 23:19 by chuangzhou
