High-Performance Asynchronous Coroutine Crawler

1. Introduction

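When running I/O-bound tasks, a program often blocks while waiting for I/O. Asynchronous coroutines in Python are one way to address this.
Coroutines entered Python in version 3.4, where they were still built on generator objects. Python 3.5 added the async/await keywords, which make coroutines much more convenient to write; this article uses async/await throughout.
The most commonly used coroutine library in Python is asyncio. It detects when a coroutine is waiting on I/O and switches to another one at the application level (asynchronous I/O). This applies to network I/O, and an HTTP connection is exactly that kind of operation; ordinary blocking calls are not detected.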

2. Basic coroutine concepts

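event_loop
- The event loop is essentially an infinite loop. Functions can be registered on it, and when the condition they are waiting for occurs, the loop calls the corresponding handler. It can be obtained with asyncio.get_event_loop(); inside a coroutine, newer Python versions prefer asyncio.get_running_loop().
coroutine
- A coroutine object. It can be registered on the event loop, which then calls it. A function defined with the async keyword does not run immediately when called; it returns a coroutine object instead. This is similar to a generator function containing yield, which returns a generator object when called.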
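task
- A task further wraps a coroutine object and tracks the states the task goes through.
- It can be created with asyncio.ensure_future(coroutine) or with event_loop.create_task(coroutine); the former does not need an explicit event_loop object.
- With multiple tasks, collect them in a list such as tasks, pass it to asyncio.wait(tasks), and register the result on the event loop. Since Python 3.11 the list must contain Task objects, not bare coroutine objects.
- task.add_done_callback(callback) binds a callback to a task; inside the callback, task.result() returns the value the coroutine returned.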
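async/await
- These keywords first appeared in Python 3.5 and exist specifically for coroutines. async defines a coroutine; await suspends the current coroutine until the awaited operation finishes.
- Where async is used:
  - Every asynchronous function definition is prefixed with async.
  - Inside an asynchronous function, async in front of with ... as enters an asynchronous context manager.
- The object after await can take one of three forms:
  - A native coroutine object.
  - A generator-based coroutine object returned from a function decorated with types.coroutine().
  - An object with an __await__ method that returns an iterator.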
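
The concepts above can be tied together in a minimal sketch. The names add and on_done are hypothetical helpers invented for this illustration, and asyncio.sleep stands in for real I/O; it shows a coroutine object being created, wrapped in a task, given a callback, and awaited.

```python
import asyncio


async def add(a, b):
    # asyncio.sleep stands in for a real non-blocking I/O wait.
    await asyncio.sleep(0.01)
    return a + b


def on_done(task):
    # Invoked by the event loop once the task has finished.
    print("callback got:", task.result())


async def main():
    coro = add(1, 2)            # calling an async function only creates a coroutine object
    print(type(coro).__name__)  # the function body has not started running yet

    task = asyncio.ensure_future(coro)  # wrap the coroutine in a Task and schedule it
    task.add_done_callback(on_done)

    return await task           # suspend main() until the task completes


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The value returned by await task is the same value that task.result() yields inside the callback.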

3. loop.run_in_executor()

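It lets an asynchronous coroutine run synchronous, blocking operations.
The method runs a synchronous function or method in an executor (a thread pool by default) and returns an awaitable, so the event loop is not blocked while the call is in progress.
Syntax: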
```
await loop.run_in_executor(executor, func, *args)
```
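- executor is a concurrent.futures.Executor instance used to run the synchronous function. If None is passed, the loop's default executor (a thread pool) is used.
- func is the synchronous function to run.
- *args are the positional arguments passed to func.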
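
As a small illustration of the executor parameter, the sketch below passes an explicit ThreadPoolExecutor instead of None. The function blocking_io is a hypothetical stand-in for any blocking call, and the pool size of 4 is an arbitrary choice for the demo.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(seconds):
    # Hypothetical stand-in for a blocking call such as a synchronous HTTP request.
    time.sleep(seconds)
    return seconds


async def main():
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each call is handed to a worker thread; gather awaits all of them together.
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, blocking_io, 0.2) for _ in range(4))
        )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

Because the four sleeps run in separate threads, the total time should be close to a single sleep rather than the sum of all four. The number of worker threads caps how many blocking calls actually run at once.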

Example 1: network requests with requests

import asyncio
import time
import requests

start1 = time.time()
async def fetch_url(url):
    loop = asyncio.get_running_loop()  # the loop that is running this coroutine
    response = await loop.run_in_executor(None, requests.get, url)
    return response.text

async def main():
    url = "https://www.baidu.com"
    tasks = [fetch_url(url) for i in range(25)]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    print('Time with coroutines:', time.time() - start1)

    start2 = time.time()
    for i in range(25):
        res = requests.get('https://www.baidu.com')
    print('Time with sequential requests:', time.time() - start2)

Result:

Time with coroutines: 0.24301767349243164
Time with sequential requests: 1.5608839988708496

Example 2: passing arguments, with random.randint()

import asyncio
import random

async def generate_random_number():
    loop = asyncio.get_running_loop()
    # Arguments after the function are passed through to it: random.randint(0, 10)
    return await loop.run_in_executor(None, random.randint, 0, 10)

async def main():
    random_number = await generate_random_number()
    print(f"Random number: {random_number}")

asyncio.run(main())

4. The aiohttp library

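To achieve truly asynchronous requests, the request itself must go through a library that supports asynchronous operation. aiohttp is such a library; combined with asyncio, it makes asynchronous requests straightforward.
Installation: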
```
pip3 install aiohttp
```
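Usage

- Usage is analogous to Session in the requests module.
- First, instantiate a session object:
  - async with aiohttp.ClientSession() as session
  - To disable SSL certificate verification:
    - async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session
- Then use the session object to send get, post, and other requests. Most parameters match the requests module; a few special settings are described below.
  - Proxy settings
    - Without authentication:
      - Parameter: proxy='http://ip:port'
    - With authentication:
      - Option 1: proxy='http://username:password@ip:port'
      - Option 2: proxy='http://ip:port', proxy_auth=aiohttp.BasicAuth('username', 'password')
  - Timeout settings
    - Use an aiohttp.ClientTimeout object, for example timeout=aiohttp.ClientTimeout(total=20), which sets the total timeout to 20 seconds.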
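- Concurrency limiting
  - aiohttp can sustain very high concurrency, enough to overwhelm the target site in a short time, so asyncio.Semaphore is used to cap the number of concurrent requests.
  - Instantiate asyncio.Semaphore(CONCURRENCY) with the desired limit (CONCURRENCY), then wrap the request in the crawling method with async with so the semaphore acts as the context manager.
- Response
  - res.status: status code
  - res.headers: response headers
  - await res.text(): response body as a string
  - await res.read(): response body as bytes
  - await res.json(): response body parsed from JSON (usually a dict)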
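
The concurrency limit described above can be sketched without touching the network. In this hypothetical example, fake_fetch stands in for an aiohttp request, CONCURRENCY = 3 is an arbitrary limit chosen for the demo, and a small state dict records how many coroutines were inside the semaphore at the same time.

```python
import asyncio

CONCURRENCY = 3  # assumed limit, chosen only for this demo


async def fake_fetch(i, semaphore, state):
    async with semaphore:  # at most CONCURRENCY coroutines are inside this block at once
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stands in for an aiohttp request
        state["running"] -= 1
        return i


async def main():
    semaphore = asyncio.Semaphore(CONCURRENCY)  # created inside the running loop
    state = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_fetch(i, semaphore, state) for i in range(10))
    )
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(main())
    print(results, "peak concurrency:", peak)
```

All ten coroutines are scheduled at once, but the recorded peak never exceeds the semaphore's limit.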

A full demonstration, storing the results with motor, an asynchronous MongoDB driver:

import random
import asyncio
import aiohttp
import jsonpath

from motor.motor_asyncio import AsyncIOMotorClient
from functools import partial, wraps
from get_ua import ua

def async_retry(func=None, max_times=10, sleep=0.2, default=None):
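    '''
    Retry decorator for asynchronous requests.

    :param func: the coroutine function to wrap
    :param max_times: maximum number of attempts, 10 by default
    :param sleep: delay between attempts in seconds, 0.2 by default
    :param default: value returned after every attempt has failed
    :return: the wrapped coroutine function
    '''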

    if func is None:
        return partial(async_retry, max_times=max_times, sleep=sleep, default=default)

    @wraps(func)
    async def wrap_in(*args, **kwargs):
        for _ in range(max_times):
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                print(f'retry {_ + 1} times, error: ', e)
                await asyncio.sleep(sleep)
        return default

    return wrap_in

class BookHandler:
    def __init__(self):
        self.semaphore = asyncio.Semaphore(5)  # maximum concurrency: 5
        self.all_pages = 10
        self.detail_ids = set()
        self.list_page_url = 'https://spa5.scrape.center/api/book/?limit=18&offset={offset}'  # list page
        self.detail_page_url = 'https://spa5.scrape.center/api/book/{id}/'  # detail page
        self.mongo_con = AsyncIOMotorClient('mongodb://username:password@host:27017/database')
        self.mongo_db = self.mongo_con['database_name']
        self.mongo_col = self.mongo_db['collection_name']

    async def generate_random_number(self):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, random.randint, 0, 10000)

    @async_retry(max_times=10)
    async def fetch(self, url):
        async with self.semaphore:
            headers = {
                'User-Agent': ua.random,
                'Proxy-Tunnel': str(await self.generate_random_number())
            }
            timeout = aiohttp.ClientTimeout(total=10)  # request timeout: 10 seconds
            proxy = 'proxy_address'  # proxy address placeholder
            proxy_auth = aiohttp.BasicAuth('username', 'password')  # proxy credentials
            async with aiohttp.ClientSession() as session:
                async with session.get(url, headers=headers, timeout=timeout, proxy=proxy,
                                       proxy_auth=proxy_auth) as res:
                    if res.status != 200:
                        raise Exception(f'status code error: {res.status}')
                    return await res.json()

    async def parse_list(self, url):
        json_data = await self.fetch(url)
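        # fetch() returns None after all retries fail, and jsonpath returns False when nothing matches
        ids = jsonpath.jsonpath(json_data, '$..id') if json_data else None
        if ids:
            self.detail_ids.update(ids)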

    async def parse_detail(self, url):
        data = await self.fetch(url)
        if data is not None:
            await self.save_data(data)

    async def save_data(self, data):
        if data:
            await self.mongo_col.update_one({'id': data.get('id')}, {'$set': data}, upsert=True)

    async def main(self):
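        # Build one coroutine per list page
        list_tasks = [self.parse_list(self.list_page_url.format(offset=18 * page)) for page in range(self.all_pages)]

        # Process the list pages and collect the detail-page ids
        await asyncio.gather(*list_tasks)

        # Build one coroutine per detail page
        detail_tasks = [self.parse_detail(self.detail_page_url.format(id=id)) for id in self.detail_ids]

        # Process all detail pages; gather accepts coroutine objects directly,
        # whereas asyncio.wait rejects them since Python 3.11
        await asyncio.gather(*detail_tasks)
        self.mongo_con.close()  # close() is a regular method, not a coroutine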


if __name__ == "__main__":
    if hasattr(asyncio, 'WindowsSelectorEventLoopPolicy'):  # this policy exists only on Windows, where it is needed
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    asyncio.get_event_loop().run_until_complete(BookHandler().main())

Note: on Windows, to avoid the error Cannot connect to host xxx.com:443 ssl:default [The parameter is incorrect] (displayed as [参数错误。] on a Chinese-locale system), add:

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

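In addition, both asyncio.gather() and asyncio.wait() accept multiple tasks. They differ mainly in how the tasks are passed and in what is returned:

The former takes the tasks one by one as positional arguments and returns a list whose elements are the tasks' return values, in the order the tasks were passed.
The latter takes all the tasks as a single list and returns a pair of sets, (done, pending); each result is then read with task.result(). Since Python 3.11 the list must hold Task objects rather than bare coroutine objects.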
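
The difference can be seen side by side in a short sketch. The coroutine square is a hypothetical example function, and asyncio.sleep stands in for real I/O.

```python
import asyncio


async def square(n):
    await asyncio.sleep(0.01)
    return n * n


async def main():
    # gather: awaitables as positional arguments, results returned in input order.
    gathered = await asyncio.gather(square(1), square(2), square(3))

    # wait: one iterable of Task objects, returns two sets (done, pending).
    tasks = [asyncio.ensure_future(square(n)) for n in (1, 2, 3)]
    done, pending = await asyncio.wait(tasks)
    waited = sorted(t.result() for t in done)  # sets are unordered, so sort for display

    return gathered, waited, len(pending)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

gather is usually the simpler choice when only the return values matter; wait is useful when the done and pending sets themselves are needed.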

5. Starting the entry-point function in Python 3.7+

How to start:
```
asyncio.run(main()) # main is the entry-point function
```
This function was added in Python 3.7; the event loop no longer has to be created explicitly.
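
For comparison, here is a minimal sketch that runs the same coroutine both ways. The coroutine main is a trivial placeholder, and the manual version is only a rough equivalent of what asyncio.run does, not its exact implementation.

```python
import asyncio


async def main():
    await asyncio.sleep(0.01)
    return "done"


# Python 3.7+: asyncio.run creates a new event loop, runs main(), then closes the loop.
result_new = asyncio.run(main())

# Roughly equivalent manual version, shown only for comparison.
loop = asyncio.new_event_loop()
try:
    result_old = loop.run_until_complete(main())
finally:
    loop.close()

print(result_new, result_old)
```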

posted @ 2021-05-27 00:08 eliwang
