Async Crawling: aiohttp + Multi-Task Asynchronous Coroutines

1. Introducing the aiohttp module

1.1 Flask as a mock web server

from flask import Flask
import time

app = Flask(__name__)

@app.route('/a')
def a():
    time.sleep(2)  # simulate a slow response: every request takes 2 s
    return 'hello world a'

@app.route('/b')
def b():
    time.sleep(2)
    return 'hello world b'

@app.route('/c')
def c():
    time.sleep(2)
    return 'hello world c'

if __name__ == '__main__':
    # threaded=True makes the dev server handle each request in its own
    # thread, so concurrent clients are not serialized on the server side
    app.run(threaded=True)
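With the mock server running locally (the Flask dev server listens on http://127.0.0.1:5000 by default), every endpoint sleeps for 2 seconds, which sets the baseline for the timing comparisons below. A quick sanity check, a minimal sketch assuming the server above is up:

import time
import requests

start = time.time()
response = requests.get('http://127.0.0.1:5000/a')
# prints the response body and an elapsed time of about 2 s
print(response.text, round(time.time() - start, 2))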

1.2 Crawling the Flask server with an asynchronous crawler

  • Requests issued with the requests module are synchronous (blocking), so wrapping them in coroutines gains nothing:
import requests
import asyncio
import time

start = time.time()
urls = [
    'http://127.0.0.1:5000/a',
    'http://127.0.0.1:5000/b',
    'http://127.0.0.1:5000/c',
]

async def get_page(url):
    print('Downloading:', url)
    # requests.get blocks the whole thread, so the event loop
    # cannot switch to another task while waiting
    response = requests.get(url=url)
    print('Done:', response.text)

tasks = []
for url in urls:
    c = get_page(url)                 # calling the coroutine function creates a coroutine object
    task = asyncio.ensure_future(c)   # wrap it in a Task so the loop can schedule it
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Total time:', end - start)

# Output
	Downloading: http://127.0.0.1:5000/a
	Done: hello world a
	Downloading: http://127.0.0.1:5000/b
	Done: hello world b
	Downloading: http://127.0.0.1:5000/c
	Done: hello world c
	Total time: 6.046119928359985
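The total is about 6 seconds: requests.get blocks the event loop's only thread, so the three coroutines run strictly one after another and the async scaffolding buys nothing. As an aside not covered in the original post, blocking calls can still be made concurrent by pushing them onto a thread pool with loop.run_in_executor; a minimal sketch, assuming the same local Flask server:

import asyncio
import time
import requests

urls = [
    'http://127.0.0.1:5000/a',
    'http://127.0.0.1:5000/b',
    'http://127.0.0.1:5000/c',
]

async def get_page(loop, url):
    # run the blocking requests.get in a worker thread so the loop stays free
    response = await loop.run_in_executor(None, requests.get, url)
    print('Done:', response.text)

async def main():
    loop = asyncio.get_running_loop()
    await asyncio.gather(*(get_page(loop, url) for url in urls))

start = time.time()
asyncio.run(main())
print('Total time:', time.time() - start)  # ~2 s instead of ~6 s

The cleaner fix, though, is an HTTP client that is asynchronous by design, which is where aiohttp comes in.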

2. aiohttp + multi-task asynchronous coroutines

  • Enter aiohttp (pip3 install aiohttp), a module for asynchronous network requests
  • Issue asynchronous requests through a session object of aiohttp's ClientSession type:
import requests
import asyncio
import time
import aiohttp

start = time.time()
urls = [
    'http://127.0.0.1:5000/a',
    'http://127.0.0.1:5000/b',
    'http://127.0.0.1:5000/c',
]

async def get_page(url):
    print('Downloading:', url)
    # requests is synchronous; inside a coroutine we must use an
    # asynchronous request module to hit the url instead
    # response = requests.get(url=url)

    # aiohttp is an asynchronous network request module
    async with aiohttp.ClientSession() as session:
        async with await session.get(url) as response:
            # text() returns the response body as a string
            # read() returns the response body as bytes
            # json() returns the deserialized JSON object
            print('Done:', response.text())

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)


loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Total time:', end - start)

# Output
	Downloading: http://127.0.0.1:5000/a
	Downloading: http://127.0.0.1:5000/b
	Downloading: http://127.0.0.1:5000/c
	Done: <coroutine object ClientResponse.text at 0x7fcb055cc8c0>
	Done: <coroutine object ClientResponse.text at 0x7fcb055ccec0>
	Done: <coroutine object ClientResponse.text at 0x7fcb055ccbc0>
	Total time: 2.0158369541168213
	/Users/daizhe/Desktop/daizhe_study_crawl/多任务异步协程.py:24: RuntimeWarning: coroutine 'ClientResponse.text' was never awaited
	  print('Done:', response.text())
	RuntimeWarning: Enable tracemalloc to get the object allocation traceback
  • The run above raised a RuntimeWarning: with aiohttp, response.text() is itself a coroutine, so we must also await it before using the response data (see the short sketch below);
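To see why, here is a minimal standalone sketch (the function name is illustrative): calling a coroutine function never runs its body; it only creates a coroutine object, and the body executes only once that object is awaited or driven by an event loop.

import asyncio

async def fetch_text():
    return 'data'

coro = fetch_text()   # does NOT run the body; it only creates a coroutine object
print(coro)           # <coroutine object fetch_text at 0x...>

# the body runs only when the coroutine is awaited / run on an event loop
print(asyncio.get_event_loop().run_until_complete(coro))   # data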
import requests
import asyncio
import time
import aiohttp

start = time.time()
urls = [
    'http://127.0.0.1:5000/a',
    'http://127.0.0.1:5000/b',
    'http://127.0.0.1:5000/c',
]

async def get_page(url):
    print('Downloading:', url)
    # requests is synchronous; inside a coroutine we must use an
    # asynchronous request module to hit the url instead
    # response = requests.get(url=url)

    # aiohttp is an asynchronous network request module
    async with aiohttp.ClientSession() as session:
        # get() / post() issue the request
        # headers: request headers
        # params: GET query parameters
        # data: POST body
        # proxy: proxy configuration
        async with await session.get(url) as response:
            # text() returns the response body as a string
            # read() returns the response body as bytes
            # json() returns the deserialized JSON object
            # note: retrieving the response data must be awaited
            page_text = await response.text()
            print('Done:', page_text)

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)


loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Total time:', end - start)

# Output
	Downloading: http://127.0.0.1:5000/a
	Downloading: http://127.0.0.1:5000/b
	Downloading: http://127.0.0.1:5000/c
	Done: hello world a
	Done: hello world b
	Done: hello world c
	Total time: 2.0060369968414307
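Two refinements beyond what this post covers, offered as a sketch rather than the author's code: in practice one ClientSession is shared across all tasks instead of being created per request, and asyncio.run plus asyncio.gather (Python 3.7+) replaces the manual ensure_future / get_event_loop / run_until_complete boilerplate. The headers value here is an arbitrary illustration:

import asyncio
import time
import aiohttp

urls = [
    'http://127.0.0.1:5000/a',
    'http://127.0.0.1:5000/b',
    'http://127.0.0.1:5000/c',
]

async def get_page(session, url):
    # reuse the shared session; headers/params/proxy go in as keyword arguments
    async with session.get(url, headers={'User-Agent': 'demo'}) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # gather runs all coroutines concurrently and returns results in order
        pages = await asyncio.gather(*(get_page(session, url) for url in urls))
        for page in pages:
            print('Done:', page)

start = time.time()
asyncio.run(main())
print('Total time:', time.time() - start)   # still ~2 s for all three pages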