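High-Performance Asynchronous Crawlers

Improving crawler performance

Versions of a crawler:

Single-threaded: synchronous and blocking.
Multithreaded: crawling is I/O-bound, so threads can improve performance; for CPU-bound work, however, overly frequent thread switching may actually reduce it.
Multiprocess: start several processes to make use of multiple cores.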
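The difference between the single-threaded and multithreaded versions can be illustrated with a small sketch. It makes no real requests: time.sleep stands in for the network wait, and the names fake_download, run_sequential, run_threaded and the example.com URLs are invented for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.2):
    # Assumption: time.sleep stands in for waiting on the network
    time.sleep(delay)
    return url


def run_sequential(urls):
    # Single-threaded, blocking: the waits add up one after another
    return [fake_download(u) for u in urls]


def run_threaded(urls, workers=5):
    # With one worker per URL, the threads wait at the same time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, urls))


def timed(fn, urls):
    # Return how long fn(urls) takes, in seconds
    start = time.perf_counter()
    fn(urls)
    return time.perf_counter() - start


urls = [f"http://example.com/page/{i}" for i in range(5)]
sequential_time = timed(run_sequential, urls)
threaded_time = timed(run_threaded, urls)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The threaded run should come out noticeably faster here because the work is pure waiting; with CPU-bound work in place of the sleep, the same change would not be expected to help.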

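How does Python implement async under the hood? Further reading:

Python 3.5 协程原理 (How coroutines work in Python 3.5)

深入理解 Python 异步编程（上） (Understanding Python Asynchronous Programming in Depth, Part 1)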

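Coroutines + async

Use aiohttp, a library that supports asynchronous requests, together with the asyncio module to build an asynchronous, coroutine-based crawler.

Install: pip install aiohttp

Official documentation: https://aiohttp.readthedocs.io/. The library has two parts, a Client and a Server; see the official documentation for details.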

spider

import asyncio
import aiohttp
import time

start = time.time()


async def get(url):
    # async with closes the session and releases the response, even on error
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for', url)
    result = await get(url)
    print('Get response from', url, 'Result:', result)


async def main():
    # Run five requests concurrently and wait for all of them
    await asyncio.gather(*(request() for _ in range(5)))


asyncio.run(main())  # requires Python 3.7+

end = time.time()
print('Cost time:', end - start)  # Cost time: 3.020002603530884

Backend (not asynchronous)

from flask import Flask
import time

app = Flask(__name__)


@app.route('/')
def index():
    time.sleep(3)
    return 'index page'


app.run(threaded=True)
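The spider finishes five requests in about 3 seconds even though each response takes 3 seconds, because every coroutine hands control back to the event loop while it waits. A minimal sketch of that effect, needing no server: asyncio.sleep stands in for the network wait, and the name fake_request and the 0.3-second delay are made up for the illustration.

```python
import asyncio
import time


async def fake_request(i, delay=0.3):
    # Assumption: asyncio.sleep stands in for awaiting an HTTP response.
    # While this coroutine waits, the event loop runs the others.
    await asyncio.sleep(delay)
    return i


async def main():
    start = time.perf_counter()
    # gather runs all five coroutines concurrently and keeps their order
    results = await asyncio.gather(*(fake_request(i) for i in range(5)))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The elapsed time should be close to a single delay rather than five times the delay, which mirrors the roughly 3-second total measured by the spider.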

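Async coroutines + multiprocessing

Use multiple processes to take advantage of multiple cores and speed the crawler up further.

At PyCon 2018, John Reese of Facebook described the respective characteristics of asyncio and multiprocessing and presented a new library he had developed, aiomultiprocess. If you are interested, see: https://www.youtube.com/watch?v=0kXaLh8Fz3k

Install it with: pip install aiomultiprocess

It requires Python 3.6 or later.

Change the code above to: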

# -*- coding:utf-8 -*-
import asyncio
import aiohttp
import time

start = time.time()


async def get(url):
    # async with closes the session and releases the response, even on error
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()


async def request():
    url = 'http://127.0.0.1:34652'
    print('Waiting for', url)
    result = await get(url)
    print('Get response from', url, 'Result:', result)


async def main():
    # Run five requests concurrently and wait for all of them
    await asyncio.gather(*(request() for _ in range(5)))


asyncio.run(main())  # requires Python 3.7+

end = time.time()
print('Cost time:', end - start) # Cost time: 3.031712532043457
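The idea behind this section, several processes that each run their own event loop, can be sketched with the standard library alone. This is not aiomultiprocess's API: ProcessPoolExecutor stands in for its pool, fake_fetch uses asyncio.sleep instead of an aiohttp request, and every helper name here is invented for the example.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


async def fake_fetch(url, delay=0.1):
    # Assumption: asyncio.sleep stands in for an aiohttp request
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl_chunk(urls):
    # All URLs in one chunk are awaited concurrently on this process's loop
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def worker(urls):
    # Entry point of a child process: one event loop per process
    return asyncio.run(crawl_chunk(urls))


def split(items, n):
    # Deal the items round-robin into n chunks
    return [items[i::n] for i in range(n)]


def crawl(urls, processes=2):
    # Each process receives one chunk and crawls it with coroutines
    with ProcessPoolExecutor(max_workers=processes) as pool:
        chunks = list(pool.map(worker, split(urls, processes)))
    return [body for chunk in chunks for body in chunk]


if __name__ == "__main__":
    urls = [f"http://127.0.0.1:5000/?page={i}" for i in range(10)]
    print(len(crawl(urls)))
```

The guard around the last lines matters because child processes may re-import the module. aiomultiprocess packages this pattern behind its own pool, so its documentation is the place to check the exact calls.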

Tips:
Using queue on Python 3.7.2 may cause problems; this is a known bug with Python 3.7.2.


Posted 2019-08-11 15:51 by 写bug的日子
