aiohttp

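aiohttp is an asynchronous HTTP module for Python 3 with both a server side and a client side. Liao Xuefeng's Python 3 tutorial covers the server side. Here I (Junyi) focus on the client side, which is what you use to write crawlers. Writing a crawler with asynchronous coroutines makes it more efficient, because the program can issue other requests while it waits on network I/O.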

1. Installation

pip install aiohttp

2. A single request

import aiohttp
import asyncio
 
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
 
async def main(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)
url = 'http://junyiseo.com'
# asyncio.run() creates the event loop, runs main() and closes the loop
asyncio.run(main(url))

3. Requesting multiple URLs

import aiohttp
import asyncio
 
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
 
async def main(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        print(html)
 
 
async def run_all(urls):
    # Run one main() coroutine per URL concurrently
    await asyncio.gather(*(main(url) for url in urls))
 
# Build several requests
url = "http://junyiseo.com"
asyncio.run(run_all([url, url]))
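
The example above opens a new ClientSession for every request. A common alternative is to share one session across all requests and collect the results. The following is a minimal sketch of that idea; the name fetch_all is hypothetical and not part of aiohttp.

```python
import asyncio
import aiohttp


async def fetch(session, url):
    # Same helper as in the article: GET the URL and return the body as text
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls):
    # One ClientSession is shared by every request, so connections can be reused
    async with aiohttp.ClientSession() as session:
        # gather() runs the coroutines concurrently and returns the results
        # in the same order as the input URLs
        return await asyncio.gather(*(fetch(session, url) for url in urls))
```

It could be called with asyncio.run(fetch_all([url, url])), which returns a list with one page body per URL.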

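4. Other request methods

In the code above we create a ClientSession object named session, then call its get method to obtain a ClientResponse object, named response. get takes one required argument, url, which is the HTTP URL whose source we want to fetch. That completes an asynchronous GET request using coroutines.
aiohttp also supports the other HTTP methods: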

session.post('http://httpbin.org/post', data=b'data')
session.put('http://httpbin.org/put', data=b'data')
session.delete('http://httpbin.org/delete')
session.head('http://httpbin.org/get')
session.options('http://httpbin.org/get')
session.patch('http://httpbin.org/patch', data=b'data')

5. Passing parameters with a request

GET with parameters

params = {'key1': 'value1', 'key2': 'value2'}
async with session.get('http://httpbin.org/get',
                       params=params) as resp:
    expect = 'http://httpbin.org/get?key1=value1&key2=value2'
    assert str(resp.url) == expect

POST with parameters

payload = {'key1': 'value1', 'key2': 'value2'}
async with session.post('http://httpbin.org/post',
                        data=payload) as resp:
    print(await resp.text())

6. Reading the response

resp.status is the HTTP status code.
resp.text() is a coroutine that returns the page content as text.

async with session.get('https://api.github.com/events') as resp:
    print(resp.status)
    print(await resp.text())
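
A crawler usually wants to treat error statuses as failures rather than save an error page. The following is a minimal sketch using raise_for_status() and the encoding argument of text(); the function name fetch_text is hypothetical.

```python
import asyncio
import aiohttp


async def fetch_text(url, encoding=None):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Raises aiohttp.ClientResponseError for 4xx/5xx status codes
            resp.raise_for_status()
            # encoding=None lets aiohttp choose the charset (taken from the
            # Content-Type header when present); pass e.g. "gbk" to override it
            return await resp.text(encoding=encoding)
```

The caller can catch aiohttp.ClientResponseError and inspect its status attribute to decide whether to retry or skip the URL.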

gzip and deflate transfer encodings are decoded for you automatically.

7. Handling JSON requests

async with aiohttp.ClientSession() as session:
    async with session.post(url, json={'test': 'object'}) as resp:
        print(resp.status)

Handling a JSON response

async with session.get('https://api.github.com/events') as resp:
    print(await resp.json())
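
By default resp.json() checks that the response Content-Type is application/json and raises aiohttp.ContentTypeError otherwise. Some servers return JSON labelled as text/plain or text/html. The following sketch, with the hypothetical name fetch_json, disables that check by passing content_type=None.

```python
import asyncio
import aiohttp


async def fetch_json(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # content_type=None skips the Content-Type validation, so JSON
            # served with a wrong MIME type is still parsed
            return await resp.json(content_type=None)
```

Only skip the check for servers you know return JSON; otherwise a parse error will replace the clearer content-type error.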

8. Reading the response as a byte stream (useful for downloads)

async with session.get('https://api.github.com/events') as resp:
    await resp.content.read(10)  # read the first 10 bytes

Downloading and saving a file

with open(filename, 'wb') as fd:
    while True:
        chunk = await resp.content.read(chunk_size)
        if not chunk:
            break
        fd.write(chunk)
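
The fragment above assumes that resp, filename and chunk_size already exist. A self-contained version might look like the sketch below. The function name download and the default chunk size are assumptions. It uses resp.content.iter_chunked(), which is equivalent to the read() loop above.

```python
import asyncio
import aiohttp


async def download(url, filename, chunk_size=64 * 1024):
    """Stream the body of `url` into `filename`; return the bytes written."""
    written = 0
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Fail early on 4xx/5xx instead of saving an error page
            resp.raise_for_status()
            with open(filename, "wb") as fd:
                # iter_chunked() yields at most chunk_size bytes at a time,
                # so the whole file is never held in memory
                async for chunk in resp.content.iter_chunked(chunk_size):
                    fd.write(chunk)
                    written += len(chunk)
    return written
```

Note that fd.write() is a blocking call; for small chunks on a local disk this is usually acceptable in a crawler.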

9. Uploading files

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
 
await session.post(url, data=files)

You can set the filename and content type explicitly:

url = 'http://httpbin.org/post'
data = aiohttp.FormData()
data.add_field('file',
               open('report.xls', 'rb'),
               filename='report.xls',
               content_type='application/vnd.ms-excel')
 
await session.post(url, data=data)

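10. Timeouts

By default every I/O operation has a 5-minute timeout. You can override it with the timeout argument; timeout=None or timeout=0 disables the timeout check, so the request may take as long as it needs.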

async with session.get('https://github.com', timeout=aiohttp.ClientTimeout(total=60)) as r:
    ...
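
When the timeout expires, aiohttp raises asyncio.TimeoutError. The following sketch sets a session-wide timeout with aiohttp.ClientTimeout and turns a timeout into a None result; the function name fetch_with_timeout and the choice to return None are assumptions for illustration.

```python
import asyncio
import aiohttp


async def fetch_with_timeout(url, seconds):
    # total covers the whole operation: connecting, sending and reading
    timeout = aiohttp.ClientTimeout(total=seconds)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(url) as resp:
                return await resp.text()
        except asyncio.TimeoutError:
            # The request took longer than `seconds`; report it as None
            return None
```

A crawler can use the None result to schedule a retry instead of letting one slow host stall the whole run.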

11. Custom request headers

url = 'http://example.com/image'
payload = (b'GIF89a\x01\x00\x01\x00\x00\xff\x00,\x00\x00'
           b'\x00\x00\x01\x00\x01\x00\x00\x02\x00;')
headers = {'content-type': 'image/gif'}
 
await session.post(url,
                   data=payload,
                   headers=headers)

Setting headers on the session

headers={"Authorization": "Basic bG9naW46cGFzcw=="}
async with aiohttp.ClientSession(headers=headers) as session:
    async with session.get("http://httpbin.org/headers") as r:
        json_body = await r.json()
        assert json_body['headers']['Authorization'] == \
            'Basic bG9naW46cGFzcw=='

12. Custom cookies

url = 'http://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}
async with aiohttp.ClientSession(cookies=cookies) as session:
    async with session.get(url) as resp:
        assert await resp.json() == {
           "cookies": {"cookies_are": "working"}}

Sharing cookies across requests

async with aiohttp.ClientSession() as session:
    await session.get(
        'http://httpbin.org/cookies/set?my_cookie=my_value')
    filtered = session.cookie_jar.filter_cookies(
        'http://httpbin.org')
    assert filtered['my_cookie'].value == 'my_value'
    async with session.get('http://httpbin.org/cookies') as r:
        json_body = await r.json()
        assert json_body['cookies']['my_cookie'] == 'my_value'

13. Limiting the number of simultaneous connections

limit defaults to 100; limit=0 means unlimited.

conn = aiohttp.TCPConnector(limit=30)
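
The connector only takes effect once it is passed to a session. The following sketch shows one way to wire it up; the names fetch_status and crawl are hypothetical.

```python
import asyncio
import aiohttp


async def fetch_status(session, url):
    async with session.get(url) as resp:
        return resp.status


async def crawl(urls, limit=30):
    # At most `limit` connections are open at once; extra requests wait
    # until a connection becomes free
    connector = aiohttp.TCPConnector(limit=limit)
    # The session owns the connector and closes it when the block exits
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch_status(session, u) for u in urls))
```

All coroutines are still started together by gather(); the connector simply makes the surplus requests queue for a free connection.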

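14. SSL requests

aiohttp verifies HTTPS certificates by default. If a request fails verification (for example, because of a self-signed certificate), you can pass ssl=False to skip the check.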

r = await session.get('https://example.com', ssl=False)

Using a custom certificate

import ssl

sslcontext = ssl.create_default_context(
   cafile='/path/to/ca-bundle.crt')
r = await session.get('https://example.com', ssl=sslcontext)

15. Proxies

async with aiohttp.ClientSession() as session:
    async with session.get("http://python.org",
                           proxy="http://proxy.com") as resp:
        print(resp.status)

Proxy authentication

async with aiohttp.ClientSession() as session:
    proxy_auth = aiohttp.BasicAuth('user', 'pass')
    async with session.get("http://python.org",
                           proxy="http://proxy.com",
                           proxy_auth=proxy_auth) as resp:
        print(resp.status)

Or pass the credentials in the proxy URL

session.get("http://python.org",
            proxy="http://user:pass@some.proxy.com")

16. Graceful shutdown

Without SSL, run asyncio.sleep(0) once before closing the loop so the underlying connections can close.

async def read_website():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://example.org/') as resp:
            await resp.read()
 
loop = asyncio.new_event_loop()
loop.run_until_complete(read_website())
# Zero-sleep to allow underlying connections to close
loop.run_until_complete(asyncio.sleep(0))
loop.close()

For SSL requests, wait a short while before closing the loop

loop.run_until_complete(asyncio.sleep(0.250))
loop.close()

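17. Summary

This article is translated from the official documentation. If you have questions, leave a comment.

Official documentation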
http://aiohttp.readthedocs.io/en/stable/

posted @ 2019-04-12 15:00  逐梦~前行