Crawler Performance: Using Asynchronous IO for Concurrency
What is asynchronous IO?
Before discussing asynchronous IO, let's first clarify a few concepts: what is synchronous, what is asynchronous, and what is a coroutine.
Synchronous vs. asynchronous describes the notification mechanism for a message, i.e. how the caller learns about the result.
Synchronous means that when a function call is issued, the call does not return until the result has been obtained. By that definition, most functions are in fact synchronous calls (e.g. sin, isdigit).
Generally, though, when we speak of synchronous and asynchronous we mean tasks that require cooperation from other components or take some time to complete.
The most common example is SendMessage. This function sends a message to a window and does not return until the other side has finished processing the message; only then does it hand the value returned by the message handler back to the caller.
Asynchronous is the opposite of synchronous. When an asynchronous call is issued, the caller does not get the result immediately. Instead, the component that actually handles the call reports back after the call has been issued, either through status flags and notifications, or by running a callback function.
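As a small illustration (added here, not part of the original text), this pattern can be sketched with Python's concurrent.futures: the caller submits a piece of work, registers a callback, and immediately continues; the executor invokes the callback once the result is ready. The slow_task function and its 2-second delay are made up for the example.

from concurrent.futures import ThreadPoolExecutor
import time


def slow_task(n):
    # stands in for any operation that takes a while to produce a result
    time.sleep(n)
    return 'result after %s seconds' % n


def on_done(future):
    # called by the executor once the task has finished
    print('callback got:', future.result())


pool = ThreadPoolExecutor(2)
future = pool.submit(slow_task, 2)   # returns immediately, result not ready yet
future.add_done_callback(on_done)    # the caller will be notified later
print('caller keeps doing other work...')
pool.shutdown(wait=True)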
What is concurrency? What is parallelism?
Concurrency and parallelism can be contrasted in several ways:
- Explanation 1: parallelism means two or more events happen at the same instant; concurrency means two or more events happen within the same time interval.
- Explanation 2: parallelism is multiple events on different entities; concurrency is multiple events on the same entity.
- Explanation 3: concurrency is handling multiple tasks "simultaneously" on a single processor (by interleaving them), while parallelism is handling multiple tasks at the same time on multiple processors, e.g. a Hadoop distributed cluster.
The goal of concurrent programming is therefore to make full use of every processor core so as to reach the highest possible processing performance.
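As a rough, illustrative sketch (not from the original text; fib is a made-up CPU-bound stand-in), the two executor types in concurrent.futures map onto the two ideas: a thread pool interleaves tasks within one Python process (concurrency), while a process pool can spread work across all the cores reported by os.cpu_count() (parallelism).

import os
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def fib(n):
    # deliberately CPU-bound work
    return n if n < 2 else fib(n - 1) + fib(n - 2)


if __name__ == '__main__':
    print('cores available:', os.cpu_count())

    # concurrency: tasks interleave inside a single process
    with ThreadPoolExecutor(4) as pool:
        print(list(pool.map(fib, [25, 25, 25, 25])))

    # parallelism: tasks run at the same time on multiple cores
    with ProcessPoolExecutor(os.cpu_count()) as pool:
        print(list(pool.map(fib, [25, 25, 25, 25])))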
What is blocking? What is non-blocking?
A blocking call means the current thread is suspended until the result of the call is returned; the function returns only once it has the result.
Non-blocking is the opposite: if the result cannot be obtained immediately, the function does not block the current thread but returns right away.
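To make the difference concrete, here is a minimal sketch at the socket level (an illustrative addition; the host is just an example): a blocking connect suspends the caller until it completes, while after setblocking(False) the same call returns immediately, typically raising BlockingIOError, and readiness has to be checked later (for example with select).

import socket

# blocking: connect() does not return until the connection is established (or fails)
sk1 = socket.socket()
sk1.connect(('www.baidu.com', 80))
sk1.close()

# non-blocking: connect() returns at once; completion must be checked later
sk2 = socket.socket()
sk2.setblocking(False)
try:
    sk2.connect(('www.baidu.com', 80))
except BlockingIOError:
    # expected: the connection is still being set up
    pass
sk2.close()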
Crawler performance
When writing a crawler, the main cost is in IO requests: in single-process, single-thread mode, every URL request inevitably waits for the response, which slows down the whole job.
1. Synchronous execution
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)
Characteristics: requests run one after another; the next request is issued only after the previous one has completed.
2. Multi-threaded execution
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)  # thread pool size
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
Characteristics: threads have low overhead and switch quickly, but programming and debugging are relatively complex.
3. Multi-threaded execution with a callback
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)  # invoked automatically once the request finishes
pool.shutdown(wait=True)
4. Multi-process execution
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
Characteristics: processes are simple to program and debug and are highly reliable, but creating and destroying them is expensive.
5. Multi-process execution with a callback
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result())

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ProcessPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
All of the approaches above improve request performance. The drawback of multi-threading and multi-processing is that threads and processes sit idle while blocked on IO, which wastes them, so asynchronous IO is the preferred choice:
What is asynchronous IO?
When an asynchronous call is issued, the caller cannot get the result immediately; the component that actually handles the call notifies the caller via status, notification, or callback once it is done. For example, in a CPU-intensive application some of the data to be processed may sit on disk. Since the location of that data is known in advance, an asynchronous IO read request can be issued ahead of time, and the program only waits for it when the data is actually needed. With asynchronous IO, the program can keep doing other work between issuing the IO request and actually using the data.
In Python there are quite a few ways to implement asynchronous IO; examples follow.
asyncio:

import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
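The @asyncio.coroutine decorator used above is the old generator-based style and has been deprecated in newer Python releases; a minimal equivalent using the async/await syntax (Python 3.5+) would look roughly like this:

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(func1(), func1()))
loop.close()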
asyncio + aiohttp:

import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)
    response = yield from aiohttp.request('GET', url)
    # data = yield from response.read()
    # print(url, data)
    print(url, response)
    response.close()


tasks = [fetch_async('http://www.google.com/'), fetch_async('http://www.chouti.com/')]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()
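Note that aiohttp.request('GET', url) with yield from reflects an old aiohttp API; with current aiohttp releases (3.x) the request normally goes through a ClientSession, roughly as sketched below (same URLs as above; details may vary between versions):

import asyncio
import aiohttp


async def fetch_async(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.read()
            print(url, len(data))


tasks = [fetch_async('http://www.google.com/'), fetch_async('http://www.chouti.com/')]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()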
asyncio + requests:

import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
gevent + requests:

from gevent import monkey

monkey.patch_all()  # patch the standard library before importing requests

import gevent
import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### send requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])

# ##### send requests (a coroutine pool caps the number of greenlets) #####
# from gevent.pool import Pool
# pool = Pool(None)
# gevent.joinall([
#     pool.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     pool.spawn(fetch_async, method='get', url='https://www.github.com/', req_kwargs={}),
# ])
grequests:

import grequests


request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]


# ##### execute and collect the list of responses #####
# response_list = grequests.map(request_list)
# print(response_list)


# ##### execute and collect the responses (with exception handling) #####
# def exception_handler(request, exception):
#     print(request, exception)
#     print("Request failed")

# response_list = grequests.map(request_list, exception_handler=exception_handler)
# print(response_list)
Twisted:

from twisted.web.client import getPage
from twisted.internet import reactor

REV_COUNTER = 0
REQ_COUNTER = 0


def callback(contents):
    print(contents)

    global REV_COUNTER
    REV_COUNTER += 1
    if REV_COUNTER == REQ_COUNTER:
        # stop the event loop once every request has been answered
        reactor.stop()


url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
REQ_COUNTER = len(url_list)
for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
reactor.run()
Twisted (inlineCallbacks):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.internet import defer
from twisted.web.client import getPage
from twisted.internet import reactor


@defer.inlineCallbacks
def task(url):
    while url:
        ret = getPage(bytes(url, encoding='utf8'))
        ret.addCallback(one_done)
        url = yield ret


i = 0


def one_done(arg):
    global i
    i += 1
    if i == 10:
        return
    print('one', arg)
    return 'http://www.cnblogs.com'


@defer.inlineCallbacks
def task_list():
    start_url_list = [
        'http://www.cnblogs.com',
    ]
    defer_list = []
    for url in start_url_list:
        deferObj = task(url)
        defer_list.append(deferObj)
    yield defer.DeferredList(defer_list)


def all_done(arg):
    print('done', arg)
    reactor.stop()


if __name__ == '__main__':
    d = task_list()
    print(type(d))
    d.addBoth(all_done)
    reactor.run()
Tornado:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)
    # same idea as the Twisted example: stop the loop after the last response
    # ioloop.IOLoop.current().stop()


def func():
    url_list = [
        'http://www.google.com',
        'http://127.0.0.1:8000/test2/',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
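Note that recent Tornado releases (6.x) removed the callback argument from fetch; a rough coroutine-style sketch of the same idea (URLs reused from earlier examples; details may differ by version):

from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient


async def fetch_all():
    client = AsyncHTTPClient()
    for url in ['http://www.google.com', 'http://www.bing.com']:
        try:
            response = await client.fetch(url)
            print(url, len(response.body))
        except Exception as e:
            print(url, 'Error:', e)


ioloop.IOLoop.current().run_sync(fetch_all)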
All of the above are asynchronous IO request modules, built into Python or provided by third-party packages; they are easy to use and greatly improve efficiency. Under the hood, an asynchronous IO request boils down to a non-blocking socket plus IO multiplexing:
import socket
import select


class HttpRequest:
    def __init__(self, sk, host, callback):
        self.socket = sk
        self.host = host
        self.callback = callback

    def fileno(self):
        # select() watches the socket through this file descriptor
        return self.socket.fileno()


class AsyncRequest:
    def __init__(self):
        self.conn = []        # all sockets still waiting for a response
        self.connection = []  # used to detect whether the connection has completed

    def add_request(self, host, callback):
        try:
            sk = socket.socket()
            sk.setblocking(0)
            sk.connect((host, 80,))
        except BlockingIOError as e:
            pass  # expected: the non-blocking connect is still in progress
        request = HttpRequest(sk, host, callback)
        self.conn.append(request)
        self.connection.append(request)

    def run(self):
        while True:
            rlist, wlist, elist = select.select(self.conn, self.connection, self.conn, 0.05)
            for w in wlist:
                # reaching this loop means the socket has connected to the server
                print(w.host, 'connected...')
                tpl = "GET / HTTP/1.0\r\nHost:%s\r\n\r\n" % (w.host,)
                w.socket.send(bytes(tpl, encoding='utf-8'))
                self.connection.remove(w)
            for r in rlist:
                # r is an HttpRequest whose socket has data to read
                recv_data = bytes()
                while True:
                    try:
                        chunk = r.socket.recv(8096)
                        if not chunk:
                            break  # peer closed the connection
                        recv_data += chunk
                    except Exception as e:
                        break  # no more data available right now
                r.callback(recv_data)
                r.socket.close()
                self.conn.remove(r)
            if len(self.conn) == 0:
                break


def f1(data):
    print('save to file', data)


def f2(data):
    print('save to database', data)


url_list = [
    {'host': 'www.baidu.com', 'callback': f1},
    {'host': 'cn.bing.com', 'callback': f2},
    {'host': 'www.cnblogs.com', 'callback': f2},
]

req = AsyncRequest()
for item in url_list:
    req.add_request(item['host'], item['callback'])
req.run()