Day 10: Concurrent Web Crawling with gevent Coroutines
Overview
Earlier we saw that gevent switches automatically whenever it hits an I/O operation. Now let's use gevent to fetch some real web pages.
Fetching pages serially
    from urllib import request
    import time


    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)  # send the request
        data = resp.read()           # read the fetched data
        # Every URL is written to the same file, so each page overwrites the last.
        with open("url.html", "wb") as fh:
            fh.write(data)
        print('%d bytes received from %s.' % (len(data), url))


    urls = ['https://www.python.org',
            'https://www.yahoo.com',
            'https://github.com']

    time_start = time.time()
    for i in urls:
        f(i)
    print("sync cost:", time.time() - time_start)

    # Output
    GET: https://www.python.org
    48893 bytes received from https://www.python.org.
    GET: https://www.yahoo.com
    505354 bytes received from https://www.yahoo.com.
    GET: https://github.com
    51489 bytes received from https://github.com.
    sync cost: 3.7278189659118652
Fetching pages with gevent coroutines
    from urllib import request
    import time

    import gevent


    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)
        data = resp.read()
        with open("url.html", "wb") as fh:
            fh.write(data)
        print('%d bytes received from %s.' % (len(data), url))


    async_time_start = time.time()
    gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
    ])
    print("async cost:", time.time() - async_time_start)  # elapsed time

    # Output
    GET: https://www.python.org/
    48893 bytes received from https://www.python.org/.
    GET: https://www.yahoo.com/
    504992 bytes received from https://www.yahoo.com/.
    GET: https://github.com/
    51489 bytes received from https://github.com/.
    async cost: 4.873842000961304
Comparing the two timings above, the "concurrent" version is no faster than the serial one; in this run it was actually slower. Why?
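By default, urllib knows nothing about gevent. When you call it from a greenlet, its network calls go through the standard blocking socket module, and gevent cannot detect that I/O. Since gevent never sees an I/O operation, it never switches, so the greenlets simply run one after another. The same is true of the plain socket module we used earlier: handing it to gevent unmodified doesn't work, because every call blocks the whole thread. Both versions are therefore serial, and the fetch times come out about the same.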
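The effect described above can be reproduced without any network access. The following is a minimal sketch, assuming `time.sleep` as a stand-in for a blocking urllib call and `gevent.sleep` as a stand-in for I/O that gevent can see. The names `blocking_task`, `cooperative_task` and `timed` are made up for this illustration and do not come from the original code.

```python
import time

import gevent


def blocking_task(delay):
    # Unpatched time.sleep blocks the whole thread, so gevent cannot switch.
    time.sleep(delay)


def cooperative_task(delay):
    # gevent.sleep yields to the event loop, so other greenlets can run.
    gevent.sleep(delay)


def timed(task, n=3, delay=0.2):
    """Run n greenlets of `task` and return the total elapsed seconds."""
    start = time.perf_counter()
    gevent.joinall([gevent.spawn(task, delay) for _ in range(n)])
    return time.perf_counter() - start


if __name__ == "__main__":
    print("blocking   :", timed(blocking_task))
    print("cooperative:", timed(cooperative_task))
```

With three greenlets, the blocking variant should take roughly the sum of the three delays, while the cooperative variant should take roughly one delay, which mirrors the difference between unpatched and gevent-aware I/O.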
So how do we let gevent know that urllib is performing I/O? Apply the monkey patch by calling monkey.patch_all().
    from gevent import monkey
    # Patch as early as possible: this swaps the blocking standard-library
    # I/O functions for gevent's cooperative versions.
    monkey.patch_all()

    from urllib import request
    import time

    import gevent


    def f(url):
        print('GET: %s' % url)
        resp = request.urlopen(url)
        data = resp.read()
        with open("url.html", "wb") as fh:
            fh.write(data)
        print('%d bytes received from %s.' % (len(data), url))


    async_time_start = time.time()
    gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
    ])
    print("async cost:", time.time() - async_time_start)  # elapsed time

    # Output
    GET: https://www.python.org/
    GET: https://www.yahoo.com/
    GET: https://github.com/
    48893 bytes received from https://www.python.org/.
    51489 bytes received from https://github.com/.
    504160 bytes received from https://www.yahoo.com/.
    async cost: 1.559157133102417
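Analysis: the patch replaces the blocking parts of the standard library that urllib relies on (socket, ssl, time.sleep, and so on) with gevent's cooperative equivalents. You can think of it as putting a marker in front of every place that might perform I/O, where the marker behaves like gevent.sleep(): it hands control back to gevent. urllib's network calls thus become points where a greenlet waits cooperatively, and as soon as one coroutine waits, gevent switches to another.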
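One way to see what the patch does is to check a standard-library call before trusting it with real traffic. This is a small sketch, assuming `time.sleep` as the blocking call under test; `sleeper` and `run` are hypothetical helper names. It uses `monkey.is_module_patched`, which reports whether gevent has patched a given module.

```python
from gevent import monkey
monkey.patch_all()  # must run before the code that does blocking I/O

import time

import gevent


def sleeper(delay):
    # After patch_all(), time.sleep yields to gevent instead of blocking.
    time.sleep(delay)


def run(n=3, delay=0.2):
    """Run n sleeping greenlets and return the total elapsed seconds."""
    start = time.perf_counter()
    gevent.joinall([gevent.spawn(sleeper, delay) for _ in range(n)])
    return time.perf_counter() - start


if __name__ == "__main__":
    print("socket patched:", monkey.is_module_patched("socket"))
    print("elapsed:", run())
```

The function body still calls plain `time.sleep`, yet the three greenlets should now overlap instead of running back to back. The same substitution is what makes `request.urlopen` cooperative in the listing above.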
Multi-socket concurrency in a single thread with gevent
Server
    import gevent
    from gevent import socket, monkey

    monkey.patch_all()


    def handle_request(conn):
        try:
            while True:
                data = conn.recv(1024)  # receive data sent by the client
                if not data:            # empty read: the client closed the connection
                    break
                print("recv:", data)
                conn.sendall(data)      # echo it back
        except Exception as ex:
            print(ex)
        finally:
            conn.close()


    def server(port):
        s = socket.socket()
        s.bind(('0.0.0.0', port))
        s.listen(500)  # listen for incoming TCP connections
        while True:
            # cli is the connection object the server creates for each client
            cli, addr = s.accept()
            gevent.spawn(handle_request, cli)  # one coroutine per connection


    if __name__ == '__main__':
        server(8001)

    # Output
    recv: b'huwei'
    recv: b'123'
Client
    import socket

    HOST = 'localhost'  # The remote host
    PORT = 8001         # The same port as used by the server

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create the client socket
    s.connect((HOST, PORT))                                # connect to the server
    try:
        while True:
            msg = input(">>:").encode("utf8")
            if not msg:
                continue             # an empty send would leave recv() waiting forever
            s.sendall(msg)           # send data to the server
            data = s.recv(1024)      # receive the echoed data
            print('Received', data)
    finally:
        s.close()

    # Output
    >>:huwei
    Received b'huwei'
    >>:123
    Received b'123'
    >>:
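To try the echo pattern without opening two terminals, the server and several clients can run as greenlets in one process. The following is a rough sketch under a few assumptions: it uses `gevent.socket` directly instead of monkey patching, binds to an OS-assigned port on 127.0.0.1, and introduces the hypothetical helpers `handle`, `serve`, `client` and `demo`. It also assumes each short message arrives in a single `recv` call, which is typical on localhost but not guaranteed by TCP.

```python
import gevent
from gevent import socket  # cooperative sockets; no monkey patching needed


def handle(conn):
    # Echo everything back until the client closes the connection.
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()


def serve(listener):
    # Accept connections forever, one greenlet per client.
    while True:
        conn, _addr = listener.accept()
        gevent.spawn(handle, conn)


def client(port, msg):
    # Connect, send one message, and return the echoed reply.
    s = socket.create_connection(("127.0.0.1", port))
    try:
        s.sendall(msg)
        return s.recv(1024)
    finally:
        s.close()


def demo(messages):
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    port = listener.getsockname()[1]
    server = gevent.spawn(serve, listener)
    jobs = [gevent.spawn(client, port, m) for m in messages]
    gevent.joinall(jobs, timeout=5)
    server.kill()  # stop the accept loop
    listener.close()
    return [job.value for job in jobs]


if __name__ == "__main__":
    print(demo([b"huwei", b"123"]))
```

All clients and the server share a single thread here; whenever one greenlet waits on `accept`, `recv` or `connect`, gevent switches to another that is ready to run.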