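Python: eventlet
How "green threads" in eventlet differ from ordinary threads:
1. Green threads are so cheap they are nearly free. You do not need to conserve them the way you would ordinary threads; in general there is at least one green thread per network connection.
2. Green threads yield the CPU to each other cooperatively instead of being preempted. They can therefore share data structures without explicit locking, because another green thread can touch a shared structure only after the running green thread has explicitly yielded control.
The diagram below shows how coroutines (green threads), the hub, threads, and processes relate in eventlet: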
 _______________________________________
| python process                        |
|   _________________________________   |
|  | python thread                   |  |
|  |   _____   ___________________   |  |
|  |  | hub | | pool              |  |  |
|  |  |_____| |   _____________   |  |  |
|  |          |  | greenthread |  |  |  |
|  |          |  |_____________|  |  |  |
|  |          |   _____________   |  |  |
|  |          |  | greenthread |  |  |  |
|  |          |  |_____________|  |  |  |
|  |          |   _____________   |  |  |
|  |          |  | greenthread |  |  |  |
|  |          |  |_____________|  |  |  |
|  |          |                   |  |  |
|  |          |        ...        |  |  |
|  |          |___________________|  |  |
|  |                                 |  |
|  |_________________________________|  |
|                                       |
|   _________________________________   |
|  | python thread                   |  |
|  |_________________________________|  |
|   _________________________________   |
|  | python thread                   |  |
|  |_________________________________|  |
|                                       |
|                 ...                   |
|_______________________________________|
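A green thread lives inside a single OS thread. Green threads in the same thread run one at a time; to let them interleave, the developer must explicitly yield the CPU wherever the code would block. At that point the hub takes over scheduling and picks another runnable green thread in the same thread. Because green threads are a per-thread concept, they cannot be synchronized across threads.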
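The cooperative hand-off described above can be modelled with plain generators. The sketch below is a toy illustration only, not eventlet's hub (which is built on greenlets): the names worker and run are hypothetical, and each bare yield stands in for an explicit yield point such as eventlet.sleep(0).

```python
from collections import deque

def worker(name, steps, log):
    """A task that records its progress and yields control after every step."""
    for i in range(steps):
        log.append((name, i))
        yield  # explicit yield point: control goes back to the scheduler

def run(tasks):
    """Round-robin scheduler: resume each task in turn until all are finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task up to its next yield
        except StopIteration:
            continue            # task finished, drop it
        ready.append(task)      # task yielded, queue it again

log = []
run([worker('a', 2, log), worker('b', 2, log)])
print(log)
```

Only one task runs at any moment, and a switch can happen only at a yield, which is why shared state such as log needs no lock.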
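Basic eventlet API
I. Spawning green threads
eventlet.spawn(func, *args, **kw)
Creates a green thread that calls func with the arguments *args and **kw. Spawning several green threads runs their tasks concurrently (interleaved, not truly in parallel). The function returns a greenthread.GreenThread object, whose wait() method returns the return value of func.
eventlet.spawn_n(func, *args, **kw)
Works like spawn(), except that there is no way to retrieve the return value of func or any exception it raises. It executes faster.
eventlet.spawn_after(seconds, func, *args, **kw)
Works like spawn(), but the spawn happens after seconds seconds have elapsed. Calling GreenThread.cancel() on the return value cancels the spawn and prevents func from being called, provided func has not started yet.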
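II. Controlling green threads
eventlet.sleep(seconds=0)
Suspends the current green thread so that other green threads can run.
class eventlet.GreenPool
A pool of green threads that bounds concurrency. Use it to cap the memory the concurrent work consumes, or to limit the number of connections opened by one part of the code.
class eventlet.GreenPile
A GreenPile object represents a chunk of work. It is an iterator that you fill with tasks and later read the results back from.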
class eventlet.Queue
A basic building block for exchanging data between execution units; use it for communication between green threads.
class eventlet.Timeout
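Adds a timeout to anything: after the given number of seconds it raises exception. If exception is omitted or None, the Timeout instance itself is raised. Timeout instances are context managers, so they can be used in a with statement.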
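III. Patching functions
eventlet.import_patched(modulename, *additional_modules, **kw_additional_modules)
Imports a greened version of a standard-library module so that the code using it runs without blocking. The only required argument is the name of the target module; see Import Green for details.
eventlet.monkey_patch(all=True, os=False, select=False, socket=False, thread=False, time=False)
Globally patches the specified system modules so that they become green-thread friendly. The keyword arguments select which modules to patch. If all is true, every module is patched regardless of the other arguments; otherwise each module-specific argument controls whether that module is patched. Most arguments patch the single module of the same name (os, time, select). The exceptions are socket, which also patches the ssl module when it is present, and thread, which patches the thread, threading, and Queue modules.
It is safe to call monkey_patch() more than once; see Monkeypatching the Standard Library.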
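IV. Network applications
eventlet.connect(addr, family=2, bind=None)
Opens a client socket.
Parameters:
- addr: the address of the target server. For TCP sockets this is a (host, port) tuple.
- family: the socket family, optional; see the socket documentation.
- bind: the local address to bind to, optional.
Returns:
The connected green socket object.
eventlet.listen(addr, family=2, backlog=50)
Creates a listening socket that can be passed to serve() or used in a custom accept() loop. listen() sets SO_REUSEADDR on the socket, which avoids the usual annoyance of address-in-use errors when the server is restarted.
Parameters:
- addr: the address to listen on. For TCP sockets this is a (host, port) tuple.
- family: the socket family.
- backlog: the maximum number of queued connections; at least 1, with the upper bound set by the system.
Returns:
The listening green socket object.
eventlet.wrap_ssl(sock, *a, **kw)
Converts a regular socket into an SSL socket, with the same interface as ssl.wrap_socket(). PyOpenSSL can be used, but in that case the cert_reqs, ssl_version, ca_certs, do_handshake_on_connect, and suppress_ragged_eofs arguments are ignored.
The recommended idiom is to call it directly on the creation function, for example wrap_ssl(connect(addr)) or wrap_ssl(listen(addr), server_side=True). That way no "naked" socket is left around to accidentally carry non-SSL traffic.
Returns:
A green SSL object.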
eventlet.serve(sock, handle, concurrency=1000)
Runs a server on the given socket. For every incoming client connection, handle is called in its own green thread. handle takes two arguments: the client socket object and the client address:
    import eventlet

    def myhandle(client_sock, client_addr):
        print("client connected", client_addr)

    eventlet.serve(eventlet.listen(('127.0.0.1', 9999)), myhandle)
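The client socket is closed when handle returns.
serve() blocks the calling green thread and does not return until the server shuts down. If the caller must continue immediately, spawn a new green thread for serve().
Any uncaught exception raised by handle is re-raised from serve() and terminates the server, so be aware of which exceptions your application can raise. The return value of handle is ignored.
Raise a StopServe exception to shut the server down gracefully; that is the only way to make serve() return rather than raise.
The concurrency parameter caps the number of green threads handling requests at any one time. When the server reaches this limit it stops accepting new connections until an existing one completes.
class eventlet.StopServe
Exception class used to exit serve() gracefully.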
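V. Greening the world
"Greening" means making the Python environment work with the green-thread execution model. The stock standard library does not support eventlet's model, in which green threads cooperatively yield the CPU to one another, so the eventlet developers rewrote parts of the standard library (they call these "patches"). To use eventlet in an application, you must explicitly green the modules you import.
Method 1: from eventlet.green import ...
The first method is to import the modules you need from the eventlet.green package. The networking modules in eventlet.green have the same names and interfaces as their standard-library counterparts, but they have been greened and therefore cooperate with green threads. For example: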
    from eventlet.green import socket
    from eventlet.green import threading
    from eventlet.green import asyncore
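Method 2: import_patched()
If eventlet.green lacks the module you need, use import_patched(). It imports and greens the module whose name is given as its argument:
eventlet.patcher.import_patched(module_name, *additional_modules, **kw_additional_modules)
Imports a module in a greened manner, so that any networking libraries the module uses are replaced with their green versions. For example, if the imported module uses socket, then after import_patched() it uses the green socket module rather than the stock Python one.
One weakness of this method is that it does not handle late imports correctly.
One advantage of this method is that *additional_modules and **kw_additional_modules let you specify exactly which modules are greened, for example: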
    import eventlet
    from eventlet.green import socket
    from eventlet.green import SocketServer

    BaseHTTPServer = eventlet.import_patched('BaseHTTPServer',
                                             ('socket', socket),
                                             ('SocketServer', SocketServer))
    # Equivalent keyword form:
    # BaseHTTPServer = eventlet.import_patched('BaseHTTPServer',
    #                                          socket=socket, SocketServer=SocketServer)
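Here only the socket and SocketServer modules referenced by BaseHTTPServer are greened. The commented-out call is equivalent to the import_patched() call above it.
Method 3: monkey patching
Monkey patching in eventlet modifies existing code at run time, dynamically replacing parts of the standard library: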
eventlet.patcher.monkey_patch(os=None, select=None, socket=None, thread=None, time=None, psycopg=None)
Called with no arguments, it patches every library listed in its default arguments:
    import eventlet
    eventlet.monkey_patch()
The keyword arguments select which modules are patched, with the same rules as for monkey_patch() in the patching-functions section: socket also patches ssl when it is present, and thread also patches threading and Queue. For example, to patch only socket and select:
    import eventlet
    eventlet.monkey_patch(socket=True, select=True)
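Call monkey_patch() as early as possible in the application, ideally as the first line of the main module. This avoids situations such as defining a subclass whose parent class needs patching before the parent's module has been monkey patched.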
eventlet.patcher.is_monkey_patched(module)
Returns whether the given module has been monkey patched.
VI. Eventlet usage examples
The following examples come from the official documentation; each is briefly explained here.
1. Client-side web crawler
    import eventlet
    from eventlet.green import urllib2

    urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
            "https://wiki.secondlife.com/w/images/secondlife.jpg",
            "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

    def fetch(url):
        return urllib2.urlopen(url).read()

    pool = eventlet.GreenPool()
    for body in pool.imap(fetch, urls):
        print("got body", len(body))
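The from eventlet.green import urllib2 statement imports the greened urllib2, which is identical to the standard-library module except that it uses green sockets.
pool = eventlet.GreenPool() creates a green thread pool, here with the default capacity of 1000. The pool bounds concurrency and therefore caps memory consumption.
The for loop iterates over the results of the fetch calls. imap runs the calls to fetch concurrently but yields the results in the same order as the input URLs.
The key point of this example is that the client starts several green threads that collect the crawl results concurrently, and because the pool caps memory use, even a very large URL list will not consume excessive memory.
1.1 A slightly more complete client-side crawler
This example is similar to example 1.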
    #!/usr/bin/env python
    import eventlet
    from eventlet.green import urllib2

    urls = [
        "https://www.google.com/intl/en_ALL/images/logo.gif",
        "http://python.org/images/python-logo.gif",
        "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif",
    ]

    def fetch(url):
        print("opening", url)
        body = urllib2.urlopen(url).read()
        print("done with", url)
        return url, body

    pool = eventlet.GreenPool(200)
    for url, body in pool.imap(fetch, urls):
        print("got body from", url, "of length", len(body))
Output:
    ('opening', 'https://www.google.com/intl/en_ALL/images/logo.gif')
    ('opening', 'http://python.org/images/python-logo.gif')
    ('opening', 'http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif')
    ('done with', 'http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif')
    ('done with', 'https://www.google.com/intl/en_ALL/images/logo.gif')
    ('got body from', 'https://www.google.com/intl/en_ALL/images/logo.gif', 'of length', 8558)
    ('done with', 'http://python.org/images/python-logo.gif')
    ('got body from', 'http://python.org/images/python-logo.gif', 'of length', 2549)
    ('got body from', 'http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif', 'of length', 1874)
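The first three "opening" lines show that three green threads were started concurrently, each one a container running a call to fetch. Note the order in which they were started.
The order of the "done with" lines further shows that imap dispatches the fetch calls concurrently. Note that when the fetch of y3.gif finished, its result was not handed to the main loop right away; it had to wait for the two green threads started before it, because imap yields results in the order the calls were dispatched. This also illustrates why we say that green threads within one OS thread take turns rather than running truly in parallel.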
2. Simple server
    import eventlet

    def handle(client):
        while True:
            c = client.recv(1)
            if not c:
                break
            client.sendall(c)

    server = eventlet.listen(('0.0.0.0', 6000))
    pool = eventlet.GreenPool(10000)
    while True:
        new_sock, address = server.accept()
        pool.spawn_n(handle, new_sock)
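server = eventlet.listen(('0.0.0.0', 6000)) creates the listening socket.
pool = eventlet.GreenPool(10000) creates a green thread pool that can serve up to 10000 client connections at once.
new_sock, address = server.accept() deserves attention: because the server socket is green, accept() blocks only the accepting green thread. While it waits, the hub runs the handler green threads, so existing connections keep being served as new ones arrive.
pool.spawn_n(handle, new_sock) starts a green thread for each client. The caller does not care about the result of handle; the client socket is handed over to the handle callback entirely.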
2.1
    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    """\
    This simple server listens on port 6000 and echoes back every line
    the user sends.

    Run this file to start the server, then connect to it with:

      telnet localhost 6000

    Disconnect by terminating telnet (usually Ctrl-] and then 'quit').
    """
    from __future__ import print_function

    import eventlet

    def handle(fd):
        print("client connected")
        while True:
            # pass through every non-eof line
            x = fd.readline()
            if not x:
                break
            fd.write(x)
            fd.flush()
            print("echoed", x, end=' ')
        print("client disconnected")

    print("server socket listening on port 6000")
    server = eventlet.listen(('0.0.0.0', 6000))
    pool = eventlet.GreenPool()
    while True:
        try:
            new_sock, address = server.accept()
            print("accepted", address)
            pool.spawn_n(handle, new_sock.makefile('rw'))
        except (SystemExit, KeyboardInterrupt):
            break
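3. Feed scraper
In this use case a server is at the same time a client of another service, as a proxy is. This is where GreenPile comes in.
In the example below the server receives a POST request from a client; the request body contains URLs of RSS feeds. The server fetches all the feeds from the feed servers concurrently and returns their titles to the client: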
    import eventlet
    feedparser = eventlet.import_patched('feedparser')

    pool = eventlet.GreenPool()

    def fetch_title(url):
        d = feedparser.parse(url)
        return d.feed.get('title', '')

    def app(environ, start_response):
        pile = eventlet.GreenPile(pool)
        for url in environ['wsgi.input'].readlines():
            pile.spawn(fetch_title, url)
        titles = '\n'.join(pile)
        start_response('200 OK', [('Content-type', 'text/plain')])
        return [titles]
The benefit of the green thread pool is bounded concurrency. Without it, a client could make the server open a large number of connections to the feed servers and get the server banned by them.
The complete example:
    """A simple web server that accepts POSTS containing a list of feed urls,
    and returns the titles of those feeds.
    """
    import eventlet
    feedparser = eventlet.import_patched('feedparser')

    # the pool provides a safety limit on our concurrency
    pool = eventlet.GreenPool()

    def fetch_title(url):
        d = feedparser.parse(url)
        return d.feed.get('title', '')

    def app(environ, start_response):
        if environ['REQUEST_METHOD'] != 'POST':
            start_response('403 Forbidden', [])
            return []

        # the pile collects the result of a concurrent operation -- in this case,
        # the collection of feed titles
        pile = eventlet.GreenPile(pool)
        for line in environ['wsgi.input'].readlines():
            url = line.strip()
            if url:
                pile.spawn(fetch_title, url)
        # since the pile is an iterator over the results,
        # you can use it in all sorts of great Pythonic ways
        titles = '\n'.join(pile)
        start_response('200 OK', [('Content-type', 'text/plain')])
        return [titles]

    if __name__ == '__main__':
        from eventlet import wsgi
        wsgi.server(eventlet.listen(('localhost', 9010)), app)
4. WSGI server
    """This is a simple example of running a wsgi application with eventlet.
    For a more fully-featured server which supports multiple processes,
    multiple threads, and graceful code reloading, see:

    http://pypi.python.org/pypi/Spawning/
    """
    import eventlet
    from eventlet import wsgi

    def hello_world(env, start_response):
        if env['PATH_INFO'] != '/':
            start_response('404 Not Found', [('Content-Type', 'text/plain')])
            return ['Not Found\r\n']
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return ['Hello, World!\r\n']

    wsgi.server(eventlet.listen(('', 8090)), hello_world)
5. Socket connect
    """Spawn multiple workers and collect their results.

    Demonstrates how to use the eventlet.green.socket module.
    """
    from __future__ import print_function

    import eventlet
    from eventlet.green import socket

    def geturl(url):
        c = socket.socket()
        ip = socket.gethostbyname(url)
        c.connect((ip, 80))
        print('%s connected' % url)
        c.sendall('GET /\r\n\r\n')
        return c.recv(1024)

    urls = ['www.google.com', 'www.yandex.ru', 'www.python.org']
    pile = eventlet.GreenPile()
    for x in urls:
        pile.spawn(geturl, x)

    # note that the pile acts as a collection of return values from the functions
    # if any exceptions are raised by the function they'll get raised here
    for url, result in zip(urls, pile):
        print('%s: %s' % (url, repr(result)[:50]))
6. Multi-user chat server
    import eventlet
    from eventlet.green import socket

    PORT = 3001
    participants = set()

    def read_chat_forever(writer, reader):
        line = reader.readline()
        while line:
            print("Chat:", line.strip())
            for p in participants:
                try:
                    if p is not writer:  # Don't echo
                        p.write(line)
                        p.flush()
                except socket.error as e:
                    # ignore broken pipes, they just mean the participant
                    # closed its connection already
                    if e[0] != 32:
                        raise
            line = reader.readline()
        participants.remove(writer)
        print("Participant left chat.")

    try:
        print("ChatServer starting up on port %s" % PORT)
        server = eventlet.listen(('0.0.0.0', PORT))
        while True:
            new_connection, address = server.accept()
            print("Participant joined chat.")
            new_writer = new_connection.makefile('w')
            participants.add(new_writer)
            eventlet.spawn_n(read_chat_forever,
                             new_writer,
                             new_connection.makefile('r'))
    except (KeyboardInterrupt, SystemExit):
        print("ChatServer exiting.")
7. Port forwarder
    """ This is an incredibly simple port forwarder from port 7000 to 22 on
    localhost.  It calls a callback function when the socket is closed, to
    demonstrate one way that you could start to do interesting things by
    starting from a simple framework like this.
    """
    import eventlet

    def closed_callback():
        print("called back")

    def forward(source, dest, cb=lambda: None):
        """Forwards bytes unidirectionally from source to dest"""
        while True:
            d = source.recv(32384)
            if d == '':
                cb()
                break
            dest.sendall(d)

    listener = eventlet.listen(('localhost', 7000))
    while True:
        client, addr = listener.accept()
        server = eventlet.connect(('localhost', 22))
        # two unidirectional forwarders make a bidirectional one
        eventlet.spawn_n(forward, client, server, closed_callback)
        eventlet.spawn_n(forward, server, client)
8. Recursive web crawler
    """This is a recursive web crawler.  Don't go pointing this at random sites;
    it doesn't respect robots.txt and it is pretty brutal about how quickly it
    fetches pages.

    The code for this is very short; this is perhaps a good indication
    that this is making the most effective use of the primitives at hand.
    The fetch function does all the work of making http requests,
    searching for new urls, and dispatching new fetches.  The GreenPool
    acts as sort of a job coordinator (and concurrency controller of
    course).
    """
    from __future__ import with_statement

    from eventlet.green import urllib2
    import eventlet
    import re

    # http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
    url_regex = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')

    def fetch(url, seen, pool):
        """Fetch a url, stick any found urls into the seen set, and
        dispatch any new ones to the pool."""
        print("fetching", url)
        data = ''
        with eventlet.Timeout(5, False):
            data = urllib2.urlopen(url).read()
        for url_match in url_regex.finditer(data):
            new_url = url_match.group(0)
            # only send requests to eventlet.net so as not to destroy the internet
            if new_url not in seen and 'eventlet.net' in new_url:
                seen.add(new_url)
                # while this seems stack-recursive, it's actually not:
                # spawned greenthreads start their own stacks
                pool.spawn_n(fetch, new_url, seen, pool)

    def crawl(start_url):
        """Recursively crawl starting from *start_url*.  Returns a set of
        urls that were found."""
        pool = eventlet.GreenPool()
        seen = set()
        fetch(start_url, seen, pool)
        pool.waitall()
        return seen

    seen = crawl("http://eventlet.net")
    print("I saw these urls:")
    print("\n".join(seen))
9. Producer/consumer web crawler
    """This is a recursive web crawler.  Don't go pointing this at random sites;
    it doesn't respect robots.txt and it is pretty brutal about how quickly it
    fetches pages.

    This is a kind of "producer/consumer" example; the fetch function produces
    jobs, and the GreenPool itself is the consumer, farming out work concurrently.
    It's easier to write it this way rather than writing a standard consumer loop;
    GreenPool handles any exceptions raised and arranges so that there's a set
    number of "workers", so you don't have to write that tedious management code
    yourself.
    """
    from __future__ import with_statement

    from eventlet.green import urllib2
    import eventlet
    import re

    # http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
    url_regex = re.compile(r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))')

    def fetch(url, outq):
        """Fetch a url and push any urls found into a queue."""
        print("fetching", url)
        data = ''
        with eventlet.Timeout(5, False):
            data = urllib2.urlopen(url).read()
        for url_match in url_regex.finditer(data):
            new_url = url_match.group(0)
            outq.put(new_url)

    def producer(start_url):
        """Recursively crawl starting from *start_url*.  Returns a set of
        urls that were found."""
        pool = eventlet.GreenPool()
        seen = set()
        q = eventlet.Queue()
        q.put(start_url)
        # keep looping if there are new urls, or workers that may produce more urls
        while True:
            while not q.empty():
                url = q.get()
                # limit requests to eventlet.net so we don't crash all over the internet
                if url not in seen and 'eventlet.net' in url:
                    seen.add(url)
                    pool.spawn_n(fetch, url, q)
            pool.waitall()
            if q.empty():
                break

        return seen

    seen = producer("http://eventlet.net")
    print("I saw these urls:")
    print("\n".join(seen))
10. Websocket server
    import eventlet
    from eventlet import wsgi
    from eventlet import websocket
    from eventlet.support import six

    # demo app
    import os
    import random

    @websocket.WebSocketWSGI
    def handle(ws):
        """ This is the websocket handler function.  Note that we
        can dispatch based on path in here, too."""
        if ws.path == '/echo':
            while True:
                m = ws.wait()
                if m is None:
                    break
                ws.send(m)

        elif ws.path == '/data':
            for i in six.moves.range(10000):
                ws.send("0 %s %s\n" % (i, random.random()))
                eventlet.sleep(0.1)

    def dispatch(environ, start_response):
        """ This resolves to the web page or the websocket depending on
        the path."""
        if environ['PATH_INFO'] == '/data':
            return handle(environ, start_response)
        else:
            start_response('200 OK', [('content-type', 'text/html')])
            return [open(os.path.join(
                         os.path.dirname(__file__),
                         'websocket.html')).read()]

    if __name__ == "__main__":
        # run an example app from the command line
        listener = eventlet.listen(('127.0.0.1', 7000))
        print("\nVisit http://localhost:7000/ in your websocket-capable browser.\n")
        wsgi.server(listener, dispatch)
11. Websocket multi-user chat
    import os

    import eventlet
    from eventlet import wsgi
    from eventlet import websocket

    PORT = 7000

    participants = set()

    @websocket.WebSocketWSGI
    def handle(ws):
        participants.add(ws)
        try:
            while True:
                m = ws.wait()
                if m is None:
                    break
                for p in participants:
                    p.send(m)
        finally:
            participants.remove(ws)

    def dispatch(environ, start_response):
        """Resolves to the web page or the websocket depending on the path."""
        if environ['PATH_INFO'] == '/chat':
            return handle(environ, start_response)
        else:
            start_response('200 OK', [('content-type', 'text/html')])
            html_path = os.path.join(os.path.dirname(__file__), 'websocket_chat.html')
            return [open(html_path).read() % {'port': PORT}]

    if __name__ == "__main__":
        # run an example app from the command line
        listener = eventlet.listen(('127.0.0.1', PORT))
        print("\nVisit http://localhost:7000/ in your websocket-capable browser.\n")
        wsgi.server(listener, dispatch)