Web Scraping with Multithreading, Multiprocessing, and a Custom Async IO Framework
What is a process?
A process is an instance of a running program and the independent unit the operating system uses for resource allocation and scheduling. It has its own address space and resources, and contains one or more threads.
What is a thread?
A thread can be seen as a lightweight process; it is the basic unit of CPU scheduling and dispatch.
Differences between processes and threads?
- Scheduling: as the definitions above suggest, the thread is the basic unit of CPU scheduling and dispatch, while the process is the basic unit of resource ownership.
- Address space and resources: each process has its own independent address space and resources, so sharing is complicated and requires IPC, but synchronization is simple; threads share their process's resources, so sharing is easy, but synchronization is complex and needs locks and similar measures.
- Memory and CPU: processes use more memory, switch at higher cost, and make poorer use of the CPU; threads use less memory, switch cheaply, and make better use of the CPU.
- Mutual impact: processes are isolated and do not affect one another; threads share a process, so one thread crashing can take the whole process down.
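The shared-address-space point can be made concrete with a small sketch (the names here are illustrative): several threads append to one list, which works without any IPC because they all live in the same process.

```python
import threading

results = []  # a single object in the process's address space

def worker(n):
    # Every thread sees the same `results` list, so no IPC is needed.
    # (Real code would guard shared state with a lock.)
    results.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3]: every thread wrote into one shared list
```

With processes instead of threads, each child would get its own copy of `results` and the parent's list would stay empty; that is exactly why inter-process sharing needs IPC.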
When to use which?
CPU-bound work: use processes. CPython has a GIL (Global Interpreter Lock), so within one process only one thread can execute Python bytecode at a time.
IO-bound work: use threads, because the GIL is released while a thread waits on IO, so the waits can overlap.
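A rough sketch of why threads pay off for IO-bound work. Here `time.sleep` stands in for a network wait (an assumption for the sake of a self-contained example); sleeping releases the GIL, so the four waits overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(_):
    # Stands in for a network round-trip; the GIL is released
    # while sleeping, so other threads can run in the meantime.
    time.sleep(0.2)

start = time.time()
for i in range(4):
    fake_io(i)
sequential = time.time() - start

start = time.time()
with ThreadPoolExecutor(4) as pool:
    list(pool.map(fake_io, range(4)))
threaded = time.time() - start

print(sequential, threaded)  # the threaded run overlaps the four waits
```

Swap `time.sleep` for a CPU-bound loop and the threaded version stops winning, because only one thread can hold the GIL while computing; that is the case where processes help.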
The concepts of synchronous and asynchronous:
Synchronous means the user thread issues an IO request and then must wait, or poll, until the kernel completes the IO operation before it can continue.
Asynchronous means the user thread keeps running after issuing the IO request; when the kernel finishes the IO operation it notifies the thread, or invokes a callback the thread registered.
The concepts of blocking and non-blocking:
Blocking means an IO call only returns to user space once the operation has fully completed.
Non-blocking means an IO call returns a status value immediately, without waiting for the operation to finish.
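A minimal illustration of the non-blocking case, using a local socket pair in place of a real server (an assumption made so the sketch needs no network):

```python
import socket

# A connected pair of local sockets stands in for a network connection.
a, b = socket.socketpair()
a.setblocking(False)  # switch one end into non-blocking mode

try:
    a.recv(1024)  # nothing has been sent yet
    outcome = 'data was already there'
except BlockingIOError:
    # Non-blocking: the call returns immediately with a "would block"
    # status instead of waiting for data to arrive.
    outcome = 'would block'

b.send(b'hello')
data = a.recv(1024)  # data is now buffered, so the same call succeeds

a.close()
b.close()
print(outcome, data)
```

A blocking socket would simply have hung on the first `recv` until the peer sent something; the status-value-instead-of-waiting behavior is the whole difference.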
A fun way to understand processes and threads:
See this blog post: http://www.ruanyifeng.com/blog/2013/04/processes_and_threads.html
I. Multithreading
Threads give you concurrency, but between sending a request and receiving its response each thread sits idle. Two ways to write it:
- handle the return value directly
- handle it through a callback
Approach 1:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Thread pool
from concurrent.futures import ThreadPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    print(url, response)

pool = ThreadPoolExecutor(6)
url_list = [
    'http://huaban.com/favorite/beauty/',
    'https://www.bing.com/',
    'https://www.baidu.com/',
    'https://www.sina.com/',
    'https://www.zhihu.com/',
    'https://www.tencent.com/',
]
for url in url_list:
    pool.submit(task, url)
pool.shutdown(wait=True)
Approach 2:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Thread pool, with a callback
from concurrent.futures import ThreadPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    return response

def done(future, *args, **kwargs):
    response = future.result()
    print(response)

pool = ThreadPoolExecutor(6)
url_list = [
    'http://huaban.com/favorite/beauty/',
    'https://www.bing.com/',
    'https://www.baidu.com/',
    'https://www.sina.com/',
    'https://www.zhihu.com/',
    'https://www.tencent.com/',
]
for url in url_list:
    v = pool.submit(task, url)
    v.add_done_callback(done)
pool.shutdown(wait=True)
II. Multiprocessing
Processes also give you concurrency, but between sending a request and receiving its response each process sits idle. Two ways to write it:
- handle the return value directly
- handle it through a callback
Approach 1:
from concurrent.futures import ProcessPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    print(url, response)
    # parse the page with a regular expression here

if __name__ == '__main__':  # required: process pools re-import this module when spawning workers
    pool = ProcessPoolExecutor(7)
    url_list = [
        'http://www.cnblogs.com/wupeiqi',
        'http://huaban.com/favorite/beauty/',
        'http://www.bing.com',
        'http://www.zhihu.com',
        'http://www.sina.com',
        'http://www.baidu.com',
        'http://www.autohome.com.cn',
    ]
    for url in url_list:
        pool.submit(task, url)
    pool.shutdown(wait=True)
Approach 2:
from concurrent.futures import ProcessPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    return response

def done(future, *args, **kwargs):
    response = future.result()
    print(response.status_code, response.content)

if __name__ == '__main__':  # required: process pools re-import this module when spawning workers
    pool = ProcessPoolExecutor(7)
    url_list = [
        'http://www.cnblogs.com/wupeiqi',
        'http://huaban.com/favorite/beauty/',
        'http://www.bing.com',
        'http://www.zhihu.com',
        'http://www.sina.com',
        'http://www.baidu.com',
        'http://www.autohome.com.cn',
    ]
    for url in url_list:
        v = pool.submit(task, url)
        v.add_done_callback(done)
    pool.shutdown(wait=True)
III. Coroutines (micro-threads) + async IO = one thread sending N HTTP requests
Async IO (the asyncio module)
1. A single thread with pseudo-concurrency: two tasks are handled at once; all the requests go out together, and each task resumes as soon as its result comes back.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import asyncio

@asyncio.coroutine
def task():
    print('before...task......')
    yield from asyncio.sleep(5)  # stands in for an HTTP request; TCP is supported too
    print('end...task......')

tasks = [task(), task()]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
Output:
before...task......
before...task......
end...task......
end...task......
2. A single thread sending HTTP requests at the socket level (asyncio performs the async IO internally)
asyncio itself only provides the TCP transport; the HTTP packet has to be assembled by hand.
Installation: asyncio ships with the Python 3.4+ standard library, so no separate installation is needed.
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import asyncio

@asyncio.coroutine
def task(host, url='/'):
    print('start', host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)  # open the TCP connection

    request_header_content = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (url, host,)  # raw HTTP request format
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()  # read the response
    print('end', host, url, text)
    writer.close()

# two tasks
tasks = [
    task('www.cnblogs.com', '/wupeiqi/'),
    task('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
3. asyncio (the async IO) + aiohttp (which builds the HTTP packets internally)
Installation:
pip3 install aiohttp
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import aiohttp
import asyncio

@asyncio.coroutine
def fetch_async(url):
    print(url)
    # Note: calling aiohttp.request() as a coroutine is the old (pre-1.0)
    # API; modern aiohttp uses aiohttp.ClientSession instead.
    response = yield from aiohttp.request('GET', url)
    print(url, response)
    response.close()

tasks = [fetch_async('http://www.baidu.com/'), fetch_async('http://www.chouti.com/')]

event_loop = asyncio.get_event_loop()
results = event_loop.run_until_complete(asyncio.gather(*tasks))
event_loop.close()
4. asyncio + the requests module
Installation:
pip3 install requests
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import asyncio
import requests

@asyncio.coroutine
def task(func, *args):
    print(func, args)
    loop = asyncio.get_event_loop()
    # run the blocking call, e.g. requests.get('http://www.cnblogs.com/wupeiqi/'),
    # in the loop's default thread pool
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)

tasks = [
    task(requests.get, 'http://www.cnblogs.com/wupeiqi/'),
    task(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
IV. gevent (built on greenlet, a coroutine module) + async IO via the requests module
A coroutine is a micro-thread: you decide what the thread runs first and what it runs next.
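Since the point is just the switching order, it can be sketched with plain generators; this is a rough stand-in for greenlets, not gevent's actual mechanism.

```python
log = []

def worker(name):
    # Each `yield` is a switch point: the coroutine gives up control,
    # much like a greenlet parking itself while it waits on IO.
    log.append(name + ' step 1')
    yield
    log.append(name + ' step 2')
    yield

# A toy round-robin scheduler: advance each coroutine to its next yield.
tasks = [worker('a'), worker('b')]
while tasks:
    for t in list(tasks):
        try:
            next(t)
        except StopIteration:
            tasks.remove(t)

print(log)  # steps interleave: ['a step 1', 'b step 1', 'a step 2', 'b step 2']
```

gevent adds the missing pieces on top of this idea: greenlets switch automatically whenever a socket operation would block, so the "scheduler" is driven by IO readiness rather than a fixed round-robin.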
Installation:
pip3 install greenlet
pip3 install gevent
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import gevent
import requests
from gevent import monkey

monkey.patch_all()  # patch the standard library's sockets to use gevent's async IO

def task(method, url, req_kwargs):
    # each spawned task lands here
    print(method, url, req_kwargs)
    # requests.get() is just a shortcut for requests.request()
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)

##### send the requests #####
# gevent.joinall([
#     gevent.spawn(task, method='get', url='https://www.python.org/', req_kwargs={}),
#     gevent.spawn(task, method='get', url='https://www.yahoo.com/', req_kwargs={}),
#     gevent.spawn(task, method='get', url='https://github.com/', req_kwargs={}),
# ])

##### send the requests (a coroutine pool caps how many go out at once) #####
from gevent.pool import Pool
pool = Pool(5)
gevent.joinall([
    pool.spawn(task, method='get', url='https://www.python.org/', req_kwargs={}),
    pool.spawn(task, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    pool.spawn(task, method='get', url='https://www.github.com/', req_kwargs={}),
])
Output:
get https://www.python.org/ {}
get https://www.yahoo.com/ {}
get https://www.github.com/ {}
(remaining output omitted...)
V. The grequests module = gevent + requests
Installation:
pip3 install grequests
Code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import grequests

request_list = [
    grequests.get('http://httpbin.org/delay/1', timeout=0.001),
    grequests.get('http://fakedomain/'),
    grequests.get('http://httpbin.org/status/500')
]

# execute all requests and collect the responses
request_list = grequests.map(request_list)
print(request_list)
Output:
[None, None, <Response [500]>]
VI. Twisted (one thread, many requests)
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from twisted.internet import defer
from twisted.web.client import getPage
from twisted.internet import reactor

def one_done(arg):
    print(arg)

def all_done(arg):
    print('done')
    reactor.stop()

@defer.inlineCallbacks
def task(url):
    res = getPage(bytes(url, encoding='utf8'))  # send the HTTP request
    res.addCallback(one_done)
    yield res

url_list = [
    'http://www.cnblogs.com',
    'http://www.cnblogs.com',
    'http://www.cnblogs.com',
    'http://www.cnblogs.com',
]

defer_list = []  # a list of Deferreds (the requests have already been sent)
for url in url_list:
    v = task(url)
    defer_list.append(v)

d = defer.DeferredList(defer_list)
d.addBoth(all_done)

reactor.run()  # the event loop
VII. Tornado (one thread, many requests)
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

COUNT = 0

def handle_response(response):
    global COUNT
    COUNT -= 1
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)
    # same idea as with Twisted: stop the loop once every request is done
    # ioloop.IOLoop.current().stop()
    if COUNT == 0:
        ioloop.IOLoop.current().stop()

def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    global COUNT
    COUNT = len(url_list)
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()  # the event loop
Output:
http://www.baidu.com
http://www.bing.com
VIII. A custom async IO framework
import socket
import select

# ########################## The essence of an HTTP request: blocking ##########################
"""
sk = socket.socket()

# 1. Connect
sk.connect(('www.baidu.com', 80,))  # blocks on IO
print('Connected...')

# 2. Send the request once connected
sk.send(b'GET / HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n')
# sk.send(b'POST / HTTP/1.0\r\nHost:www.baidu.com\r\n\r\nk1=v1&k2=v2')

# 3. Wait for the server's response
data = sk.recv(8096)  # blocks on IO
print(data)

# Close the connection
sk.close()
"""

# ########################## The essence of an HTTP request: non-blocking ##########################
"""
sk = socket.socket()
sk.setblocking(False)

# 1. Connect
try:
    sk.connect(('www.baidu.com', 80,))  # no longer blocks
    print('Connected...')
except BlockingIOError as e:
    print(e)

# 2. Send the request once connected
sk.send(b'GET / HTTP/1.0\r\nHost:www.baidu.com\r\n\r\n')
# sk.send(b'POST / HTTP/1.0\r\nHost:www.baidu.com\r\n\r\nk1=v1&k2=v2')

# 3. Wait for the server's response
data = sk.recv(8096)
print(data)

# Close the connection
sk.close()
"""

# ########################## Async, non-blocking IO ##########################
# IO is just reading and writing.

class HttpRequest:
    def __init__(self, sk, host, callback):
        self.socket = sk
        self.host = host
        self.callback = callback

    def fileno(self):
        return self.socket.fileno()


class HttpResponse:
    def __init__(self, recv_data):
        self.recv_data = recv_data
        self.header_dict = {}
        self.body = None
        self.initialize()

    def initialize(self):
        headers, body = self.recv_data.split(b'\r\n\r\n', 1)
        self.body = body
        header_list = headers.split(b'\r\n')
        for h in header_list:
            h_str = str(h, encoding='utf-8')
            v = h_str.split(':', 1)
            if len(v) == 2:
                self.header_dict[v[0]] = v[1]


class AsyncRequest:
    def __init__(self):
        self.conn = []
        self.connection = []  # sockets still waiting for their connection to succeed

    def add_request(self, host, callback):
        try:
            sk = socket.socket()
            sk.setblocking(0)
            sk.connect((host, 80,))
        except BlockingIOError as e:
            pass
        request = HttpRequest(sk, host, callback)
        self.conn.append(request)
        self.connection.append(request)

    def run(self):
        while True:
            rlist, wlist, elist = select.select(self.conn, self.connection, self.conn, 0.05)
            for w in wlist:
                # being writable means the connection to the server has succeeded
                print(w.host, 'connected...')
                tpl = "GET / HTTP/1.0\r\nHost:%s\r\n\r\n" % (w.host,)
                w.socket.send(bytes(tpl, encoding='utf-8'))
                self.connection.remove(w)
            for r in rlist:
                # r is an HttpRequest
                recv_data = bytes()
                while True:
                    try:
                        chunck = r.socket.recv(8096)  # max bytes per read
                        if not chunck:  # empty read: the peer closed the connection
                            break
                        recv_data += chunck
                    except Exception as e:
                        break
                response = HttpResponse(recv_data)
                r.callback(response)
                r.socket.close()
                self.conn.remove(r)
            if len(self.conn) == 0:
                break


def f1(response):
    print('save to a file', response.header_dict)


def f2(response):
    print('save to a database', response.header_dict)


url_list = [
    {'host': 'www.baidu.com', 'callback': f1},
    {'host': 'cn.bing.com', 'callback': f2},
    {'host': 'www.cnblogs.com', 'callback': f2},
]

req = AsyncRequest()
for item in url_list:
    req.add_request(item['host'], item['callback'])
req.run()
Note: modules used in this post (Python 3.x)
# Install commands (via the Douban PyPI mirror):
pip3 install aiohttp -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip3 install greenlet -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip3 install gevent -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip3 install grequests -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
pip3 install tornado -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Reference: http://www.cnblogs.com/wupeiqi/articles/6229292.html