Day 31: Event / Process Pools and Thread Pools / High-Performance Scraping of Pear Video / Coroutines

Event
Thread pools and process pools
High-performance scraping of Pear Video
Coroutines
- gevent

Event

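As with processes, a key characteristic of threads is that each one runs independently and its state is unpredictable. When other threads in a program need to decide what to do next based on the state of a particular thread, thread synchronization becomes tricky. To solve this, we can use the Event object from the threading library. An Event object holds a signal flag that threads can set, and it lets threads wait for some event to occur. Initially, the flag is false. If a thread waits on an Event object whose flag is false, that thread blocks until the flag becomes true. When a thread sets the flag of an Event object to true, it wakes up all threads waiting on that Event object. If a thread waits on an Event object whose flag is already true, the wait returns immediately and the thread continues executing.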

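from threading import Thread, Event
import time


e = Event()


def light():
    print('Red light is on...')
    time.sleep(3)
    e.set()  # wake up every thread waiting on this Event object
    print('Green light is on...')


def car(i):
    print(f'Car {i} is waiting at the red light...')
    e.wait()  # block until the Event flag is set
    print(f'Car {i} starts moving...')


if __name__ == '__main__':
    # Start one traffic-light thread and several car threads
    Thread(target=light).start()
    for i in range(3):
        Thread(target=car, args=(i,)).start()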
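
The flag behavior described above (a false flag blocks, a set flag lets wait() return at once) can be checked with a short sketch. It uses only threading.Event methods from the standard library (is_set, set, clear, and wait with a timeout); the 0.1-second timeout is an arbitrary value chosen for illustration.

```python
from threading import Event

e = Event()

# The flag starts out false
initial = e.is_set()

# With a timeout, wait() returns False if the flag is still false when time runs out
timed_out_result = e.wait(timeout=0.1)

# Once the flag is set, wait() returns True immediately instead of blocking
e.set()
after_set = e.wait()

# clear() resets the flag to false so the Event can be reused
e.clear()
after_clear = e.is_set()

print(initial, timed_out_result, after_set, after_clear)
```

Passing a timeout to wait() is a common way to avoid a thread blocking forever if the signal never arrives.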

Thread pools and process pools

In Python 2.7, the futures module must be installed manually.

In Python 3.x, concurrent.futures is part of the standard library.

The concurrent.futures module provides two classes: ProcessPoolExecutor and ThreadPoolExecutor.

ProcessPoolExecutor

ProcessPoolExecutor uses a pool of processes to execute calls asynchronously.

pool = ProcessPoolExecutor([max_workers])

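max_workers is the maximum number of worker processes; if omitted, it defaults to the number of CPUs on the machine.

pool.submit(fn)  # pass the function object itself, optionally followed by its arguments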

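from concurrent.futures import ProcessPoolExecutor
import time


def task():
    print('Process task started...')
    time.sleep(1)
    print('Process task finished...')


if __name__ == '__main__':
    # Create the pool under the main guard so child processes can import this module safely
    pool = ProcessPoolExecutor()

    for line in range(5):
        pool.submit(task)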

ThreadPoolExecutor

ThreadPoolExecutor uses a pool of threads to execute calls asynchronously.

pool = ThreadPoolExecutor([max_workers])

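max_workers is the maximum number of worker threads; if omitted, the default is the number of CPUs multiplied by 5 on Python 3.5 to 3.7, and min(32, os.cpu_count() + 4) from Python 3.8 onward.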

from concurrent.futures import ThreadPoolExecutor
import time


pool = ThreadPoolExecutor()


def task():
    print('Thread task started...')
    time.sleep(1)
    print('Thread task finished...')


for line in range(5):
    pool.submit(task)
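
submit() returns a Future object, which is how a return value gets back to the caller. As a minimal sketch (the square function and the pool size of 3 are made up for illustration), the following collects results with Future.result() and with pool.map(), and uses the executor as a context manager so the pool is shut down when the block exits.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    # Hypothetical task: return the square of n
    return n * n


with ThreadPoolExecutor(max_workers=3) as pool:
    # submit() schedules the call and immediately returns a Future
    futures = [pool.submit(square, n) for n in range(5)]
    # result() blocks until that particular task has finished
    submitted_results = [f.result() for f in futures]

    # map() yields results in the same order as the inputs
    mapped_results = list(pool.map(square, range(5)))

print(submitted_results)
print(mapped_results)
```

The same pattern should work with ProcessPoolExecutor, provided the pool is created under an if __name__ == '__main__': guard.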

Callback functions

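from concurrent.futures import ThreadPoolExecutor
import time


pool = ThreadPoolExecutor()


def task(m):
    print('Thread task started...')
    time.sleep(1)
    print(m)
    print('Thread task finished...')
    return 123


def call_back(res):
    # The callback receives the Future object, not the return value
    print(type(res))
    # Note: do not reuse the name res for the unwrapped result
    res2 = res.result()
    print(res2)


for line in range(5):
    # Extra positional arguments are passed through to task
    pool.submit(task, 'I am the main function').add_done_callback(call_back)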
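
One detail the callback above does not cover: if the task raises, calling result() inside the callback re-raises that exception. The following is a hedged sketch, with a hypothetical risky task and on_done callback, that checks Future.exception() first.

```python
from concurrent.futures import ThreadPoolExecutor

outcomes = []


def risky(n):
    # Hypothetical task: fails for odd inputs
    if n % 2:
        raise ValueError(f'odd input: {n}')
    return n * 10


def on_done(future):
    # exception() returns None when the task completed normally
    err = future.exception()
    if err is None:
        outcomes.append(('ok', future.result()))
    else:
        outcomes.append(('error', str(err)))


with ThreadPoolExecutor(max_workers=2) as pool:
    for n in range(4):
        pool.submit(risky, n).add_done_callback(on_done)

# Leaving the with block waits for all tasks, so every callback has run by now
print(sorted(outcomes))
```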

High-performance scraping of Pear Video

import requests
import re
from concurrent.futures import ThreadPoolExecutor
import uuid

# url = 'https://www.pearvideo.com/'
pool = ThreadPoolExecutor(2000)


def get_page(url):
    response = requests.get(url)
    # print(response.text)
    return response


def parse_index(response):
    id_list = re.findall('<a href="video_(.*?)" .*?>', response.text, re.S)
    # print(id_list)
    return id_list


def parse_detail(res):
    video_response = res.result()
    video_detail_url = re.findall('srcUrl="(.*?)"', video_response.text, re.S)[0]
    # print(video_detail_url)
    # res = get_page(video_detail_url)
    # save_video(res)
    # pool.submit(get_page, video_detail_url).add_done_callback(save_video)
    pool.submit(get_page, video_detail_url).add_done_callback(save_video)
    # return video_detail_url


def save_video(res):
    video_response = res.result()
    # video_bytes = video_response.content
    name = uuid.uuid4()
    print(f'Saving video {name}.mp4...')
    with open(f'{name}.mp4', 'wb') as f:
        f.write(video_response.content)
    print('Download complete!')


if __name__ == '__main__':
    index_response = get_page('https://www.pearvideo.com/')

    id_list = parse_index(index_response)
    # print(id_list)

    for video_id in id_list:
        video_url = 'https://www.pearvideo.com/video_' + video_id
        # print(video_url)

        pool.submit(get_page, video_url).add_done_callback(parse_detail)
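
The two regular expressions carry most of the parsing logic, so it can help to try them on a small string before hitting the network. The HTML fragments below are invented for illustration and are only assumed to resemble the site's markup; the real pages may differ or may have changed.

```python
import re

# Hypothetical index-page fragment (not copied from the real site)
index_html = '''
<a href="video_1001" class="item">first</a>
<a href="video_1002" class="item">second</a>
'''

# Hypothetical detail-page fragment containing the video source URL
detail_html = 'var contId="1001",srcUrl="https://example.com/videos/1001.mp4",vdoUrl=srcUrl;'

# Same patterns as parse_index and parse_detail above
id_list = re.findall('<a href="video_(.*?)" .*?>', index_html, re.S)
src_url = re.findall('srcUrl="(.*?)"', detail_html, re.S)[0]

# Build the detail-page URLs the same way the main block does
detail_urls = ['https://www.pearvideo.com/video_' + video_id for video_id in id_list]

print(id_list)
print(src_url)
print(detail_urls)
```

Note that indexing with [0] raises IndexError when the pattern finds nothing, so a real crawler would probably want to check for an empty match list first.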

Coroutines

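Threads exist at the system level and are scheduled by the operating system, whereas coroutines exist at the program level and are scheduled by the program itself as needed. A thread contains many functions, which we call subroutines. While one subroutine is running, it can be suspended so that another subroutine runs, and that other subroutine can in turn be suspended to resume the first one; this mechanism is called a coroutine. In other words, within a single thread, a piece of code pauses partway through, control jumps to other code, and execution later resumes from the point where it paused, similar to what yield does.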
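
The comparison with yield can be made concrete with plain generators. This is only an illustrative sketch: the worker function and the round-robin scheduler are hypothetical names written for this example, not part of any library.

```python
def worker(name, steps, log):
    # Each yield is a point where this subroutine pauses and hands control back
    for step in range(steps):
        log.append(f'{name} step {step}')
        yield


def run_round_robin(tasks):
    # A tiny hand-written scheduler: resume each generator in turn until all finish
    tasks = list(tasks)
    while tasks:
        for task in list(tasks):
            try:
                next(task)
            except StopIteration:
                tasks.remove(task)


log = []
run_round_robin([worker('A', 2, log), worker('B', 3, log)])
print(log)
```

Both workers make progress in interleaved steps inside a single thread, and each one resumes exactly where it last paused.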

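Advantages of coroutines:

  (1) No thread context-switch overhead; coroutines avoid unnecessary scheduling, which can improve performance.

  (2) No overhead from locking atomic operations or synchronization.

  (3) Control flow is easy to switch, which simplifies the programming model.

  (4) High concurrency, high scalability, and low cost: a single CPU can support tens of thousands of coroutines without trouble, which makes them well suited to high-concurrency workloads.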

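Disadvantages of coroutines:

  (1) Cannot use multiple cores: coroutines run inside a single thread, so they cannot use several cores of a CPU at once. They need to be combined with processes to run on multiple CPUs. Most everyday applications do not need this, unless they are CPU-bound.

  (2) A blocking operation (such as blocking I/O) blocks the entire program.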

gevent

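gevent is a third-party module; it can be installed by running pip3 install gevent in cmd.

The gevent module provides its own sleep function. When a greenlet calls it, execution switches to another greenlet that is ready to run, so the call acts as a manually placed switch point. After monkey.patch_all(), blocking standard-library calls such as time.sleep are replaced with cooperative versions that switch in the same way.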

from gevent import monkey, spawn, joinall
monkey.patch_all()
import time


def func1():
    print('1')
    # I/O operation
    time.sleep(1)


def func2():
    print('2')
    time.sleep(3)


def func3():
    print('3')
    time.sleep(5)


start_time = time.time()
s1 = spawn(func1)
s2 = spawn(func2)
s3 = spawn(func3)

s1.join()
s2.join()
s3.join()

# joinall([s1, s2, s3])  # same effect

end_time = time.time()

print(end_time - start_time)

1
2
3
5.006286382675171

posted @ 2019-10-25 19:45 二二二二白、