Concurrently Crawling a Website's Images

The target site's images:
Open a theme page via "https://photo.fengniao.com/#p=4" (portraits).
The page shows a few dozen thumbnails together with their link targets; clicking a thumbnail loads the full-size image.
To download the full-size image behind each thumbnail, visiting and saving every link serially would be very slow.
1. Fetching the images with multiple threads
```python
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def get_paths(path, regex, code):
    """
    :param path: page URL
    :param regex: XPath expression used to parse the page
    :param code: encoding
    :return: list of matches extracted from the page with the XPath rule
    """
    resp = requests.get(path)
    if resp.status_code == 200:
        select = etree.HTML(resp.text)
        paths = select.xpath(regex)
        return paths


def save_pic(path, pic_name, directory):
    """
    :param path: image URL
    :param pic_name: file name to save the image as
    :param directory: directory to save the image into
    """
    resp = requests.get(path, stream=True)
    if resp.status_code == 200:
        with open('{}/{}.jpg'.format(directory, pic_name), 'wb') as f:
            f.write(resp.content)


if __name__ == '__main__':
    paths = get_paths('https://photo.fengniao.com/#p=4',
                      '//a[@class="pic"]/@href', 'utf-8')
    paths = ['https://photo.fengniao.com/' + p for p in paths]

    # Collect every full-size image URL; freeze the parse rule and encoding
    p = partial(get_paths, regex='//img[@class="picBig"]/@src', code='utf-8')
    with ThreadPoolExecutor() as executor:
        res = executor.map(p, paths)
        big_paths = [i[0] for i in res]  # URLs of all full-size images

    # Save the images; freeze the target directory
    p = partial(save_pic, directory='fn_pics')
    with ThreadPoolExecutor() as executor:
        res = executor.map(p, big_paths, range(len(big_paths)))
        [r for r in res]  # drain the lazy iterator so all tasks complete
```
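The core pattern above is freezing the constant arguments with `functools.partial` so that `ThreadPoolExecutor.map` only has to supply the varying ones, passing two iterables in lockstep when there are two varying arguments. A minimal sketch of that pattern, with a hypothetical `save` stand-in instead of a real network download so it runs anywhere:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def save(url, index, directory):
    # Stand-in for a real download-and-save: report the target file name.
    return '{}/{}.jpg <- {}'.format(directory, index, url)


urls = ['a', 'b', 'c']
# Freeze the directory; map supplies the two varying arguments in lockstep.
f = partial(save, directory='fn_pics')
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(f, urls, range(len(urls))))

print(results[0])  # fn_pics/0.jpg <- a
```

Note that `executor.map` returns results in input order regardless of which thread finishes first, which is why the script above can pair `big_paths` with `range(len(big_paths))` safely.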