Fast, Efficient Image Scraping with a Multithreaded Crawler

6.23 Self-summary


Building on the earlier scraping code, we wrap the logic in functions and add multithreading.

The earlier code: https://www.cnblogs.com/pythonywy/p/11066842.html

Module to import: from concurrent import futures

ex = futures.ThreadPoolExecutor(max_workers=22)  # set the number of worker threads

ex.submit(function, arguments to pass to the function)
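As a minimal sketch of the two calls above, the following uses a hypothetical `square` function in place of a real download task. `submit` returns a `Future` right away, and calling `result()` on it blocks until the value is ready. The `with` block is one way to make sure the pool is shut down after all tasks finish.

```python
from concurrent import futures

def square(n):
    # Hypothetical stand-in for a real task such as downloading one image
    return n * n

with futures.ThreadPoolExecutor(max_workers=4) as ex:
    # submit(function, *args) schedules the call and returns a Future immediately
    pending = [ex.submit(square, n) for n in range(5)]
    # result() blocks until that particular task has finished
    results = [f.result() for f in pending]

print(results)
```

Note that `result()` also re-raises any exception thrown inside the task, so it is worth calling when failures should not pass silently.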

import os
import requests
from lxml import etree
from concurrent import futures  # thread pool

url = 'http://www.doutula.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',}
def img_url_lis(url):
    # Fetch one listing page and return the image URLs found on it
    response = requests.get(url, headers=headers)
    response.encoding = 'utf8'
    response_html = etree.HTML(response.text)
    # The image URL is read from each <img> tag's data-original attribute
    return response_html.xpath('.//img/@data-original')


# Create the image folder next to this script
img_file_path = os.path.join(os.path.dirname(__file__), 'img')
if not os.path.exists(img_file_path):  # create the folder if it does not exist yet
    os.mkdir(img_file_path)
print(img_file_path)

def dump_one_img(url):
    # Use the last path segment of the URL as the file name
    name = str(url).split('/')[-1]
    response = requests.get(url, headers=headers)
    img_path = os.path.join(img_file_path, name)
    with open(img_path, 'wb') as fw:
        fw.write(response.content)


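def dump_imgs(urls: list):
    # Create one pool for the whole batch instead of a new pool per URL;
    # leaving the with block waits for every download to finish
    with futures.ThreadPoolExecutor(max_workers=22) as ex:
        for url in urls:
            ex.submit(dump_one_img, url)  # function, then its argument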


def run():
    count = 1
    while True:
        if count == 10:  # skip page 10
            count += 1
            continue
        lis = img_url_lis(f'http://www.doutula.com/article/list/?page={count}')
        if len(lis) == 0:
            print(count)  # first page that returned no images
            break
        dump_imgs(lis)
        print(f'Page {count} done')
        count += 1

if __name__ == '__main__':
    run()

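Because each download spends most of its time waiting on the network, running the downloads in a thread pool lets many images be fetched at once, which is much faster than fetching them one by one.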
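To illustrate the speedup without touching a real site, here is a rough sketch that replaces the network call with a short sleep. The `fake_download` function and the file names are made up for the example, and the exact timings will vary by machine.

```python
import time
from concurrent import futures

def fake_download(url):
    # Hypothetical stand-in for requests.get: just waits like a slow network call
    time.sleep(0.1)
    return url

urls = [f'img_{i}.jpg' for i in range(8)]  # made-up names, not real URLs

# One at a time: total time is roughly the sum of all the waits
start = time.perf_counter()
sequential = [fake_download(u) for u in urls]
sequential_time = time.perf_counter() - start

# In a thread pool: the waits overlap
start = time.perf_counter()
with futures.ThreadPoolExecutor(max_workers=8) as ex:
    # map runs the calls concurrently and yields results in input order
    threaded = list(ex.map(fake_download, urls))
threaded_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s')
```

The gain comes from overlapping waits, so it applies to I/O-bound work like downloads; CPU-bound work would not speed up the same way with threads.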

Posted 2019-06-23 15:29 by 小小咸鱼YwY
