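High-Performance Asynchronous Crawlers

Note: this article is for study and discussion only and must not be used for any other purpose. It can be taken down immediately if anything in it is inappropriate.

Goal: use asynchronous techniques in a crawler to fetch data with high throughput. Network requests are I/O operations, so they can be sent asynchronously, typically with threads or with coroutines.

Reference article: https://www.cnblogs.com/Blogwj123/p/15893616.html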

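1. Asynchronous approaches

1. Multithreading and multiprocessing (not recommended):
Pros: each blocking operation can run in its own thread or process, so it executes asynchronously.
Cons: threads and processes cannot be created without limit.
2. Thread pools and process pools (use in moderation):
Pros: reusing workers reduces how often the system creates and destroys threads or processes, which lowers overhead.
Cons: the number of threads or processes in the pool has an upper bound.
3. Single thread + asynchronous coroutines (recommended):
event_loop: the event loop, in effect an endless loop. Functions are registered on it, and
the loop runs each one when its condition is met.

coroutine: a coroutine object. It can be registered on the event loop, which then calls it.
A function defined with the async keyword does not run when called; it returns
a coroutine object instead.

task: a further wrapper around a coroutine object that also tracks the task's state.

future: represents a task that will run later or has not run yet. A task is a kind of future (in asyncio, Task is a subclass of Future).

async: defines a coroutine.

await: suspends the current coroutine at a blocking call until that call completes.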
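
The terms above can be seen together in a small asyncio sketch. It uses only the standard library, and asyncio.sleep stands in for a network request; the names fetch and main are illustrative, not part of the original article.

```python
import asyncio
import inspect


async def fetch(name, delay):
    # Stand-in for a network request: await hands control back to the
    # event loop while this coroutine is "waiting on I/O".
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # Calling an async function does not run it; it returns a coroutine object.
    coro = fetch("a", 0.01)
    is_coro = inspect.iscoroutine(coro)

    # Wrapping a coroutine in a task schedules it on the running event loop.
    task_a = asyncio.create_task(coro)
    task_b = asyncio.create_task(fetch("b", 0.01))

    # A Task is a Future subclass, so it exposes state such as done().
    is_future = isinstance(task_a, asyncio.Future)
    done_before = task_a.done()

    # gather waits for both tasks and returns their results in order.
    results = await asyncio.gather(task_a, task_b)
    return is_coro, is_future, done_before, task_a.done(), results


# asyncio.run creates the event loop, runs main() on it, then closes it.
summary = asyncio.run(main())
print(summary)
```

Both tasks wait at the same time on a single thread, which is why this model suits I/O-bound crawling.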

2. Using a thread pool

This section combines a thread pool with a crawler.

import os
import re
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

def index():
    url = 'https://v.wuaishare.cn/liaofansixun.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }
    page_text = requests.get(url=url,headers=headers).text
    tree = etree.HTML(page_text)
    p_list = tree.xpath('//*[@id="post-1569"]/div/div[1]/p')[1:]
    detail_list = []
    for p in p_list:
        title = p.xpath('./a/text()')[0]
        href = p.xpath('./a/@href')[0]
        detail_list.append((title, href))
    return detail_list

def download(detail):
    name, href = detail
    response = requests.get(url=href).text
    tree = etree.HTML(response)
    # Escape the dots so they match literally; the captured slug is the post id.
    id_xpath = re.findall(r"\.cn/(.*?)\.html", href, re.S)[0]
    strxpath = '//*[@id="post-'+id_xpath+'"]/div/div[2]/p'
    p_list = tree.xpath(strxpath)
    word_list = []
    for p in p_list:
        text = p.xpath('./text()')
        if text:  # skip paragraphs with no direct text node
            word_list.append(text[0])
    response_word = '\n\n'.join(word_list)
    response_word = name + response_word
    # exist_ok avoids an error when several threads create the folder at once.
    os.makedirs("liaofan", exist_ok=True)
    file_path = os.path.join("liaofan",name+'.txt')
    with open(file_path,mode='w',encoding='utf-8') as fp:
        fp.write(response_word)
    print(name,"download complete...")
    return name

def download_wenyanwen(detail):
    # The classical-Chinese pages are handled exactly like the vernacular
    # ones, so this reuses download() instead of repeating its body.
    return download(detail)

if __name__ == '__main__':
    detail = index()
    # Create a thread pool with 10 workers; the with block waits for all tasks.
    with ThreadPoolExecutor(10) as pool:
        for i in detail[:4]:
            pool.submit(download,i) # vernacular translation
        for i in detail[4:]:
            pool.submit(download_wenyanwen,i) # classical original text

A process pool works the same way, except that the tasks are submitted to processes for execution.
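
Because both pool types share the same Executor interface, the submit-and-collect logic can be written once. The sketch below is a minimal illustration under that assumption; run_in_pool and fake_download are hypothetical names, and fake_download stands in for a real download function so the example needs no network access. Items must be hashable here because they are used as dictionary keys.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_pool(executor_cls, func, items, max_workers=4):
    """Submit func(item) for each item and collect results or errors."""
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as pool:
        # Map each future back to the item it was created for.
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # one failed task should not hide the others
                errors[item] = exc
    return results, errors


def fake_download(name):
    # Hypothetical stand-in for download(); fails on one input on purpose.
    if name == "bad":
        raise ValueError("simulated failure")
    return name.upper()


results, errors = run_in_pool(ThreadPoolExecutor, fake_download, ["a", "b", "bad"])
print(results, list(errors))
```

Passing ProcessPoolExecutor instead of ThreadPoolExecutor should work the same way, provided the worker function is defined at module top level and the call sits under an if __name__ == '__main__': guard. Reading future.result() also surfaces exceptions that a bare pool.submit() call would silently drop.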

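3. Using coroutines

For more on coroutines, see the linked reference.

3.1 Overview of coroutine-based crawling

Install the module

pip install aiohttp  # a network request module with async support

It takes over the role that requests plays in synchronous code.

The overall structure, as pseudocode: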

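async def get_request(url):
    # Create a session object; requests are sent through it.
    with aiohttp.ClientSession() as sess:
        # Call get to send the request; it returns a response object.
        # get/post(url, headers, params/data, proxy="http://ip:port")
        with sess.get(url=url) as response: # the context manager yields the response object
            # Get the response body as a string.
            page_text = response.text() # text() for text data, json() for JSON data
            return page_text # return the data

Put the await keyword before each blocking operation.
Put the async keyword before each with.
Use text() for text data, json() for JSON data, and read() for binary data.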

import aiohttp

async def get_request(url):
    # Create a session object; requests are sent through it.
    async with aiohttp.ClientSession() as sess:
        # Call get to send the request; it returns a response object.
        # get/post(url, headers, params/data, proxy="http://ip:port")
        async with sess.get(url=url) as response:
            # text() returns the body as a string
            # read() returns the body as bytes
            page_text = await response.text()
            return page_text

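Parsing data in a multi-task crawler
- Do the parsing in a callback attached to the task object.
- In a multi-task design the fetch is wrapped in a coroutine function, so make sure the request has finished before parsing its data.
The usual routine for crawling with multi-task asynchronous coroutines:
- First use the requests module to collect the URLs to fetch into a list (synchronous).
- Then use aiohttp to request and parse the URLs in that list (asynchronous).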
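
The callback pattern described above can be sketched with the standard library alone. Here fetch_page simulates a request with asyncio.sleep instead of aiohttp, and the names fetch_page, parse, and crawl are illustrative assumptions rather than the article's own code.

```python
import asyncio


async def fetch_page(url):
    # Simulated request; a real crawler would await an aiohttp response here.
    await asyncio.sleep(0.01)
    return f"<title>{url}</title>"


parsed = []


def parse(task):
    # The event loop calls this only after the task has finished,
    # so task.result() already holds the fetched page.
    html = task.result()
    parsed.append(html[len("<title>"):-len("</title>")])


async def crawl(urls):
    tasks = []
    for url in urls:
        task = asyncio.create_task(fetch_page(url))
        task.add_done_callback(parse)  # bind the parser to the task object
        tasks.append(task)
    await asyncio.gather(*tasks)


urls = ["page-1", "page-2", "page-3"]  # in practice, collected with requests first
asyncio.run(crawl(urls))
print(sorted(parsed))
```

The callback receives the task itself, not the result, which is why parse calls task.result().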

3.2 Hands-on: scraping 站长素材 (sc.chinaz.com)

import os
import aiohttp
import asyncio
import requests
import time
from lxml import etree

def index():
    url = "https://sc.chinaz.com/tupian/fengjingtupian.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }
    image_list = []
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    total_page = tree.xpath('/html/body/div[2]/div[6]/div[1]/a[8]/b/text()')[0]
    for page in range(1, int(total_page) + 1):
        if page == 1:
            new_url = url
        else:
            new_url = 'https://sc.chinaz.com/tupian/fengjingtupian_' + str(page) + '.html'
        response = requests.get(url=new_url, headers=headers).text
        # Parse the page just fetched; reusing the first page's tree would repeat its images.
        tree = etree.HTML(response)
        div_list = tree.xpath('//*[@id="container"]/div')
        for div in div_list:
            image_title = div.xpath('./div/a/img/@alt')[0].encode('iso-8859-1').decode('utf-8')
            image_path = "https://" + div.xpath('./div/a/img/@src')[0].encode('iso-8859-1').decode('utf-8')
            image_list.append((image_title, image_path))
        print("Page", page, "collected...")
    return image_list # list of (title, url) tuples
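# Asynchronous coroutine that downloads one image
async def fetch(session,image_item):
    url = image_item[1]
    title = image_item[0]
    async with session.get(url) as response:
        content = await response.read()
        # Save inside images/, keeping the URL's extension (.jpg is an assumed fallback).
        os.makedirs("images", exist_ok=True)
        ext = os.path.splitext(url)[1] or ".jpg"
        file_name = os.path.join("images", title + ext)
        with open(file_name,mode='wb') as fp:
            fp.write(content)
        print(file_name,", download complete....")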


async def get_request(image):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch(session,image_item)) for image_item in image] # one task per image
        await asyncio.wait(tasks)

if __name__ == '__main__':
    start = time.time()
    image_list = index()
    print(image_list)
    asyncio.run(get_request(image_list))
    print("Total time:",time.time()-start)

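Your IP may get blocked, causing the run to fail; in that case, supply a proxy IP at the relevant place (for example, the proxy argument of session.get). Screenshots of the run are omitted here.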
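
One way to act on a blocked IP is to retry the request through a list of proxies. The sketch below shows only the retry logic and makes several assumptions: fetch_with_proxies and fake_fetch are hypothetical names, the proxy addresses are placeholders, and fake_fetch simulates a request so the example runs without aiohttp or network access.

```python
import asyncio


async def fetch_with_proxies(fetch, url, proxies):
    """Try each proxy in turn until one request succeeds."""
    last_error = None
    for proxy in proxies:
        try:
            return await fetch(url, proxy)
        except OSError as exc:  # connection-level failure; try the next proxy
            last_error = exc
    raise RuntimeError(f"all proxies failed for {url}") from last_error


async def fake_fetch(url, proxy):
    # Hypothetical stand-in: with aiohttp this would be
    # session.get(url, proxy=proxy) followed by response.read().
    await asyncio.sleep(0)
    if proxy == "http://10.0.0.1:8080":
        raise ConnectionError("proxy refused the connection")
    return f"{url} via {proxy}"


proxies = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # placeholder addresses
result = asyncio.run(
    fetch_with_proxies(fake_fetch, "https://example.com/a.jpg", proxies)
)
print(result)
```

With a real aiohttp session, it would likely make sense to catch aiohttp.ClientError as well, since not every proxy failure surfaces as an OSError.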

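No images were lazy-loaded during this crawl. If a page does lazy-load its images, use an XPath that reads the attribute holding the image URL before the image is loaded.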
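
As a rough illustration of the lazy-loading case, the sketch below reads a pre-load attribute and falls back to src. The markup and the attribute name data-original are assumptions made for the example; the real attribute has to be checked in the page source. It uses the standard library's ElementTree on a well-formed snippet so it runs without lxml; with lxml the equivalent lookup would be an XPath such as ./div/a/img/@data-original.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real URL sits in data-original until the image
# scrolls into view, while src holds a placeholder.
snippet = """
<div id="container">
  <div><a><img alt="lake" src="placeholder.gif" data-original="//img.example.com/lake.jpg"/></a></div>
  <div><a><img alt="hill" src="//img.example.com/hill.jpg"/></a></div>
</div>
""".strip()


def image_url(img, lazy_attr="data-original"):
    # Prefer the pre-load attribute; fall back to src when it is absent.
    return img.get(lazy_attr) or img.get("src")


root = ET.fromstring(snippet)
urls = [image_url(img) for img in root.iter("img")]
print(urls)
```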

Keep working at it, and success will come.

Posted 2022-07-24 14:31 by 紫青宝剑
