Scraping Dynamically Loaded Pages with Selenium

Task: scrape all the images of chapter 7 of 《白圣女与黑牧师》 on the Guoguo Manhua site (https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html).

(Latest update: the site has since gone offline, but the technique described here still works.)

First, you can simply scroll to the bottom of the page by hand, copy the page's DOM nodes from the F12 developer tools as text, and then extract the download targets from that. The code is as follows, using the BeautifulSoup library:

import os
import urllib.request
import re
from bs4 import BeautifulSoup as bs
import random as rd
import time
def get_imgs(text):
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src=\"([^\"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("本页一共有" + str(len(source)) + "页漫画")
    count = 0

    for item in source:
        count = count + 1
        if count < 11:
            continue
        print('' + str(count) + '')
        p = str(count) if count > 9 else "0"+str(count)
        name = "c05-p"+p
        download(item, "manga-白圣女与黑牧师", "c06", name)
        time.sleep(rd.randint(2, 4))

    print("---------爬取结束--------")
    return len(source)

def download(realUrl, dir, subDir, name):
    path = subDir + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("发生运行时异常:", e)
    finally:
        pass

text = """
<div id="images"><img src="https://n1a.zhjyu.net/images/p/c9/6b/888e04ccb13ae1d2d075142f4fe4.jpg" data-index="1" style="display: 
(omitted) ...
/190aaf19e561363d805e437485da.jpg" data-index="40" style="display: inline;"><p class="img_info">(40/40)</p></div>
"""

get_imgs(text)

This works, but it is still a manual process. The automated way to obtain the page document is as follows.

Because the page is loaded dynamically, the full page document cannot be fetched directly from the URL, so we have to simulate the browser's scroll-down action, using the selenium library.

The first step is to use selenium to launch the browser and open the target page.

def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # 第七话

This raised the following error:

Traceback (most recent call last):
  File "guoguomh.py", line 65, in <module>
    get_dynamic_text_through_webdriver()
  File "guoguomh.py", line 15, in get_dynamic_text_through_webdriver
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
...... (omitted)
selenium.common.exceptions.WebDriverException: Message: Service C:/Program Files (x86)/Google/Chrome/Application/chrome.exe unexpectedly exited. Status code was: 0

The code was changed to point executable_path at chromedriver.exe instead of chrome.exe:

def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # 第七话

It still failed:

...... (omitted)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 92 Current browser version is 114.0.5735.199 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe

This is most likely a chromedriver version mismatch. Following the browser version reported in the error message, download the matching driver from https://chromedriver.chromium.org/downloads (version 114 in this case), unzip it, and put chromedriver.exe in the same directory as the Chrome browser's chrome.exe.

It now starts successfully.
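As an aside (not part of the original setup), the third-party webdriver-manager package can download a chromedriver matching the installed Chrome automatically, which avoids the manual version matching. A minimal sketch, assuming webdriver-manager has been installed with pip:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# ChromeDriverManager().install() fetches a driver matching the local Chrome
# and returns its path, which is then passed to webdriver.Chrome
driver = webdriver.Chrome(executable_path=ChromeDriverManager().install())
driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")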

What follows is the classic scraping workflow.

The next problem encountered was as follows:

def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # 第七话
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")
    time.sleep(6)  # wait 6 seconds for the content to load
    print("Timer finished, scraping content")
    html_source = driver.page_source
    print(html_source)

The page source obtained this way is not well-formed, so driver.page_source should not be used to get the page document; instead, execute JavaScript that returns the document, or use the driver's XPath lookup directly.
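Both alternatives are one-liners; a minimal sketch for reference (the complete code below uses the XPath variant):

# Option 1: have the browser serialize the target element via JavaScript
html = driver.execute_script("return document.getElementById('images').outerHTML")

# Option 2: locate the element by XPath and read its outerHTML attribute
html = driver.find_element_by_xpath('//*[@id="images"]').get_attribute("outerHTML")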

The complete code is below (it still has a flaw: a few images are downloaded corrupted, so a try-except is used to skip them and keep the program from aborting):

# -*- coding: utf-8 -*-
# @Author : Zhao Ke
# @Time : 2023-07-18 11:58
import os
import re
import time
import random as rd
import urllib.request
from bs4 import BeautifulSoup as bs
from selenium import webdriver


def get_dynamic_text_through_webdriver(page_url):
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get(page_url)  # chapter 7
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")
    time.sleep(2)  # 定时等待加载
    print("定时结束,爬取内容")

    # initial values
    html_source = driver.find_element_by_xpath('//*[@id="images"]')
    indices = html_source.text[-7:]  # last 7 characters of the element's text, e.g. "(01/40)"
    curr = int(indices[1:3])
    maxi = int(indices[4:6])
    # keep scrolling until the last image has loaded
    while curr < maxi:
        driver.execute_script("window.scrollBy(0, 10000)")
        html_source = driver.find_element_by_xpath('//*[@id="images"]')
        indices = html_source.text[-7:]  # re-read the text of the target element
        curr = int(indices[1:3])
        maxi = int(indices[4:6])
        time.sleep(0.5)
    return html_source.get_attribute("outerHTML")  # HTML of the target element


def get_imgs(text, chapter):
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src=\"([^\"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("本页一共有" + str(len(source)) + "页漫画")
    count = 0

    for item in source:
        count = count + 1
        if count < 1:
            continue
        print('' + str(count) + '')
        p = str(count) if count > 9 else "0"+str(count)
        name = "c"+chapter+"-p"+p
        download(item, "baishengnvyuheimushi", "c"+chapter, name)
        time.sleep(rd.randint(1, 3))
    print("---------爬取结束--------")
    return len(source)


def download(realUrl, dir, subDir, name):
    path = "D:/kingz/ANIME/"+dir+'/'+subDir
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("发生运行时异常:", e)
        print("跳过该页", name)
    finally:
        pass


chapter = "07"  # 章节名
url = "https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html"  # 爬取页面
text = get_dynamic_text_through_webdriver(url)
get_imgs(text, chapter)

Room for improvement:

1. Having to open a browser, load the page, and scroll to the bottom before scraping can even start is inefficient. Could the dynamically loaded content be fetched by going after its asynchronous (Ajax) requests directly, without webdriver + selenium?

2. Investigate why some images come down corrupted; is it a network issue?

3. The program above only fetches a single chapter. Download every chapter from the parent page's table of contents, either one at a time or with multiple threads in parallel (see the sketch below).
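For point 3, here is a minimal sketch of parallel downloads using the standard-library thread pool; the URL list and file names are placeholders rather than values taken from the code above:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(task):
    url, name = task
    try:
        urllib.request.urlretrieve(url, name + ".jpg")
    except Exception as e:
        print("Skipping", name, ":", e)

# placeholder list of (image_url, file_name) pairs gathered from each chapter page
tasks = [("https://example.com/p01.jpg", "c07-p01"),
         ("https://example.com/p02.jpg", "c07-p02")]
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.map(fetch, tasks)  # downloads run in up to 4 threads at once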

end.

posted @ 2023-07-18 18:03  倦鸟已归时