Scraping Dynamically Loaded Pages with Selenium
Task: scrape all the images of Chapter 7 of 《白圣女与黑牧师》 from the Guoguo manga site (https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html).
(Update: the site has since gone offline, but the technique described here still works.)
First, you can simply scroll to the bottom of the page by hand, copy the page's DOM nodes from the F12 developer tools as text, and extract the download targets from that text. The code below does this with BeautifulSoup:
import os
import urllib.request
import re
from bs4 import BeautifulSoup as bs
import random as rd
import time


def get_imgs(text):
    # Parse the copied DOM fragment and pull out every <img> src URL.
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src="([^"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("This page contains " + str(len(source)) + " comic images")
    count = 0
    for item in source:
        count = count + 1
        if count < 11:  # skip the first 10 images
            continue
        print("Image " + str(count))
        p = str(count) if count > 9 else "0" + str(count)
        name = "c05-p" + p
        download(item, "manga-白圣女与黑牧师", "c06", name)
        time.sleep(rd.randint(2, 4))  # pause between downloads
    print("--------- scraping finished --------")
    return len(source)


def download(realUrl, dir, subDir, name):
    # Save one image as <subDir>/<name>.jpg (the dir argument is not used here).
    path = subDir + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("Runtime exception:", e)
    finally:
        pass


# DOM fragment copied by hand from the F12 element panel (middle portion omitted).
text = """
<div id="images"><img src="https://n1a.zhjyu.net/images/p/c9/6b/888e04ccb13ae1d2d075142f4fe4.jpg" data-index="1" style="display: 略... /190aaf19e561363d805e437485da.jpg" data-index="40" style="display: inline;"><p class="img_info">(40/40)</p></div>
"""
get_imgs(text)
This works, but it is still a manual process. An automated way to obtain the page document is as follows.
Because the page is loaded dynamically, the complete document cannot be fetched directly from the URL, so we have to simulate scrolling the page down in a browser, using the selenium library.
The first step is to use selenium to open a browser and load the target page.
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
This raises the following error:
Traceback (most recent call last):
  File "guoguomh.py", line 65, in <module>
    get_dynamic_text_through_webdriver()
  File "guoguomh.py", line 15, in get_dynamic_text_through_webdriver
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
...... (omitted)
selenium.common.exceptions.WebDriverException: Message: Service C:/Program Files (x86)/Google/Chrome/Application/chrome.exe unexpectedly exited. Status code was: 0
The executable_path should point to the ChromeDriver executable rather than to the browser itself, so the code is modified as follows:
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
It still fails:
...... (omitted)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 92
Current browser version is 114.0.5735.199 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
This is most likely a ChromeDriver version mismatch. Following the browser version reported in the error message, download the matching driver from https://chromedriver.chromium.org/downloads (version 114 in this case), unzip it, and put chromedriver.exe in the same directory as the Chrome browser's chrome.exe.
It now launches successfully.
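If you want to confirm that the driver and browser versions really do match, a quick check like the following can print both. This is only a sketch: it assumes the same chromedriver.exe path used above and that the session starts.

from selenium import webdriver

# Start a session and read its capabilities to see which versions are in play.
driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
caps = driver.capabilities
print("Browser version:", caps.get("browserVersion", caps.get("version")))
print("ChromeDriver version:", caps.get("chrome", {}).get("chromedriverVersion"))
driver.quit()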
What follows is the classic scraping routine.
Here I ran into a problem with this attempt:
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")  # intended to scroll the #images container
    time.sleep(6)  # wait 6 seconds for the content to load
    print("Timer finished, scraping the content")
    html_source = driver.page_source
    print(html_source)
The page source obtained this way is not well-formed, so driver.page_source should not be used to get the page document. Instead, either execute JavaScript that returns the document, or query the target element directly through the driver's XPath lookup.
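As a minimal sketch of those two alternatives (it assumes the driver session opened above and the page's #images container):

# Option 1: let the browser return the container's HTML via JavaScript.
html_via_js = driver.execute_script("return document.getElementById('images').outerHTML;")

# Option 2: locate the element with XPath and read its outerHTML attribute.
html_via_xpath = driver.find_element_by_xpath('//*[@id="images"]').get_attribute("outerHTML")

The full script below uses the second option.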
The full code is below (it still has flaws: a few images download corrupted, so a try-except is used to skip them and keep the program from aborting):
# -*- coding: utf-8 -*-
# @Author : Zhao Ke
# @Time : 2023-07-18 11:58
import os
import re
import time
import random as rd
import urllib.request
from bs4 import BeautifulSoup as bs
from selenium import webdriver


def get_dynamic_text_through_webdriver(page_url):
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get(page_url)  # load the chapter page (Chapter 7 here)
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")  # intended to scroll the #images container
    time.sleep(2)  # fixed wait for the first images to load
    print("Timer finished, scraping the content")
    # Initial values: the element's text ends with a counter like "(05/40)".
    html_source = driver.find_element_by_xpath('//*[@id="images"]')
    indices = html_source.text[-7:]  # text inside the target element
    curr = int(indices[1:3])
    maxi = int(indices[4:6])
    # Keep scrolling until the counter shows the last image has loaded.
    while curr < maxi:
        driver.execute_script("window.scrollBy(0, 10000)")
        html_source = driver.find_element_by_xpath('//*[@id="images"]')
        indices = html_source.text[-7:]
        curr = int(indices[1:3])
        maxi = int(indices[4:6])
        time.sleep(0.5)
    return html_source.get_attribute("outerHTML")  # HTML of the #images element


def get_imgs(text, chapter):
    # Extract every image URL from the element's HTML and download them in order.
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src="([^"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("This page contains " + str(len(source)) + " comic images")
    count = 0
    for item in source:
        count = count + 1
        if count < 1:  # raise this threshold to resume a partly finished chapter
            continue
        print("Image " + str(count))
        p = str(count) if count > 9 else "0" + str(count)
        name = "c" + chapter + "-p" + p
        download(item, "baishengnvyuheimushi", "c" + chapter, name)
        time.sleep(rd.randint(1, 3))
    print("--------- scraping finished --------")
    return len(source)


def download(realUrl, dir, subDir, name):
    # Save one image as <path>/<name>.jpg; failed downloads are reported and skipped.
    path = "D:/kingz/ANIME/" + dir + '/' + subDir
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("Runtime exception:", e)
        print("Skipping this page", name)
    finally:
        pass


chapter = "07"  # chapter label
url = "https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html"  # page to scrape
text = get_dynamic_text_through_webdriver(url)
get_imgs(text, chapter)
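One possible, unverified cause of the corrupted downloads mentioned above is that the image host serves truncated or degraded content to bare urllib requests. A hypothetical variant of download() that sends browser-like headers is sketched below; the header values and the Referer are assumptions, not something confirmed against this site.

def download_with_headers(realUrl, path, name):
    # Hypothetical variant of download(): send a User-Agent and Referer, in case
    # the image host rejects or degrades plain urllib requests (an unverified guess).
    req = urllib.request.Request(realUrl, headers={
        "User-Agent": "Mozilla/5.0",             # assumed value
        "Referer": "https://www.guoguomh.com/",  # assumed referer
    })
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            with open('{0}/{1}.jpg'.format(path, name), "wb") as f:
                f.write(resp.read())
    except Exception as e:
        print("Runtime exception:", e)
        print("Skipping this page", name)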
Room for improvement:
1 Having to open a browser, load the page, and scroll to the bottom before scraping can even start is inefficient. Could the dynamically loaded content be fetched through the site's asynchronous (Ajax) requests instead, without webdriver + selenium?
2 Investigate why some images come down corrupted; is it a network issue?
3 The program above only fetches a single chapter. Download the whole series from the parent page by following its table of contents, either one chapter at a time or several at once with multiple threads. A rough sketch follows this list.
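Here is one way item 3 could look. The index-page URL, the relative-href assumption, and the "#chapter-list-1 a" selector are all guesses (the site is gone, so they cannot be checked); get_dynamic_text_through_webdriver and get_imgs are reused from the full script above, and a small thread pool runs a few chapters in parallel, each in its own browser window.

import urllib.request
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup as bs


def list_chapter_urls(index_url):
    # Hypothetical: pull every chapter link out of the series index page.
    # Both the URL pattern and the CSS selector below are assumptions.
    html = urllib.request.urlopen(index_url).read().decode("utf-8")
    soup = bs(html, features="lxml")
    return ["https://www.guoguomh.com" + a["href"]
            for a in soup.select("#chapter-list-1 a")]


def scrape_chapter(job):
    chapter, page_url = job
    text = get_dynamic_text_through_webdriver(page_url)  # function from the script above
    get_imgs(text, chapter)


chapter_urls = list_chapter_urls("https://www.guoguomh.com/manhua/baishengnvyuheimushi/")
jobs = [(str(i + 1).zfill(2), u) for i, u in enumerate(chapter_urls)]
with ThreadPoolExecutor(max_workers=3) as pool:  # a few chapters at a time
    list(pool.map(scrape_chapter, jobs))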
end.