Scraping Dynamically Loaded Pages with Selenium
Task: scrape all the images of Chapter 7 of 《白圣女与黑牧师》 from the Guoguo manga site (https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html).
(Update: the site has since gone offline, but the technique described here still works.)
First, you can simply scroll to the bottom of the page by hand, copy the page's DOM nodes from the F12 developer tools as text, and extract the download targets from that text. The code below does this with BeautifulSoup:
import os
import urllib.request
import re
from bs4 import BeautifulSoup as bs
import random as rd
import time


def get_imgs(text):
    # Parse the copied DOM fragment and pull out every <img> src URL.
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src="([^"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("This page contains " + str(len(source)) + " comic images")
    count = 0
    for item in source:
        count = count + 1
        if count < 11:  # skip the first 10 images
            continue
        print("Image " + str(count))
        p = str(count) if count > 9 else "0" + str(count)
        name = "c05-p" + p
        download(item, "manga-白圣女与黑牧师", "c06", name)
        time.sleep(rd.randint(2, 4))  # pause between downloads
    print("--------- scraping finished --------")
    return len(source)


def download(realUrl, dir, subDir, name):
    # Save one image as <subDir>/<name>.jpg (the dir argument is not used here).
    path = subDir + '/'
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("Runtime exception:", e)
    finally:
        pass


# DOM fragment copied by hand from the F12 element panel (middle portion omitted).
text = """
<div id="images"><img src="https://n1a.zhjyu.net/images/p/c9/6b/888e04ccb13ae1d2d075142f4fe4.jpg" data-index="1" style="display: 略... /190aaf19e561363d805e437485da.jpg" data-index="40" style="display: inline;"><p class="img_info">(40/40)</p></div>
"""
get_imgs(text)
This works, but it is still a manual process. An automated way to obtain the page document is as follows.
Because the page is loaded dynamically, the complete document cannot be fetched directly from the URL, so we have to simulate scrolling the page down in a browser, using the selenium library.
The first step is to use selenium to open a browser and load the target page.
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
This raises the following error:
Traceback (most recent call last):
  File "guoguomh.py", line 65, in <module>
    get_dynamic_text_through_webdriver()
  File "guoguomh.py", line 15, in get_dynamic_text_through_webdriver
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chrome.exe")
...... (omitted)
selenium.common.exceptions.WebDriverException: Message: Service C:/Program Files (x86)/Google/Chrome/Application/chrome.exe unexpectedly exited. Status code was: 0
The executable_path should point to the ChromeDriver executable rather than to the browser itself, so the code is modified as follows:
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
It still fails:
...... (omitted)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 92
Current browser version is 114.0.5735.199 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
This is most likely a ChromeDriver version mismatch. Following the browser version reported in the error message, download the matching driver from https://chromedriver.chromium.org/downloads (version 114 in this case), unzip it, and put chromedriver.exe in the same directory as the Chrome browser's chrome.exe.
It now launches successfully.
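If you want to confirm that the driver and browser versions really do match, a quick check like the following can print both. This is only a sketch: it assumes the same chromedriver.exe path used above and that the session starts.

from selenium import webdriver

# Start a session and read its capabilities to see which versions are in play.
driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
caps = driver.capabilities
print("Browser version:", caps.get("browserVersion", caps.get("version")))
print("ChromeDriver version:", caps.get("chrome", {}).get("chromedriverVersion"))
driver.quit()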
What follows is the classic scraping routine.
Here I ran into a problem with this attempt:
def get_dynamic_text_through_webdriver():
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get("https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html")  # Chapter 7
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")  # intended to scroll the #images container
    time.sleep(6)  # wait 6 seconds for the content to load
    print("Timer finished, scraping the content")
    html_source = driver.page_source
    print(html_source)
The page source obtained this way is not well-formed, so driver.page_source should not be used to get the page document. Instead, either execute JavaScript that returns the document, or query the target element directly through the driver's XPath lookup.
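As a minimal sketch of those two alternatives (it assumes the driver session opened above and the page's #images container):

# Option 1: let the browser return the container's HTML via JavaScript.
html_via_js = driver.execute_script("return document.getElementById('images').outerHTML;")

# Option 2: locate the element with XPath and read its outerHTML attribute.
html_via_xpath = driver.find_element_by_xpath('//*[@id="images"]').get_attribute("outerHTML")

The full script below uses the second option.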
The full code is below (it still has flaws: a few images download corrupted, so a try-except is used to skip them and keep the program from aborting):
# -*- coding: utf-8 -*-
# @Author : Zhao Ke
# @Time : 2023-07-18 11:58
import os
import re
import time
import random as rd
import urllib.request
from bs4 import BeautifulSoup as bs
from selenium import webdriver


def get_dynamic_text_through_webdriver(page_url):
    driver = webdriver.Chrome(executable_path="C:/Program Files (x86)/Google/Chrome/Application/chromedriver.exe")
    try:
        driver.get(page_url)  # load the chapter page (Chapter 7 here)
    except Exception as e:
        print(e)
    driver.maximize_window()
    driver.execute_script("document.getElementById('images').scrollTo=12000")  # intended to scroll the #images container
    time.sleep(2)  # fixed wait for the first images to load
    print("Timer finished, scraping the content")
    # Initial values: the element's text ends with a counter like "(05/40)".
    html_source = driver.find_element_by_xpath('//*[@id="images"]')
    indices = html_source.text[-7:]  # text inside the target element
    curr = int(indices[1:3])
    maxi = int(indices[4:6])
    # Keep scrolling until the counter shows the last image has loaded.
    while curr < maxi:
        driver.execute_script("window.scrollBy(0, 10000)")
        html_source = driver.find_element_by_xpath('//*[@id="images"]')
        indices = html_source.text[-7:]
        curr = int(indices[1:3])
        maxi = int(indices[4:6])
        time.sleep(0.5)
    return html_source.get_attribute("outerHTML")  # HTML of the #images element


def get_imgs(text, chapter):
    # Extract every image URL from the element's HTML and download them in order.
    soup = bs(text, features="lxml")
    imgss = soup.find_all("img")
    print(imgss)
    pattern = re.compile('src="([^"]+)')
    source = pattern.findall(str(imgss))
    print("===========")
    for pic in source:
        print(pic)
    print("===========")
    print("This page contains " + str(len(source)) + " comic images")
    count = 0
    for item in source:
        count = count + 1
        if count < 1:  # raise this threshold to resume a partly finished chapter
            continue
        print("Image " + str(count))
        p = str(count) if count > 9 else "0" + str(count)
        name = "c" + chapter + "-p" + p
        download(item, "baishengnvyuheimushi", "c" + chapter, name)
        time.sleep(rd.randint(1, 3))
    print("--------- scraping finished --------")
    return len(source)


def download(realUrl, dir, subDir, name):
    # Save one image as <path>/<name>.jpg; failed downloads are reported and skipped.
    path = "D:/kingz/ANIME/" + dir + '/' + subDir
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        urllib.request.urlretrieve(realUrl, '{0}/{1}.jpg'.format(path, name))
    except Exception as e:
        print("Runtime exception:", e)
        print("Skipping this page", name)
    finally:
        pass


chapter = "07"  # chapter label
url = "https://www.guoguomh.com/manhua/baishengnvyuheimushi/496885.html"  # page to scrape
text = get_dynamic_text_through_webdriver(url)
get_imgs(text, chapter)
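One possible, unverified cause of the corrupted downloads mentioned above is that the image host serves truncated or degraded content to bare urllib requests. A hypothetical variant of download() that sends browser-like headers is sketched below; the header values and the Referer are assumptions, not something confirmed against this site.

def download_with_headers(realUrl, path, name):
    # Hypothetical variant of download(): send a User-Agent and Referer, in case
    # the image host rejects or degrades plain urllib requests (an unverified guess).
    req = urllib.request.Request(realUrl, headers={
        "User-Agent": "Mozilla/5.0",             # assumed value
        "Referer": "https://www.guoguomh.com/",  # assumed referer
    })
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            with open('{0}/{1}.jpg'.format(path, name), "wb") as f:
                f.write(resp.read())
    except Exception as e:
        print("Runtime exception:", e)
        print("Skipping this page", name)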
Room for improvement:
1 Having to open a browser, load the page, and scroll to the bottom before scraping can even start is inefficient. Could the dynamically loaded content be fetched through the site's asynchronous (Ajax) requests instead, without webdriver + selenium?
2 Investigate why some images come down corrupted; is it a network issue?
3 The program above only fetches a single chapter. Download the whole series from the parent page by following its table of contents, either one chapter at a time or several at once with multiple threads. A rough sketch follows this list.
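Here is one way item 3 could look. The index-page URL, the relative-href assumption, and the "#chapter-list-1 a" selector are all guesses (the site is gone, so they cannot be checked); get_dynamic_text_through_webdriver and get_imgs are reused from the full script above, and a small thread pool runs a few chapters in parallel, each in its own browser window.

import urllib.request
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup as bs


def list_chapter_urls(index_url):
    # Hypothetical: pull every chapter link out of the series index page.
    # Both the URL pattern and the CSS selector below are assumptions.
    html = urllib.request.urlopen(index_url).read().decode("utf-8")
    soup = bs(html, features="lxml")
    return ["https://www.guoguomh.com" + a["href"]
            for a in soup.select("#chapter-list-1 a")]


def scrape_chapter(job):
    chapter, page_url = job
    text = get_dynamic_text_through_webdriver(page_url)  # function from the script above
    get_imgs(text, chapter)


chapter_urls = list_chapter_urls("https://www.guoguomh.com/manhua/baishengnvyuheimushi/")
jobs = [(str(i + 1).zfill(2), u) for i, u in enumerate(chapter_urls)]
with ThreadPoolExecutor(max_workers=3) as pool:  # a few chapters at a time
    list(pool.map(scrape_chapter, jobs))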
end.