Python Web Scraping Notes 02 (Scraping Web Page Data with selenium)


1.1, Libraries used

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

1.2, Workflow

# 1. Open the browser (this form shows the browser window)
driver = webdriver.Chrome()
# Alternative: headless mode, which does not show the browser window
# option = webdriver.ChromeOptions()
# option.add_argument("--headless")
# driver = webdriver.Chrome(options=option)
# 2. Open the page by URL
driver.get('http://xzqh.mca.gov.cn/map')
# 3. Operate on the opened page
s1 = Select(driver.find_element(by=By.NAME,value='shengji'))

1.3, Functions used

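1, driver.find_element(by=By.XXX, value='VALUE') / driver.find_elements(by=By.XXX, value='VALUE')
# Purpose: locate elements by the given strategy (By.NAME, By.CLASS_NAME, ...)
# Example: driver.find_element(by=By.NAME,value='shengji')
# driver.find_element(by=By.CLASS_NAME,value="info_table")
# Return type: find_element returns a single WebElement (the first match); find_elements returns a list
2, Select(ELEMENT)
# Purpose: wrap the given <select> element in a Select object
# Example: s = Select(driver.find_element(by=By.NAME,value='shengji'))
# The options in the select are available as s.options[i]
# Example: province = s.options[i].text.split('（')[0]
# An option can be chosen with s.select_by_index() (or select_by_value())
# Example: s.select_by_index(i)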
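
The option handling described above can be tried without a browser. The following is a minimal sketch that assumes each option's text is a province name followed by a full-width parenthesis, such as `湖北省（鄂）`. The sample strings and the helper name `build_province_index` are illustrative assumptions, not taken from the site.

```python
def build_province_index(option_texts):
    """Map each province name to the position of its option in the select."""
    index_by_name = {}
    for position, text in enumerate(option_texts):
        # Keep only the part before the first full-width '（'.
        name = text.split('（')[0].strip()
        index_by_name[name] = position
    return index_by_name


# Hypothetical option texts; with selenium they would come from
# [option.text for option in Select(element).options].
sample_texts = ['湖北省（鄂）', '湖南省（湘）', '四川省（川）']
index_by_name = build_province_index(sample_texts)
print(index_by_name['湖南省'])  # position to pass to select_by_index
```

Storing the position once means later lookups by name do not need to scan the options again.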

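1.3, Example: using selenium to get administrative division information from the website of the Ministry of Civil Affairs of the People's Republic of China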

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time

# Open the browser
driver = webdriver.Chrome()
# Opening the browser as below runs it without the graphical interface
# option = webdriver.ChromeOptions()
# option.add_argument("--headless")
# driver = webdriver.Chrome(options=option)

driver.get('http://xzqh.mca.gov.cn/map')
# Get the select element
s1 = Select(driver.find_element(by=By.NAME,value='shengji'))
# Use a dict to store the mapping from province name to option index
provinces = {}
for index, option in enumerate(s1.options):
    provinces[option.text.split('（')[0]] = index

targets = ['湖北省','湖南省','四川省']
for name in targets:
    index = provinces[name]
    # Get the select element again, since references from before a navigation go stale
    s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
    # Select the wanted province
    s1.select_by_index(index)
    # Get the submit button element
    button = driver.find_element(by=By.CLASS_NAME,value='select_bn')
    # Click to navigate
    button.click()
    # Sleep to wait for the page to load
    time.sleep(2)
    # Get the table element
    table = driver.find_element(by=By.CLASS_NAME,value="info_table")
    # Get the area elements
    areas = table.find_elements(by=By.NAME,value='hidzxs')
    for area in areas:
        print(name+' '+area.get_property('value'),area.get_property('alt'))
    # Go back to the previous page
    driver.back()

# Close the browser when finished
driver.quit()
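
The fixed two-second sleep in the example either wastes time or is too short on a slow connection. One alternative is to poll until the element appears. The following is a minimal pure-Python sketch of that idea; the helper name `wait_until` and its parameters are hypothetical. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Possible use in place of the fixed sleep (assumes `driver` and `By` exist):
# find_elements returns an empty list until the table is present.
# tables = wait_until(
#     lambda: driver.find_elements(by=By.CLASS_NAME, value="info_table"))
# table = tables[0]
```

Polling with `find_elements` avoids exception handling, because an empty list simply counts as "not ready yet".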

1.4, Optimization

1.4.1, Problem description

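The approach above is slow whether or not the browser's graphical interface is shown. The cause is Selenium's choice of page load strategy.

selenium has three page load strategies: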

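Strategy	Ready state	Notes
normal	complete	Used by default; waits for all resources to finish downloading
eager	interactive	DOM access is ready, but other resources (such as images) may still be loading
none	Any	Does not block WebDriver at all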

Usage:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'eager'  # choose the strategy here
driver = webdriver.Chrome(options=options)
driver.get("http://www.google.com")
driver.quit()

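When no strategy is chosen, the default normal strategy is used. It returns only after all resources have finished loading, which is why it is slow.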
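
The strategy setting can be combined with headless mode in one place. The following is a rough sketch; the helper name `build_chrome_options` and its parameters are hypothetical, and only the three strategy names from the table above are accepted.

```python
VALID_STRATEGIES = ("normal", "eager", "none")


def build_chrome_options(strategy="eager", headless=False):
    """Return Chrome options that use the given page load strategy."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown page load strategy: {strategy!r}")
    # Imported here so the check above also runs where selenium is absent.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.page_load_strategy = strategy
    if headless:
        options.add_argument("--headless")
    return options


# Possible use (needs selenium and a Chrome driver installed):
# from selenium import webdriver
# driver = webdriver.Chrome(options=build_chrome_options("eager", headless=True))
```

With eager or none, `driver.get` returns before the page has fully loaded, so elements that appear late may need to be waited for before they are read.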

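Update, July 17, 2022

For reasons unknown, the way of setting the strategy shown above stopped working today, so I switched to the following approach: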

from selenium.webdriver import DesiredCapabilities
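desired_capabilities = DesiredCapabilities.CHROME  # change the page load strategy (run this before creating the driver)
desired_capabilities["pageLoadStrategy"] = "eager"  # commenting out these two lines delays the final output, i.e. output waits until the page has finished loading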

posted @ 2022-07-13 20:50 xiiii
