Automated Web Scraping with Selenium: A Python Implementation

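Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE, Mozilla Firefox, Safari, Google Chrome, Opera, and Edge. Its main uses are compatibility testing (checking that an application works well across different browsers and operating systems) and functional testing (building regression tests that verify software behavior against user requirements). It also supports recording actions and automatically generating test scripts for languages and platforms such as .NET, Java, and Perl.

Because Selenium drives a real browser, it also copes well with many anti-scraping measures. Some sites do not place the target URL directly in the markup you see in the element inspector; instead they build it with JavaScript or embed other markers to block requests-based crawlers. Selenium gets around most of these problems, although it is slower overall than requests.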

1. Selenium Success Stories

Many companies and organizations use Selenium for automated testing with good results. Some examples:

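Company	Use
Google	Uses Selenium to test web applications, and has open-sourced its own Selenium testing framework on GitHub
Netflix	Uses Selenium to test web applications, ensuring its video streaming service is compatible and stable across browsers and platforms
Amazon	Uses Selenium to test web applications, ensuring its e-commerce platform works correctly and meets user needs
Twitter	Uses Selenium to test web applications, ensuring the stability and reliability of its social media platform's features and services
Uber	Uses Selenium to test web applications, ensuring its mobile apps and website run correctly and provide a good user experience
Microsoft	Uses Selenium to test web applications, ensuring the quality and performance of its products and services meet expectations
Adobe	Uses Selenium to test web applications, ensuring the features and performance of its creative software and digital media solutions meet user expectations
IBM	Uses Selenium to test web applications, ensuring the quality and reliability of its enterprise software and solutions
NASA	Uses Selenium for automated testing, ensuring the correctness and stability of its web applications
Airbnb	Uses Selenium to test web applications, and uses Selenium Grid to run test cases in parallel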

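These cases show that Selenium is a capable automated testing tool that can help organizations improve test efficiency and quality while reducing testing costs.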

2. Selenium Learning Guide

Selenium documentation in Chinese: https://selenium-python-zh.readthedocs.io/en/latest/getting-started.html

The find_element_by_* helper methods were deprecated in Selenium 4 and removed in 4.3, so the table uses the By-based form, with By imported from selenium.webdriver.common.by.

Locator	Method	Exception when nothing matches
id	driver.find_element(By.ID, value)	NoSuchElementException
name	driver.find_element(By.NAME, value)	NoSuchElementException
class_name	driver.find_element(By.CLASS_NAME, value)	NoSuchElementException
link_text	driver.find_element(By.LINK_TEXT, value)	NoSuchElementException
tag_name	driver.find_element(By.TAG_NAME, value)	NoSuchElementException
xpath	driver.find_element(By.XPATH, value)	NoSuchElementException
css	driver.find_element(By.CSS_SELECTOR, value)	NoSuchElementException
partial_link_text	driver.find_element(By.PARTIAL_LINK_TEXT, value)	NoSuchElementException
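
Every method in the table raises NoSuchElementException when nothing matches. Where a missing element is a normal case rather than an error, one option is the plural find_elements, which returns an empty list instead of raising. The following is a minimal sketch of that idea; the helper name first_or_none is made up for illustration and is not part of Selenium, and driver is assumed to be a Selenium WebDriver.

```python
def first_or_none(driver, by, value):
    """Return the first element matching the locator, or None if there is none.

    `driver` is assumed to be a Selenium WebDriver (or a WebElement).
    The helper name is hypothetical and not part of Selenium itself.
    """
    # find_elements (plural) returns an empty list when nothing matches,
    # whereas find_element (singular) raises NoSuchElementException.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None
```

With a real driver this could be called as first_or_none(driver, By.NAME, 'wd'), with By imported from selenium.webdriver.common.by.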

3. Selenium Examples

3.1 Scraping http://www.baidu.com

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from bs4 import BeautifulSoup
import re

# Configure and create the WebDriver object
driver = webdriver.Chrome()

try:
    # Send the GET request
    driver.get('http://www.baidu.com/')

    # Wait up to 10 seconds for the search box to appear
    WebDriverWait(driver, 10).until(
        expected_conditions.presence_of_element_located((By.NAME, 'wd')))

    # Type the keyword into the search box and submit the form
    input_element = driver.find_element(By.NAME, 'wd')
    input_element.send_keys('python')
    input_element.submit()

    # Wait up to 10 seconds for the expected text (here, the search keyword) to appear in the browser title
    WebDriverWait(driver, 10).until(
        expected_conditions.title_contains('python'))

    # Print the current page title
    print("Page title:", driver.title)

    # Parse the page content with BeautifulSoup
    bsobj = BeautifulSoup(driver.page_source, 'html.parser')

    # Look up the number of search results
    num_text_element = bsobj.find('span', {'class': 'nums_text'})
    if num_text_element:
        # Print the result count in its raw format
        print("Raw Search Results Count:", num_text_element.text)
        # Keep only digits and commas from the result count text
        nums = filter(lambda s: s == ',' or s.isdigit(), num_text_element.text)
        # Print the cleaned result count
        print("Cleaned Search Results Count:", ''.join(nums))

    # Find the search results
    elements = bsobj.find_all('div', {'class': re.compile('c-container')})
    for element in elements:
        # Results without an <h3><a> heading yield empty strings
        title = element.h3.a.text.strip() if element.h3 and element.h3.a else ""
        link = element.h3.a['href'] if element.h3 and element.h3.a else ""
        print('Title:', title)
        print('Link:', link)
        print('=' * 70)

finally:
    # Close the browser
    driver.quit()
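
The cleaning step above keeps every digit and comma found in the text. If the count is needed as a number, a regular expression can pull out the first run of digits and convert it. This is a sketch under assumptions: the function name parse_result_count is made up, and the sample string is a guess at the text format, not captured output.

```python
import re


def parse_result_count(text):
    """Return the first number in `text` as an int, ignoring thousands separators.

    Returns None when the text contains no digits.
    """
    # A digit followed by any mix of digits and commas, e.g. "100,000,000"
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


# The sample string is an assumption about the text format, not real output
print(parse_result_count('百度为您找到相关结果约100,000,000个'))
```

Unlike the filter-based version, this takes only the first number, so unrelated digits elsewhere in the text are not concatenated into the result.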

3.2 Scraping the novel 极品家丁 from https://www.00ksw.com/html/3/3804/

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait

url = "https://www.00ksw.com/html/3/3804/"

# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument("--headless")

with Chrome(options=chrome_options) as driver:
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    # Wait until the chapter list (ml_list) shows up in the page source
    wait.until(lambda d: "ml_list" in d.page_source)

    def get_article_content(article_url):
        """Open a chapter page and return the text of its content block."""
        driver.get(article_url)
        wait.until(lambda d: "articlecontent" in d.page_source)
        return driver.find_element(By.XPATH, "//div[@id='articlecontent']").text

    article_links = driver.find_elements(By.XPATH, "//div[@class='ml_list']//ul//li//a")

    # Collect titles and URLs first: the link elements go stale once the driver navigates away
    articles_data = []
    for link in article_links:
        article_url = link.get_attribute("href")
        article_title = link.text
        articles_data.append([article_title, article_url])

    # Visit each chapter and write its title and text to the output file
    with open("结果.txt", "w", encoding="utf8") as file:
        for title, content_url in articles_data:
            content = get_article_content(content_url)
            print(title, content_url)
            file.write(f"{title}\n{content}\n")
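
The wait.until calls above are explicit waits: the condition is called repeatedly until it returns a truthy value or the timeout runs out. The sketch below shows that polling idea in plain Python. It is not Selenium's implementation; wait_until and the demo condition are made-up names. The real WebDriverWait also passes the driver to the condition and raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Demo: a hypothetical condition that becomes true on the third check
calls = {'n': 0}


def ready_on_third_call():
    calls['n'] += 1
    return calls['n'] >= 3


print(wait_until(ready_on_third_call, timeout=2, poll_interval=0.01))
```

Polling like this returns as soon as the page is ready, whereas a fixed time.sleep always waits the full duration.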

3.3 Scraping https://data.eastmoney.com/bbsj/202012/lrb.html

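# Scrape a single page of listed-company income statement data from Eastmoney
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
import pandas as pd

web = webdriver.Chrome()
try:
    web.get('https://data.eastmoney.com/bbsj/202012/lrb.html')
    # The table may be filled in by JavaScript, so wait until it is present;
    # element is a WebElement
    element = WebDriverWait(web, 10).until(
        expected_conditions.presence_of_element_located((By.CLASS_NAME, 'dataview-body')))
    # Narrow down to each row of the table body
    tr_list = element.find_element(By.TAG_NAME, 'tbody').find_elements(By.TAG_NAME, 'tr')
    data = []  # one list of cell texts per table row
    for tr in tr_list:
        td_list = tr.find_elements(By.TAG_NAME, 'td')
        data.append([td.text for td in td_list])
finally:
    # Always close the browser, even if scraping fails
    web.quit()

# Writing .xlsx requires an Excel engine such as openpyxl to be installed
data = pd.DataFrame(data)
data.to_excel('东方财富网上市公司2020年年报利润表第一页数据.xlsx', index=False)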
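
Writing .xlsx with to_excel depends on an Excel engine such as openpyxl. If that is not available, the same rows can be saved as CSV using only the standard library. This is a sketch under that assumption; rows_to_csv, the column names, and the sample rows are made up for illustration and stand in for the data list built above.

```python
import csv
import io


def rows_to_csv(rows, header=None):
    """Serialize a list of rows (lists of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows standing in for the scraped table cells
sample_rows = [['1', 'AAA', '1,234.5'], ['2', 'BBB', '67.8']]
csv_text = rows_to_csv(sample_rows, header=['rank', 'name', 'value'])
print(csv_text)
```

Cells that contain commas, such as formatted amounts, are quoted automatically by the csv module, so the columns stay aligned when the file is opened in a spreadsheet.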

Summary

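Selenium is a powerful automated testing tool, and it is also widely used for automated scraping. Because it drives a real browser, it can simulate user actions to visit pages, extract their content, and interact with them.
First, Selenium can fetch pages and extract data. By simulating a user's actions in the browser, it can open a page and pull out text, images, links, and other information, which makes it suitable for scraping many kinds of web data, including news, product listings, and social media content.
Second, Selenium supports page interaction and automated testing. It can automate actions such as filling in forms, clicking buttons, and submitting requests, so it serves not only simple data extraction but also complex automated site testing and interactive scraping.
In addition, Selenium supports multiple browsers and operating systems, including mainstream browsers such as Chrome, Firefox, and IE, and common operating systems such as Windows, Mac, and Linux. This gives it good compatibility and portability across environments.
Overall, Selenium plays an important role in both automated testing and automated scraping. By simulating user actions, it provides an efficient and accurate way to collect web data for analysis across many industries.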