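Selenium: additional operations, iframes, anti-scraping measures, and case-study approaches

Other Selenium operations
Action chains and iframes
Selenium notes, approaches, and anti-scraping measures
Cookie login example
Approaches to image CAPTCHAs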

Other Selenium operations

Getting an attribute

Syntax:

variable.get_attribute('attribute_name')

e.g.:

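# Import the modules
from selenium import webdriver
import time

# Path to the browser driver (raw string, so the backslashes are not treated as escapes)
bro = webdriver.Chrome(r'D:\python3.6.8\Scripts\chromedriver.exe')
# Open the page
bro.get("https://tieba.baidu.com/f?ie=utf-8&kw=90%E5%90%8E%E7%BE%8E%E5%A5%B3")
# Find the first <a> tag (Selenium 3 API; Selenium 4 uses find_element(By.TAG_NAME, 'a'))
tag = bro.find_element_by_tag_name('a')
# Print the value of its href attribute
print(tag.get_attribute('href'))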

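Getting the text content

Syntax:

variable.text

e.g.:

# Get the text
print(tag.text)

Getting an element's ID, location, tag name, and size (for reference)

Syntax:

variable.id        (Selenium's internal element ID, not the HTML id attribute)
variable.location
variable.tag_name
variable.size

e.g.: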

print(tag.id)
print(tag.location)
print(tag.tag_name)
print(tag.size)

Browser back and forward

e.g.:

# Go back
bro.back()
# Go forward
bro.forward()

Cookie management

# Get all cookies
print(bro.get_cookies())
# Set a cookie (the dict must contain the 'name' and 'value' keys)
bro.add_cookie({'name': 'nn', 'value': 'jhh'})
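
Selenium's add_cookie() takes a dict with the required keys 'name' and 'value', plus a few optional ones. The following is a minimal sketch of a helper that builds such a dict and rejects unknown keys; the function name make_cookie is hypothetical and not part of Selenium.

```python
# Hypothetical helper (not part of Selenium): builds a cookie dict for add_cookie()
ALLOWED_OPTIONAL_KEYS = {"path", "domain", "secure", "httpOnly", "expiry", "sameSite"}


def make_cookie(name, value, **extra):
    """Return a dict with the required 'name' and 'value' keys plus optional ones."""
    unknown = set(extra) - ALLOWED_OPTIONAL_KEYS
    if unknown:
        raise ValueError("unsupported cookie keys: %s" % sorted(unknown))
    cookie = {"name": str(name), "value": str(value)}
    cookie.update(extra)
    return cookie


# Usage with the browser object from the examples above:
# bro.add_cookie(make_cookie('nn', 'jhh', path='/'))
```

A cookie can only be added for the domain of the page that is currently open, so call bro.get() on the site first.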

Running JavaScript

Import the module

from selenium import webdriver

Syntax:

variable.execute_script('code')

e.g.:

# JavaScript: scroll the page down to y = 200
bro.execute_script('window.scrollTo(0,200)')
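
Pages that load content as you scroll usually need more than one scrollTo call. As a rough sketch, the helper below scrolls down in steps and re-reads the page height each time; the name scroll_to_bottom and its default values are assumptions chosen for illustration.

```python
import time


def scroll_to_bottom(driver, step=500, pause=0.5, max_scrolls=100):
    """Scroll down step by step until the bottom of the page is reached."""
    position = 0
    for _ in range(max_scrolls):
        # Re-read the height on every pass, because lazy-loaded pages grow
        height = driver.execute_script('return document.body.scrollHeight')
        if position >= height:
            break
        position = min(position + step, height)
        # arguments[0] receives the value passed after the script string
        driver.execute_script('window.scrollTo(0, arguments[0])', position)
        time.sleep(pause)
    return position


# Usage with the browser object from the examples above:
# scroll_to_bottom(bro)
```

The max_scrolls cap keeps the loop from running forever on pages that scroll without end.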

Tab management

e.g.:

# Open a new tab
bro.execute_script('window.open()')
# Print the handles of all tabs
print(bro.window_handles)

# Switch to the second tab and load a page in it
bro.switch_to.window(bro.window_handles[1])
bro.get('https://www.taobao.com')
time.sleep(3)
# Switch back to the first tab and load a page in it
bro.switch_to.window(bro.window_handles[0])
bro.get('https://www.sina.com.cn')
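
Indexing window_handles by position works in this small example, but the WebDriver specification does not guarantee the order of the handles. A more defensive sketch compares the handles before and after opening the tab; the helper name new_handles is hypothetical.

```python
def new_handles(before, after):
    """Return the handles in `after` that were not in `before`, keeping their order."""
    seen = set(before)
    return [handle for handle in after if handle not in seen]


# Usage with the browser object from the examples above:
# before = bro.window_handles
# bro.execute_script('window.open()')
# opened = new_handles(before, bro.window_handles)
# bro.switch_to.window(opened[0])
```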

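Action chains and iframes

There is rarely a need to crack a slider CAPTCHA in code; sliding it by hand and then reusing the cookies is usually simpler

The elements an action chain operates on often sit inside an iframe tag, a page nested within the page

Code and approach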

# Import the modules
from selenium import webdriver
from selenium.webdriver import ActionChains
import time
# Start the browser driver
driver = webdriver.Chrome(r'D:\python3.6.8\Scripts\chromedriver.exe')
# Open the page
driver.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
# The elements are inside an iframe, so switch into it first
driver.switch_to.frame('iframeResult')

# Find the element to drag
sourse = driver.find_element_by_id('draggable')
# Find the drop target
target = driver.find_element_by_id('droppable')

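Method 1:

Run everything serially in one action chain (the drag is unrealistically fast, so sites with stronger bot detection are likely to notice it)

# Create the action chain
actions = ActionChains(driver)
# Queue the drag-and-drop in the chain
actions.drag_and_drop(sourse, target)
# Execute the queued actions
actions.perform()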

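Method 2:

Use a separate action chain for each small move and pause between moves (so the site is less likely to notice)

# Press and hold the element to drag
ActionChains(driver).click_and_hold(sourse).perform()
# Horizontal distance from the source to the target
distance = target.location['x'] - sourse.location['x']
# Distance moved so far
track = 0
# Keep moving until the distance is covered
while track < distance:
    # Move 2 pixels to the right with a fresh action chain
    ActionChains(driver).move_by_offset(xoffset=2, yoffset=0).perform()
    # Record the distance actually moved
    track += 2
    # Pause between moves
    time.sleep(0.5)
# Release the mouse button (perform() is required, otherwise nothing is sent)
ActionChains(driver).release().perform()
# Close the browser window
driver.close()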
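
To make each move a different size, the step sizes can be generated up front. This is only a sketch under simple assumptions: generate_track is a hypothetical helper, and the 2 to 8 pixel range is an arbitrary choice, not a value known to defeat any particular detector.

```python
import random


def generate_track(distance, min_step=2, max_step=8, seed=None):
    """Split `distance` pixels into random steps that sum exactly to `distance`."""
    rng = random.Random(seed)
    steps = []
    remaining = distance
    while remaining > 0:
        # The last step is clipped so the total never overshoots
        step = min(rng.randint(min_step, max_step), remaining)
        steps.append(step)
        remaining -= step
    return steps


# Usage with the driver, sourse and target from the examples above:
# ActionChains(driver).click_and_hold(sourse).perform()
# for step in generate_track(distance):
#     ActionChains(driver).move_by_offset(xoffset=step, yoffset=0).perform()
#     time.sleep(random.uniform(0.05, 0.2))
# ActionChains(driver).release().perform()
```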

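iframe pages

A page can embed another complete HTML page

The embedded page usually lives in an iframe tag, which contains a full HTML document of its own

To find elements inside it, first switch into the frame by passing its id or name

e.g.:

driver.switch_to.frame('iframeResult')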
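
After working inside a frame, the driver stays there until it is switched back with switch_to.default_content(). One possible way to make that automatic is a small context manager; in_frame is a hypothetical name and this is a sketch rather than anything Selenium ships.

```python
from contextlib import contextmanager


@contextmanager
def in_frame(driver, frame):
    """Switch into `frame` for the duration of the with-block, then switch back."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Always return to the top-level document, even if an error occurred
        driver.switch_to.default_content()


# Usage with the driver from the examples above:
# with in_frame(driver, 'iframeResult'):
#     box = driver.find_element_by_id('draggable')
```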

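Selenium notes, approaches, and anti-scraping measures

Slider CAPTCHAs

Selenium can complete a slider CAPTCHA automatically with the action chains described earlier

# Cracking slider CAPTCHAs in code is not recommended: it is tedious, and the point here is to understand the approach

Note:

'''
Do not drag a slider CAPTCHA too fast; the page monitors the drag
    A drag that lands in one quick step is treated as a bot
'''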

Headless mode

Import the modules

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Code

# Create the options object
chrome_options = Options()
# Run without a browser window
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Pass the options with options= (chrome_options= is deprecated)
bro = webdriver.Chrome(r'D:\python3.6.8\Scripts\chromedriver.exe', options=chrome_options)
bro.get('https://www.baidu.com')
# Print the page's HTML source
print(bro.page_source)

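Avoiding detection of Selenium

Adding the following code makes the browser less likely to be identified as automated

from selenium import webdriver
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
# The option name is 'excludeSwitches'
option.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome(options=option)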

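Cookie login example

Approach

1. Open the site with Selenium
2. Wait long enough for the user to log in by hand
3. Read the cookies with Selenium
4. Use the cookies with requests to fetch data as the logged-in user

Code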

# Import the modules
import requests
from selenium import webdriver
import time
import json

# Login page URL
url = 'https://account.cnblogs.com/signin?returnUrl=https%3A%2F%2Fwww.cnblogs.com%2F'
# Create the driver object
driver = webdriver.Chrome(r'D:\python3.6.8\Scripts\chromedriver.exe')
# Open the page
driver.get(url=url)
# Leave time for the user to type the username and password
time.sleep(30)
# Refresh the page so the cookies are up to date
driver.refresh()
# Get the cookies the server returned after a successful login
c = driver.get_cookies()
print(c)
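
A fixed 30-second sleep is either too long or too short, depending on the user. As an alternative sketch, the code can poll the cookies until a login cookie shows up. The helper name wait_for_cookie is hypothetical, and the cookie name 'session_id' in the usage note is a placeholder: check the real name in the browser's developer tools.

```python
import time


def wait_for_cookie(get_cookies, name, timeout=60, interval=1.0):
    """Poll get_cookies() until a cookie called `name` appears; None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        for cookie in get_cookies():
            if cookie.get('name') == name:
                return cookie
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Usage with the driver from the example above ('session_id' is a placeholder):
# if wait_for_cookie(driver.get_cookies, 'session_id', timeout=120) is None:
#     print('login was not completed in time')
```

Passing the bound method driver.get_cookies, rather than the driver itself, keeps the helper independent of Selenium.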

Saving the data

# Write the cookies to a file for later use
with open('xxx.txt', 'w') as f:
    # Write them as JSON
    json.dump(c, f)

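Loading the data and sending a request as the logged-in user

# Dict that will hold the cookies
cookies = {}
# Open the file
with open('xxx.txt', 'r') as f:
    # Load the saved cookie list into di
    di = json.load(f)
# Convert to the format requests expects; only name and value are needed
for cookie in di:
    # Map each cookie name to its value
    cookies[cookie['name']] = cookie['value']
print(cookies)

# Send the request with these cookies
response = requests.get(url='https://i-beta.cnblogs.com/api/user', cookies=cookies)
# Choose the encoding
response.encoding = response.apparent_encoding
# Print the response body
print(response.text)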

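Approaches to image CAPTCHAs

Approach 1:

    Image recognition (OCR)
        Software: Tesseract-ocr
        Module: pytesseract

Approach 2:

CAPTCHA-solving platforms
    Pay for a third-party service
        The service first tries to recognize the image in code; if that fails, it has staff who read it by eye

Approach 3:

Read the CAPTCHA yourself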
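
For approach 1, a minimal sketch with pytesseract might look like the following. It assumes Tesseract-ocr, pytesseract, and Pillow are installed; the function names, the grayscale conversion, and the '--psm 7' (single text line) setting are illustrative choices, and plain OCR often fails on distorted CAPTCHAs.

```python
import re


def normalize_ocr_text(raw, expected_length=None):
    """Keep only letters and digits; return None if the length is not as expected."""
    text = re.sub(r'[^0-9A-Za-z]', '', raw)
    if expected_length is not None and len(text) != expected_length:
        return None
    return text


def ocr_captcha(image_path, expected_length=None):
    """Read a CAPTCHA image with Tesseract and clean up the result."""
    # Imported here so normalize_ocr_text works without the OCR stack installed
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        # Convert to grayscale, then treat the image as a single line of text
        raw = pytesseract.image_to_string(image.convert('L'), config='--psm 7')
    return normalize_ocr_text(raw, expected_length)


# Usage ('captcha.png' is a placeholder file name):
# print(ocr_captcha('captcha.png', expected_length=4))
```

When the result is None, the usual fallback is to request a fresh CAPTCHA and try again, or to move on to approach 2 or 3.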

Bilibili video case: approach

URLs of the full code:

https://www.cnblogs.com/xiaoyuanqujing/articles/12016934.html
https://www.cnblogs.com/xiaoyuanqujing/articles/12014416.html

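"""
Many Bilibili videos are split into two streams:
    video (picture only, no sound) and audio (the sound that goes with the video)
"""

Approach

1. Pick any video URL
2. Inspect the page: open the Network panel; one request carries the video and another carries the audio

3. Check how the data is loaded and find the URLs
4. Send requests to those URLs with requests
5. Save the data to a file: send requests in a loop and write each response to the file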
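
Step 5 can be sketched as follows, assuming the media server accepts HTTP Range requests so the file can be fetched piece by piece. The helper byte_ranges and the names video_url, page_url, and total_size in the usage note are hypothetical; whether a Referer header is needed is also an assumption to verify in the Network panel.

```python
def byte_ranges(total_size, chunk_size=1024 * 1024):
    """Yield HTTP Range header values that together cover `total_size` bytes."""
    for start in range(0, total_size, chunk_size):
        # Range ends are inclusive, hence the -1
        end = min(start + chunk_size, total_size) - 1
        yield 'bytes=%d-%d' % (start, end)


# Usage (video_url, page_url and total_size are placeholders):
# import requests
# with open('video.m4s', 'wb') as f:
#     for value in byte_ranges(total_size):
#         r = requests.get(video_url, headers={'Range': value, 'Referer': page_url})
#         f.write(r.content)
```

The audio stream is downloaded the same way into a second file; merging the two files is a separate step that the linked articles cover.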

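红素网 novel case

Case URL: https://www.cnblogs.com/xiaoyuanqujing/protected/articles/11868250.html

Approach

1. The novel detail page disables the left and right mouse buttons, but F12 still opens the developer console
2. Not all of the novel text is loaded directly; look for the follow-up requests that fetch the rest

3. Inspect the requests to find how the URL relates to the request body

4. Decryption has to happen in JavaScript, and the code usually contains the keyword decrypt. In the Sources panel, right-click top, choose "Search in all files", and search for decrypt; several matches turn up

5. Look for content and data to find the key data

6. Click through to view the data and confirm that the content you want is the encrypted string seen earlier

7. Fetch the content data and the other data to get the JS code, which then has to be run to decrypt the data
8. Decrypting with the content code alone leaves gaps; the other code is needed as well
9. Create an HTML file that holds the JS code from content and the JS code from other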

Posted 2021-09-29 14:49 by 微纯册
