Web Scraping Basics, Part 2

`selenium`

Installation:

pip install selenium

Library for getting past anti-scraping checks

- undetected_chromedriver

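Install the browser driver (each browser has its own driver; Chrome is recommended)

- https://registry.npmmirror.com/binary.html?path=chromedriver/
	 - Check the version of your Chrome browser; mine is 96.0.4664.45
	 	- Note: if there is no driver with exactly the same version, step backwards and take the closest earlier one
	 		- For example, if 45 is missing, try 44; if 44 is missing, try 43; and so on
	 - There are Linux, macOS, and Windows builds; I downloaded chromedriver_win32.zip (choose this one on 64-bit Windows as well)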
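
The "step backwards to the closest version" rule above can be sketched in a few lines. This is only an illustration: the `available` list is a hypothetical stand-in for whatever versions the mirror page actually lists, and the helper names are made up for this example.

```python
def parse_version(version):
    """Split a dotted version such as '96.0.4664.45' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pick_driver_version(browser_version, available):
    """Return the closest available driver version not newer than the browser.

    Only versions that share the browser's first three components are
    considered, which mirrors "45 is missing, try 44, then 43, ...".
    Returns None when no such version exists.
    """
    target = parse_version(browser_version)
    candidates = [
        v for v in available
        if parse_version(v)[:3] == target[:3] and parse_version(v) <= target
    ]
    return max(candidates, key=parse_version, default=None)


# Hypothetical listing of driver versions copied from the mirror page
available = ["95.0.4638.69", "96.0.4664.18", "96.0.4664.35", "97.0.4692.20"]
print(pick_driver_version("96.0.4664.45", available))  # 96.0.4664.35
```

Comparing tuples of integers rather than raw strings avoids the trap where "9" sorts after "45" in a plain string comparison.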

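Installation pitfalls

- Use Selenium 3. With Selenium 4 the demo code below has to be changed (recent 4.x releases no longer provide the `find_element_by_*` helpers or the `executable_path` argument)
- Use urllib3 1.26.2. Newer urllib3 releases are incompatible with Selenium 3 and also raise errors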
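
A quick way to catch these pitfalls before running the demos is to check the installed versions up front. The sketch below uses only the standard library; it assumes that "newer urllib3" means the 2.x series, and the function names are invented for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("selenium", "urllib3")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def check_versions(found):
    """Return human-readable warnings for a {package: version} mapping."""
    warnings = [f"{name} is not installed" for name, ver in found.items() if ver is None]
    selenium_ver = found.get("selenium")
    urllib3_ver = found.get("urllib3")
    uses_selenium3 = bool(selenium_ver) and selenium_ver.startswith("3.")
    if selenium_ver and not uses_selenium3:
        warnings.append(f"selenium {selenium_ver}: the demos target 3.x")
    # Assumption: the incompatible "newer" urllib3 releases are 2.x and later
    if uses_selenium3 and urllib3_ver and int(urllib3_ver.split(".")[0]) >= 2:
        warnings.append(f"urllib3 {urllib3_ver}: selenium 3.x expects 1.x")
    return warnings


for line in check_versions(installed_versions()) or ["versions look fine"]:
    print(line)
```

If a warning shows up, pinning both packages at install time (for example `pip install "selenium<4" "urllib3==1.26.2"`) is one way to get back to the combination described above.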

Basic usage

from selenium.webdriver import Chrome

web = Chrome(executable_path='chromedriver.exe')
url = 'http://www.baidu.com'
web.get(url)
print(web.title)  # 百度一下,你就知道 (the title of the Baidu home page)

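Shortcut: put chromedriver.exe in the Python root directory (next to python.exe) so the script can find it automatically; this works when that directory is on PATH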

from selenium.webdriver import Chrome

# web = Chrome(executable_path='chromedriver.exe')
# Omitting the path also works once the driver can be found automatically
web = Chrome()
url = 'http://www.baidu.com'
web.get(url)
print(web.title)

`Lagou (拉勾网) job search` example

import time

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

web = Chrome()
url = 'https://www.lagou.com/'
web.get(url)
x_btn = web.find_element_by_xpath('//*[@id="cboxClose"]')  # the close (X) button of the popup
x_btn.click()  # dismiss the popup
time.sleep(2)  # wait for the rest of the HTML to finish loading
# Find the search box, type 'python', and press Enter to search
web.find_element_by_xpath('//*[@id="search_input"]').send_keys('python', Keys.ENTER)


import time

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys


web = Chrome()
url = 'https://www.lagou.com/'
web.get(url)
x_btn = web.find_element_by_xpath('//*[@id="cboxClose"]')
x_btn.click()
time.sleep(2)
web.find_element_by_xpath('//*[@id="search_input"]').send_keys('python',Keys.ENTER)

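time.sleep(2)  # wait for the search results to load before looking for them
divs = web.find_elements_by_xpath('//*[@id="jobList"]/div[1]/div')
# Run a JS snippet to remove the ad banner
web.execute_script("""
    var a = document.getElementsByClassName("un-login-banner")[0];
    if (a) { a.parentNode.removeChild(a); }
""")
for div in divs:
    title = div.find_element_by_xpath('.//*[@id="openWinPostion"]')
    title.click()  # open the job detail page, which appears in a new window
    # Switch to the newly opened window
    web.switch_to.window(web.window_handles[-1])
    job_detail = web.find_element_by_xpath('xxxx')
    text = job_detail.text
    print(text)
    time.sleep(1)
    web.close()  # close the current (detail) window
    web.switch_to.window(web.window_handles[0])  # switch back to the results window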

web.quit()  # close the browser
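
The fixed `time.sleep(2)` calls above either wait too long or not long enough, depending on how fast the page loads. A polling helper is one alternative; the sketch below is plain Python with invented names (`wait_until`, `ready_on_third_call`), and Selenium's own `WebDriverWait` in `selenium.webdriver.support.ui` covers the same need.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else None


print(wait_until(ready_on_third_call, timeout=2, interval=0.01))  # loaded

# With a real driver the same helper could wrap a lookup, for example:
# divs = wait_until(lambda: web.find_elements_by_xpath('//*[@id="jobList"]/div[1]/div'))
```

Because `find_elements_by_xpath` returns an empty list when nothing matches, it works directly as the condition: the helper keeps polling until at least one element shows up.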

`iframe` example

from selenium.webdriver import Chrome

web = Chrome()
web.get('http://www.wbdy.tv/play/30288_1_1.html')

iframe = web.find_element_by_xpath('//*[@id="mplay"]')  # a single element, not a list
web.switch_to.frame(iframe)  # enter the iframe
# .......
web.switch_to.parent_frame()  # leave the iframe
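
It is easy to forget the switch back to the parent frame, especially when the code inside the iframe raises an exception. One possible way to guard against that is a small context manager; `inside_frame` is a name invented for this sketch, and it only relies on the `switch_to.frame` and `switch_to.parent_frame` calls already used above.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Switch into `frame`, then switch back to the parent frame on exit.

    The switch back happens even if the body of the `with` block raises.
    """
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        driver.switch_to.parent_frame()


# Usage with the driver and iframe element from the example above:
# with inside_frame(web, iframe):
#     ...  # read elements inside the iframe here
# From this point on, lookups run against the parent document again
```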

Handling a dropdown list

import time

from selenium.webdriver import Chrome
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.options import Options


web = Chrome()
# web.get(...)  # load the page that contains the dropdown before locating it
sel = web.find_element_by_xpath('//*[@id="OptionDate"]')
sel_new = Select(sel)
# print(len(sel_new.options))  # number of options
# Common selection methods; select_by_index is used most often
# sel_new.select_by_index()
# sel_new.select_by_value()
# sel_new.select_by_visible_text()

for i in range(len(sel_new.options)):
    sel_new.select_by_index(i)  # switch to the i-th option
    trs = web.find_elements_by_xpath('//*xxx')  # the rows shown for this option
    for tr in trs:
        print(tr.text)
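
The loop above can be wrapped into a reusable helper that returns the rows grouped by option label. This is a sketch under assumptions: `rows_per_option` and `read_rows` are names invented here, and `select` only needs to behave like Selenium's `Select` (an `options` list whose items have `.text`, plus `select_by_index`).

```python
def rows_per_option(select, read_rows):
    """Select every option in turn and collect the rows shown for each one.

    `read_rows` is a caller-supplied function that returns the row texts
    currently visible on the page.
    """
    # Read the labels first: after a selection the page may re-render,
    # and option elements fetched earlier can become stale.
    labels = [option.text for option in select.options]
    results = {}
    for index, label in enumerate(labels):
        select.select_by_index(index)
        results[label] = read_rows()
    return results


# Usage with the objects from the example above (the row XPath stays elided):
# data = rows_per_option(
#     sel_new,
#     lambda: [tr.text for tr in web.find_elements_by_xpath('//*xxx')],
# )
```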

Other features

# Get the page HTML after JavaScript rendering (not the same as the raw 'page source' sent by the server)
page_source = web.page_source

Running Selenium as a headless browser (that is, without showing a browser window)

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

opt = Options()  # create the options object first
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')

web = Chrome(options=opt)  # pass the options in
web.get('http://www.baidu.com')
print(web.title)  # 百度一下,你就知道 (no browser window is shown)

Handling `CAPTCHAs`

Using Chaojiying (超级鹰) as the example

from selenium.webdriver import Chrome

web = Chrome()
web.get('https://www.chaojiying.com/user/login/')
# Returns binary data; usually there is no need to save it, just pass it straight to the CAPTCHA-recognition API
png = web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/div/img').screenshot_as_png
# with open('code.png','wb') as file:
#    file.write(png)
# Save the image to a file (returns True on success)
saved = web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/div/img').screenshot('code.png')
# Get the image as a base64 string
b64 = web.find_element_by_xpath('/html/body/div[3]/div/div[3]/div[1]/form/div/img').screenshot_as_base64
print('Done')
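
`screenshot_as_png` returns raw bytes, while `screenshot_as_base64` returns the same image as base64 text. The sketch below shows how the two forms relate using only the standard library. The bytes in `fake_png` are a stand-in, not a real screenshot, and the helper names are invented here. How a particular recognition service expects the image is not covered.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def looks_like_png(data):
    """Return True if the bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def png_to_base64(png_bytes):
    """Encode image bytes as an ASCII base64 string, e.g. for a JSON payload."""
    return base64.b64encode(png_bytes).decode("ascii")


def base64_to_png(text):
    """Decode a base64 string back into the original bytes."""
    return base64.b64decode(text)


# Stand-in for element.screenshot_as_png: the signature plus dummy bytes
fake_png = PNG_SIGNATURE + b"not a real image"
encoded = png_to_base64(fake_png)
print(looks_like_png(fake_png))            # True
print(base64_to_png(encoded) == fake_png)  # True
```

Checking the signature before sending is a cheap sanity test that the element lookup really produced an image rather than an empty or unexpected value.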

Posted on 2023-06-30 10:34 by 清安宁
