CSS Selectors, XPath Selectors, and the Selenium Testing Tool

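bs4: its own selectors, CSS selectors

lxml: CSS selectors, XPath selectors

selenium: its own selectors, CSS selectors, XPath selectors

scrapy framework: its own selectors, CSS selectors, XPath selectors

For both CSS and XPath, you can right-click an element in the browser's developer tools and choose Copy selector or Copy XPath.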

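1. CSS selectors

Tag object .select("CSS selector")

CSS selectors work the same way as in front-end code: #id, .classname, div > p (direct children), div p (all descendants), div a:last-child (an a tag that is the last child of its parent)

See the official documentation for details.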

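from bs4 import BeautifulSoup

# Sample document assumed for illustration (not part of the original post)
html_doc = '''
<p class="story">
 <a class="sister" id="link1" href="/elsie"><span>Elsie</span></a>
 <a class="sister" id="link2" href="/lacie"><span>Lacie</span></a>
</p>
<div id="list-2"><h1 class="title">Heading</h1></div>
'''
soup = BeautifulSoup(html_doc, 'lxml')
# CSS selectors
print(soup.p.select('.sister'))  # elements with class sister inside the first p
print(soup.select('.sister span'))  # every span below an element with class sister

print(soup.select('#link1'))  # the element whose id is link1
print(soup.select('#link1 span'))  # every span below the element whose id is link1

# Get the attributes: .attrs
print(soup.select('#list-2 h1')[0].attrs)

# Get the text content: .get_text()
print(soup.select('#list-2 h1')[0].get_text())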

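2. XPath selectors

XPath is a language for finding information in XML documents.

Basic usage:

/a      Selects a only at the top level of the document. All matches are returned; if nothing matches, the result is an empty list, not None.

//a     Selects every a tag in the document, at any depth.

.       Selects the current node

..      Selects the parent of the current node

/@attribute-name    Selects an attribute

/text()    Selects the text content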
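
The path expressions above can be tried without installing anything. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml; ElementTree supports only a subset of XPath (no text() or @attribute steps), and the sample document is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
SAMPLE = """
<root>
  <div id="images">
    <a href="image1.html">first</a>
    <a href="image2.html">second</a>
    <a href="image3.html">third</a>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)

# .//a : every <a> below the current node, at any depth
links = root.findall(".//a")
print([a.get("href") for a in links])

# ./a : direct children only; <root> has no <a> child, so this is an empty list
direct = root.findall("./a")
print(direct)

# .. : step from the matching <a> up to its parent
parents = root.findall(".//a[@href='image2.html']/..")
print([p.tag for p in parents])

# [last()-1] : the second-to-last <a> among its siblings
second_last = root.findall(".//a[last()-1]")
print([a.text for a in second_last])
```

Note that a failed lookup gives an empty list here as well, so the result can be checked with a plain `if not result:`.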

Detailed usage:

doc = '''
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html' id="xxx">Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <h5>test</h5>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
   <a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
  </div>
 </body>
</html>
'''

from lxml import etree
html = etree.HTML(doc)
# 1 All nodes
a=html.xpath('//*')

# 2 A specific node (the result is a list)
a=html.xpath('//head')

# 3 Child and descendant nodes
a = html.xpath('//div/a')  # a nodes that are direct children of a div
a = html.xpath('//div//img')  # img nodes anywhere below a div

# 4 Parent node
a = html.xpath('//body//a[1]/..')  # .. selects the parent node

# 5 Attribute matching  [@href="image5.html"]
a = html.xpath('//body//a[@href="image5.html"]')

# 6 Getting text  /text()
a = html.xpath('//body//a[@href="image5.html"]/text()')

# 7 Getting attributes  /@attribute-name
a = html.xpath('//div/a/@href')
a = html.xpath('//div/a[2]/@href')

# 8 Matching an attribute that holds several values: contains
# a = html.xpath('//body//a[@class="li"]')  # finds nothing when class holds several values
a = html.xpath('//body//a[contains(@class, "li")]/text()')  # contains() is a substring match, so it finds class='li li-item'

# 9 Matching on several attributes: or / and
a=html.xpath('//body//a[contains(@class,"li") or @name="items"]/text()')
a=html.xpath('//body//a[contains(@class,"li") and @name="items"]/text()')

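# 10 Selecting by position (a[2] is an a that is the second a among its siblings)
# a=html.xpath('//a[2]/text()')
# a=html.xpath('//a[2]/@href')
# The last one (for reference)
# a=html.xpath('//a[last()]/@href')
# a=html.xpath('//a[last()]/text()')
# Position less than 3
# a=html.xpath('//a[position()<3]/@href')
# a=html.xpath('//a[position()<3]/text()')
# The second-to-last one (last()-2 would be the third-to-last)
# a=html.xpath('//a[last()-1]/@href')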

# 11 Selecting along node axes
# ancestor: ancestor nodes
# With * this returns every ancestor node
# a=html.xpath('//a/ancestor::*')

# # Only the div among the ancestors
# a=html.xpath('//a/ancestor::div')
# a=html.xpath('//a/ancestor::div/a[2]/text()')
# attribute: attribute values
# a=html.xpath('//a[1]/attribute::*')
# a=html.xpath('//a[1]/@href')
# child: direct child nodes
# a=html.xpath('//a[1]/child::*')
# a=html.xpath('//a[1]/img/@src')
# descendant: all descendant nodes
# a=html.xpath('//a[6]/descendant::*')

# following: every node after the current node in document order
# a=html.xpath('//a[1]/following::*')
# a=html.xpath('//a[1]/following::*[1]/@href')
# following-sibling: nodes after the current node at the same level
# a=html.xpath('//a[1]/following-sibling::*')
# a=html.xpath('//a[1]/following-sibling::a')
# a=html.xpath('//a[1]/following-sibling::*[2]')
# a=html.xpath('//a[1]/following-sibling::*[2]/@href')

print(a)
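
One caveat about step 8: contains(@class, "li") is a substring test, so it also matches class values such as "list-item". The sketch below illustrates the difference in plain Python; has_class is a hypothetical helper written for this example, not part of lxml. In XPath 1.0 the usual token-exact idiom is //a[contains(concat(" ", normalize-space(@class), " "), " li ")].

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in (class_attr or "").split()


# Substring test, which is what contains(@class, "li") does
substring_hit = "li" in "list-item"
print(substring_hit)

# Token test: "li" is not a class of "list-item", but it is one of "li li-item"
print(has_class("list-item", "li"))
print(has_class("li li-item", "li"))
```

The first line prints True while the second prints False, which is exactly the false positive the concat/normalize-space idiom avoids.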

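3. selenium

selenium started out as an automated testing tool. Crawlers use it because requests cannot execute JavaScript, so pages that are rendered by JS need a real browser.

selenium drives a real browser (Firefox, Chrome, IE, and so on) and simulates user actions in it, such as clicking, navigating, typing, and scrolling.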

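1. Basic usage

Install: pip3 install selenium

Download the browser driver (it must match your browser version): http://npm.taobao.org/mirrors/chromedriver/

from selenium import webdriver

# Path to the driver (relative or absolute)
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

# Open a URL
bro.get('https://www.baidu.com')

# Implicit wait (at most the given number of seconds)
# A lookup waits only while the element is missing; once it has loaded, it is returned immediately
bro.implicitly_wait(10)

bro.close()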

2. Typing, clicking, and navigating

from selenium import webdriver
import time

# Path to the driver (relative or absolute)
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

# Open a URL
bro.get('https://www.baidu.com')

# Implicit wait (at most the given number of seconds)
# A lookup waits only while the element is missing; once it has loaded, it is returned immediately
bro.implicitly_wait(10)

# Built-in locators for finding elements such as the input box
# 1. find_element_by_id                 # by id
# 2. find_element_by_link_text          # by the text of an a tag
# 3. find_element_by_partial_link_text  # by part of the text of an a tag
# 4. find_element_by_tag_name           # by tag name
# 5. find_element_by_class_name         # by class name
# 6. find_element_by_name               # by the name attribute (name='xx')
# 7. find_element_by_css_selector       # by CSS selector
# 8. find_element_by_xpath              # by XPath

# Find the input box and type into it
# input_search = bro.find_element_by_xpath('//*[@id="kw"]')
input_search = bro.find_element_by_css_selector('#kw')
input_search.send_keys('京东')  # the search term (JD.com)

# Find the search button
enter = bro.find_element_by_id('su')
# Click the button
enter.click()

time.sleep(5)

bro.close()
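
The find_element_by_* helpers listed above belong to the Selenium 3 API. In Selenium 4 they were deprecated and later removed in favor of find_element(by, value), and executable_path was likewise replaced by a Service object. The sketch below only maps the old helper names to the locator strategy strings defined in selenium.webdriver.common.by.By, so it runs without a browser; LEGACY_TO_STRATEGY and modern_call are hypothetical names made up for this illustration.

```python
# Strategy strings as defined by selenium.webdriver.common.by.By
# (By.ID == "id", By.CSS_SELECTOR == "css selector", and so on).
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_name": "name",
    "find_element_by_css_selector": "css selector",
    "find_element_by_xpath": "xpath",
}


def modern_call(legacy_name, value):
    """Render the find_element(...) call that replaces a legacy helper."""
    strategy = LEGACY_TO_STRATEGY[legacy_name]
    return f"find_element({strategy!r}, {value!r})"


print(modern_call("find_element_by_css_selector", "#kw"))
print(modern_call("find_element_by_id", "su"))
```

With Selenium 4 installed, the equivalent of the lookup above would be bro.find_element(By.CSS_SELECTOR, '#kw') after from selenium.webdriver.common.by import By.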

3. Simulated login

# Continues with bro and time from the previous example
# The login link (its link text means "log in")
submit_button = bro.find_element_by_link_text('登录')
submit_button.click()
# Switch to account login
user_button = bro.find_element_by_id('TANGRAM__PSP_10__footerULoginBtn')
user_button.click()
# Enter the account name
user = bro.find_element_by_id('TANGRAM__PSP_10__userName')
user.send_keys('1452518231@qq.com')
# Enter the password
pwd = bro.find_element_by_id('TANGRAM__PSP_10__password')
pwd.send_keys('123456')
# Click the login button
submit = bro.find_element_by_id('TANGRAM__PSP_10__submit')
submit.click()

time.sleep(10)

bro.close()

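4. Getting cookies

After logging in you can collect the cookies and build a cookie pool. Combining a proxy pool with a cookie pool helps avoid account bans and IP bans (log in with a batch of throwaway accounts to collect their cookies).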

# bro=webdriver.Chrome(executable_path='./chromedriver')
# bro.get("https://www.baidu.com")
# print(bro.get_cookies())
# bro.close()
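
get_cookies() returns a list of dicts, one per cookie, each with at least a name and a value key. To reuse them outside the browser, for example in a cookie pool, they usually need to be flattened first. The following is a minimal sketch; cookies_to_dict and cookie_header are hypothetical helpers, and the sample cookie values are made up.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into {name: value}."""
    return {c["name"]: c["value"] for c in selenium_cookies}


def cookie_header(selenium_cookies):
    """Build the value of an HTTP Cookie header from the same list."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Made-up sample in the shape that bro.get_cookies() returns
sample = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": ".example.com", "path": "/"},
]

print(cookies_to_dict(sample))
print(cookie_header(sample))
```

The flattened dict is the shape that the cookies argument of requests.get accepts, so one login through selenium can serve many plain HTTP requests.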

5. Headless browser

On systems without a graphical interface, such as a Linux server, we can drive a headless browser to fetch the information.

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--window-size=1920,3000')  # set the browser window size (width,height)
chrome_options.add_argument('--disable-gpu')  # the Chrome docs mention this flag as a workaround for a bug
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # do not load images, for speed
chrome_options.add_argument('--headless')  # no visible window; on Linux without a display, startup fails without this flag

bro = webdriver.Chrome(executable_path='./chromedriver.exe', options=chrome_options)

# Open a URL
bro.get('https://www.baidu.com/')


print(bro.get_cookies())

time.sleep(10)

bro.close()

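6. Getting a tag's text and attributes

tag = bro.find_element_by_xpath('//*[@id="lg"]/map/area')
# (Important: getting attributes; a missing attribute gives None)
print(tag.get_attribute('src'))
print(tag.get_attribute('href'))
# (Important: getting the text)
print(tag.text)

# # Getting the element's id, location, tag name, and size (for reference)
# print(tag.id)
# print(tag.location)
# print(tag.tag_name)
# print(tag.size)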

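7. Other usage

Implicit wait: set once on the driver. A lookup waits only while the element is missing and returns as soon as the element has loaded.

Explicit wait: waits for a specific condition on a specific element (WebDriverWait together with expected_conditions), so it has to be written for each element you wait on. Avoid mixing it with an implicit wait, because the combined timeouts become hard to predict.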
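
Both kinds of wait come down to polling: try the lookup, and if it fails, sleep briefly and retry until a deadline. The sketch below shows that idea in plain Python so it runs without a browser. wait_until and element_ready are hypothetical names for this illustration, not Selenium's implementation; in Selenium itself the explicit form is WebDriverWait(driver, timeout).until(condition).

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulated element that only "appears" on the third lookup
attempts = {"n": 0}


def element_ready():
    attempts["n"] += 1
    return "element" if attempts["n"] >= 3 else None


found = wait_until(element_ready, timeout=2, poll=0.01)
print(found, "after", attempts["n"], "lookups")
```

The wait ends as soon as the condition holds, which is why a generous timeout costs nothing when the page loads quickly.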

tag.click()  # click
tag.clear()  # clear the input box
tag.send_keys('xxx')  # type into the input box

Running JavaScript:

import time
from selenium import webdriver
bro=webdriver.Chrome(executable_path='./chromedriver')

bro.get("https://www.cnblogs.com")
# Run JavaScript code
bro.execute_script('alert(1)')
time.sleep(1)
bro.switch_to.alert.accept()  # close the alert, otherwise the next command fails
# Scroll to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

time.sleep(5)
bro.close()

Simulating the browser's back and forward buttons:

import time
from selenium import webdriver
bro=webdriver.Chrome(executable_path='./chromedriver')

bro.get("https://www.cnblogs.com")
time.sleep(1)
bro.get("https://www.baidu.com")
time.sleep(1)
bro.get("https://www.jd.com")

# Go back to the previous page
bro.back()
time.sleep(1)
# Go forward again
bro.forward()

time.sleep(5)
bro.close()

Tab management:

import time
from selenium import webdriver

browser=webdriver.Chrome(executable_path='./chromedriver')
browser.get('https://www.baidu.com')
browser.execute_script('window.open()')

print(browser.window_handles)  # handles of all open tabs
browser.switch_to.window(browser.window_handles[1])  # switch_to_window is deprecated
browser.get('https://www.taobao.com')
time.sleep(2)
browser.switch_to.window(browser.window_handles[0])
browser.get('https://www.sina.com.cn')
browser.quit()  # close every tab and end the session

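Exception handling: wrap the calls in try ... except Exception as e:

from selenium import webdriver

# Create the browser before the try block, so that finally never sees an undefined name
browser=webdriver.Chrome(executable_path='./chromedriver')
try:
    browser.get('http://www.baidu.com')
    browser.find_element_by_id("xxx")  # raises NoSuchElementException if no element has this id

except Exception as e:
    print(e)
finally:  # close the browser whether or not an exception occurred
    browser.close()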

4. Exercise: scraping product information from JD.com

from selenium import webdriver
import time

from selenium.webdriver.common.keys import Keys

def get_goods(bro):
    # Scroll to the bottom so that lazily loaded items are rendered
    # bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    # Get all products on the page
    # find_elements_by_class_name  returns every match
    # find_element_by_class_name   returns the first match
    li_list = bro.find_elements_by_class_name('gl-item')
    for li in li_list:
        # Look up the url, image, price, and name of one product
        url = li.find_element_by_css_selector('.p-img>a').get_attribute('href')
        if not url:
            continue
        img = li.find_element_by_css_selector('.p-img img')
        url_img = img.get_attribute("src")
        # Images further down the page have no src yet; the link is in data-lazy-img
        if not url_img:
            url_img = "https:" + (img.get_attribute('data-lazy-img') or '')
        price = li.find_element_by_css_selector('.p-price i').text
        name = li.find_element_by_css_selector('.p-name em').text
        print(f'''
        Name: {name},
        Image: {url_img},
        Price: {price},
        Link: {url}
        ''')

    # After the whole page is processed, find the next-page link
    # (partial match on the link text, which means "next page")
    next_page = bro.find_element_by_partial_link_text('下一页')
    time.sleep(1)
    next_page.click()
    # Continue with the next page; on the last page the lookup above raises and ends the crawl
    get_goods(bro)
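# Path to the driver (relative or absolute); created before try so that finally can always close it
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
try:
    # Open a URL
    bro.get('https://www.jd.com/')

    # Implicit wait (at most the given number of seconds)
    # A lookup waits only while the element is missing; once it has loaded, it is returned immediately
    bro.implicitly_wait(20)

    input_search = bro.find_element_by_id('key')
    input_search.send_keys('手机')  # the search term (mobile phone)

    # # Find the search button
    # enter = bro.find_element_by_id('su')
    # # Click the button
    # enter.click()

    # Simulate a key press instead
    input_search.send_keys(Keys.ENTER)
    get_goods(bro)
    time.sleep(10)
except Exception as e:
    print(e)
finally:
    bro.close()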
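
The lazy-loaded image links in this exercise are protocol-relative (they start with //), which is why the code prepends "https:". A slightly more general way to handle this is urllib.parse.urljoin, sketched below. pick_image_url is a hypothetical helper, and the default page_url and the image hosts in the usage lines are assumptions made for this illustration.

```python
from urllib.parse import urljoin


def pick_image_url(src, lazy_src, page_url="https://search.jd.com/"):
    """Prefer src, fall back to the lazy-load attribute, and make the URL absolute.

    Returns None when neither attribute holds a value.
    """
    raw = src or lazy_src
    if not raw:
        return None
    # urljoin fills in the scheme for //host/path links and leaves absolute URLs alone
    return urljoin(page_url, raw)


print(pick_image_url(None, "//img.example.com/phone.jpg"))
print(pick_image_url("https://img.example.com/loaded.jpg", None))
print(pick_image_url(None, None))
```

Inside get_goods this would take the two get_attribute results as its arguments, which also covers products where neither attribute is set.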

Posted 2020-04-09 22:46 by Mr沈
