Web Scraping: XPath

XPath

XPath (XML Path Language) is a language for locating parts of an XML document. It works equally well for querying HTML documents.

<ul class="book_list">
    <li>
        <title class="book_001">Harry Potter</title>
        <author>J. K. Rowling</author>
        <year>2005</year>
        <price>69.99</price>
    </li>
    
    <li>
        <title class="book_002">Spider</title>
        <author>Chancey</author>
        <year>2019</year>
        <price>49.99</price>
    </li>
</ul>

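Matching syntax:

1. Find all li tags
//li
2. Find the title child nodes of li whose class is "book_001"
//li/title[@class="book_001"]
3. Get the class attribute values of all title nodes under li
//li/title/@class

Whenever a condition is involved, wrap it in [].

To get an attribute value, prefix the attribute name with @.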
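The three expressions can be tried against the book list with lxml, the library introduced later in this post. This is a minimal sketch: it assumes the list's class is "book_list" and parses the fragment with etree.fromstring, which works here because the fragment happens to be well-formed XML.

```python
from lxml import etree

# The book list from above (class assumed to be "book_list").
xml = """
<ul class="book_list">
    <li>
        <title class="book_001">Harry Potter</title>
        <author>J. K. Rowling</author>
        <year>2005</year>
        <price>69.99</price>
    </li>
    <li>
        <title class="book_002">Spider</title>
        <author>Chancey</author>
        <year>2019</year>
        <price>49.99</price>
    </li>
</ul>
"""
root = etree.fromstring(xml)

li_nodes = root.xpath("//li")                              # 1. all li nodes
first_title = root.xpath('//li/title[@class="book_001"]')  # 2. condition in []
class_values = root.xpath("//li/title/@class")             # 3. attribute values via @

print(len(li_nodes))        # 2
print(first_title[0].text)  # Harry Potter
print(class_values)         # ['book_001', 'book_002']
```

The first two calls return element objects, while the third returns plain strings.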

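I. Selecting nodes

// : search all nodes in the document, children and deeper descendants alike
@ : select an attribute, either inside a condition or to get its value

# Attribute value as a condition
//div[@class="movie"]

# Get the attribute value directly
//div/a/@src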
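The difference between a single / (direct children only) and // (any depth) is easiest to see side by side. The markup below is invented for illustration and is not taken from any real page.

```python
from lxml import etree

# Hypothetical markup: one link is a direct child of the div, one is nested deeper.
html = """
<div class="movie">
    <a href="/films/1">First</a>
    <p><a href="/films/2">Second</a></p>
</div>
<div class="other"><a href="/films/3">Third</a></div>
"""
tree = etree.HTML(html)

# Attribute value used as a condition: only the div with class="movie"
movie_divs = tree.xpath('//div[@class="movie"]')

# Single slash: only a nodes that are direct children of that div
direct_links = tree.xpath('//div[@class="movie"]/a/@href')

# Double slash: a nodes at any depth below that div
all_links = tree.xpath('//div[@class="movie"]//a/@href')

print(len(movie_divs))  # 1
print(direct_links)     # ['/films/1']
print(all_links)        # ['/films/1', '/films/2']
```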

II. Matching multiple paths

Join expressions with | to get the nodes matched by any one of them:

xpath1 | xpath2 | xpath3
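As a rough sketch of the | operator, the call below collects book titles and prices in a single query. It reuses a shortened copy of the book list (class assumed to be "book_list") and parses it as XML with lxml.

```python
from lxml import etree

# Shortened copy of the book list, kept to title and price only.
xml = """<ul class="book_list">
    <li><title class="book_001">Harry Potter</title><price>69.99</price></li>
    <li><title class="book_002">Spider</title><price>49.99</price></li>
</ul>"""
root = etree.fromstring(xml)

# One call, two paths: text nodes matched by either expression are returned
values = root.xpath("//li/title/text() | //li/price/text()")

print(len(values))
print(values)
```

Because both kinds of value arrive in one flat list, this form suits cases where the pieces do not need to be paired up per book afterwards.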

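III. Common functions

contains(): matches nodes whose attribute value contains a given string

Find the title nodes whose class attribute value contains "book_"

//title[contains(@class,"book_")]

text(): gets the text content of a node

Find the names of all books

//ul[@class="book_list"]/li/title/text()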
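Both functions can be checked with lxml. In this sketch a third entry with class "mag_001" has been added to the list purely for illustration, so that contains() has something to filter out. It is not part of the original sample.

```python
from lxml import etree

# Book list plus one invented entry ("mag_001") that contains() should skip.
xml = """<ul class="book_list">
    <li><title class="book_001">Harry Potter</title></li>
    <li><title class="book_002">Spider</title></li>
    <li><title class="mag_001">Weekly</title></li>
</ul>"""
root = etree.fromstring(xml)

# contains(): keep only title nodes whose class contains "book_"
book_titles = root.xpath('//title[contains(@class, "book_")]')
book_classes = [t.get("class") for t in book_titles]

# text(): the text content of every title under the list
names = root.xpath('//ul[@class="book_list"]/li/title/text()')

print(book_classes)  # ['book_001', 'book_002']
print(names)         # ['Harry Potter', 'Spider', 'Weekly']
```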

IV. Case study

Using the Maoyan Movies top chart as an example

Get the dd nodes that hold the movie information

//dl[@class="board-wrapper"]/dd

Get the movie names

//dl[@class="board-wrapper"]/dd//p[@class="name"]/a/text()

Get the leading actors

//dl[@class="board-wrapper"]/dd//p[@class="star"]/text()

Get the release dates

//dl[@class="board-wrapper"]/dd//p[@class="releasetime"]/text()
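The expressions above can be exercised offline. The fragment below is invented: it only mirrors the class names that appear in the expressions, and the real page may be laid out differently or may change over time. Fetching the live page is left out on purpose.

```python
from lxml import etree

# Invented fragment that mirrors the class names used in the expressions above.
html = """
<dl class="board-wrapper">
    <dd>
        <p class="name"><a href="/films/1">Movie A</a></p>
        <p class="star">
            Starring: Actor One, Actor Two
        </p>
        <p class="releasetime">Release date: 1993-01-01</p>
    </dd>
    <dd>
        <p class="name"><a href="/films/2">Movie B</a></p>
        <p class="star">
            Starring: Actor Three
        </p>
        <p class="releasetime">Release date: 1994-09-10</p>
    </dd>
</dl>
"""
tree = etree.HTML(html)
base = '//dl[@class="board-wrapper"]/dd'

names = tree.xpath(base + '//p[@class="name"]/a/text()')
# text() keeps surrounding whitespace, so strip each value
stars = [s.strip() for s in tree.xpath(base + '//p[@class="star"]/text()')]
released = [t.strip() for t in tree.xpath(base + '//p[@class="releasetime"]/text()')]

for name, star, date in zip(names, stars, released):
    print(name, star, date, sep=" | ")
```

Pairing three separate lists with zip only lines up correctly when every dd contains all three fields. Iterating over the dd nodes one at a time avoids that risk.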

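lxml

XPath is a matching tool from the XML world. Python's standard library offers only limited XPath support, so this post uses the third-party lxml library.

I. Installation

pip install lxml

II. Usage flow

Import: from lxml import etree
Create the parse object: parse_html = etree.HTML(html)
Call XPath on the parse object: parse_html.xpath("xpath expression")

The workflow resembles regular expressions: write a pattern, then collect every match.

For the node, attribute, and text() expressions used in this post, xpath() always returns a list, even when there is only one match or none at all.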
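The three steps fit in a few lines. This sketch uses a one-line invented snippet, and also shows why results should be checked before indexing: an expression with no match gives back an empty list instead of raising an error.

```python
from lxml import etree

# Invented one-line snippet, just enough to run the three steps.
html = '<div id="box"><a href="/a">A</a><a href="/b">B</a></div>'

# Step 2: create the parse object
parse_html = etree.HTML(html)

# Step 3: call xpath on it
links = parse_html.xpath("//a/@href")     # two matches
missing = parse_html.xpath("//img/@src")  # no img in the snippet

# Guard before indexing, since a miss yields an empty list
first_src = missing[0] if missing else None

print(links)      # ['/a', '/b']
print(missing)    # []
print(first_src)  # None
```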

Example:

'''
Tasks, using the sample page stored in html_text below:
1. Get the text content of all a nodes
2. Get the href attribute values of all a nodes
3. Get the href attribute values of all a nodes, excluding "/"
4. Get 新闻, 军事, ..., excluding 新浪社会
'''

from lxml import etree

# Sample markup kept inline. The original imported it from a local module
# named html, which shadows Python's standard-library html module.
html_text = """
<div class="wrapper">
    <i class="iconfont icon-back" id="back"></i>
    <a href="/" id="channel">新浪社会</a>
    <ul id="nav">
        <li><a href="http://news.sina.com.cn/" target="_blank"><b>新闻</b></a></li>
        <li><a href="http://mil.news.sina.com.cn/" target="_blank">军事</a></li>
        <li><a href="https://news.sina.com.cn/china/" target="_blank">国内</a></li>
        <li><a href="http://news.sina.com.cn/world/" target="_blank">国际</a></li>
        <li><a href="http://finance.sina.com.cn/" target="_blank"><b>财经</b></a></li>
        <li><a href="http://finance.sina.com.cn/stock/" target="_blank">股票</a></li>
        <li><a href="http://finance.sina.com.cn/fund/" target="_blank">基金</a></li>
        <li><a href="http://finance.sina.com.cn/forex/" target="_blank">外汇</a></li>
    </ul>
    <i class="iconfont icon-liebiao" id="menu"></i>
</div>
"""
parse_html = etree.HTML(html_text)

# 1. //text() also reaches text wrapped in <b>, which a/text() would skip
r_list1 = parse_html.xpath("//a//text()")

# 2. Every href, including "/"
r_list2 = parse_html.xpath("//a/@href")

# 3. Only the links inside the nav list, so "/" is left out
r_list3 = parse_html.xpath('//ul[@id="nav"]/li/a/@href')

# 4. Nav link text only, so 新浪社会 is left out
r_list4 = parse_html.xpath('//ul[@id="nav"]/li/a//text()')

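III. Advanced usage

Base XPath expression: returns a list of node objects
Iterate over that list and call xpath again on each node to extract the details

Note: inside the loop, start the XPath expression with . so that it is evaluated relative to the current node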
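A minimal sketch of this two-stage pattern, using a shortened copy of the book list (class assumed to be "book_list"). The last line shows what happens when the leading dot is forgotten.

```python
from lxml import etree

# Shortened copy of the book list, kept to title and price only.
xml = """<ul class="book_list">
    <li><title class="book_001">Harry Potter</title><price>69.99</price></li>
    <li><title class="book_002">Spider</title><price>49.99</price></li>
</ul>"""
root = etree.fromstring(xml)

books = []
# Stage 1: the base expression returns one node object per book
for li in root.xpath('//ul[@class="book_list"]/li'):
    # Stage 2: expressions starting with . are relative to the current li
    title = li.xpath("./title/text()")[0]
    price = float(li.xpath("./price/text()")[0])
    books.append({"title": title, "price": price})

print(books)

# Without the leading dot, // searches the whole document again,
# so every title comes back even though xpath was called on one li
first_li = root.xpath("//li")[0]
all_titles = first_li.xpath("//title/text()")
print(all_titles)  # ['Harry Potter', 'Spider']
```

Collecting each record inside the loop keeps a book's fields together, which separate whole-page queries cannot guarantee when some field is missing.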

IV. Summary

Expressions that return a list of node objects
- //div
- //div[@class="student"]
- //div/a[@title="student"]/span
- ...
Expressions that return a list of strings end with
- @src
- @href
- text()
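The two kinds of result can be told apart at runtime. The markup in this sketch is invented to match the expressions listed above. etree.iselement is lxml's helper for checking whether a value is an element object.

```python
from lxml import etree

# Invented markup matching the expressions in the summary.
html = """
<div class="student">
    <a title="student" href="/s/1"><span>Ann</span></a>
    <img src="/img/ann.png">
</div>
"""
tree = etree.HTML(html)

# Ends with a node name: list of element objects
nodes = tree.xpath('//div/a[@title="student"]/span')

# Ends with @attr or text(): list of strings
hrefs = tree.xpath("//div/a/@href")
srcs = tree.xpath("//div/img/@src")
texts = tree.xpath("//div/a/span/text()")

print(etree.iselement(nodes[0]), nodes[0].tag)  # True span
print(hrefs, srcs, texts)  # ['/s/1'] ['/img/ann.png'] ['Ann']
```

Element objects can be queried again with xpath, while the string results are ready to store or print.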

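When extracting several pieces of information from the same page, it is best to match the enclosing parent nodes first, then iterate over that list and extract each piece of information from every node in turn.

Inside that loop, start each XPath expression with . so that the search continues from the node being iterated instead of from the whole document.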

Posted 2019-09-07 09:37 by ChanceySolo
