xpath

(1) Introduction

XPath finds information in XML documents
It also works on HTML
It navigates a document through elements and attributes

 pip install lxml

 from lxml import etree
 
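# Parse the page source into an element tree that XPath can query
selector = etree.HTML(html_source)  # html_source: the page source as a string
 
# For node-selecting expressions, xpath() returns a list of matches
res = selector.xpath(expression)  # expression: an XPath string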

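(2) Usage

(1) Path expressions

Expression	Description	Example	Explanation
/	Selects from the root node	`/html/body/div[1]`	Selects the first div under body, starting from the root html element
//	Selects matching nodes anywhere in the document, regardless of their position	`//a`	Selects every a tag in the document
./	Selects relative to the current node, for a further XPath query on an element	`./a`	Selects every a tag that is a direct child of the current node
@	Selects attributes	`//@class`	Selects every class attribute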
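The four expressions in the table can be tried side by side. The following is a minimal sketch, assuming lxml is installed; the markup and the names `root`, `first_div`, and `all_links` are made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_doc = """
<html><body>
<div class="nav"><a href="/home">Home</a></div>
<div class="main"><p><a href="/about">About</a></p></div>
</body></html>
"""

root = etree.HTML(html_doc)

# "/" walks down from the document root one step at a time.
first_div = root.xpath("/html/body/div[1]")[0]
print(first_div.get("class"))  # nav

# "//" matches at any depth in the document.
all_links = root.xpath("//a")
print(len(all_links))  # 2

# "./" is relative to the element the query is called on.
print(first_div.xpath("./a/@href"))  # ['/home']

# "@" selects attributes; here every class attribute in the document.
print(root.xpath("//@class"))  # ['nav', 'main']
```

Note that `./a` only finds direct children, so calling it on the second div would return an empty list because its link is nested inside a p tag.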

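An XPath query that selects tags returns a list of Element objects, so loop over the list and run a further XPath query on each element.
text() returns the text inside a tag.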

 from lxml import etree
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
 
selector = etree.HTML(html_doc)
res = selector.xpath("//a")
 
print(res)  # [<Element a at 0x2155de71640>, <Element a at 0x2155e3976c0>, <Element a at 0x2155e397800>]
 
for result in res:
    href = result.xpath('./@href')       # list holding the href attribute value
    href1 = result.xpath('./@href')[0]   # the href attribute value itself
    text = result.xpath('./text()')[0]   # the text inside the tag
 
    print(href)
    # one list per iteration: ['http://example.com/elsie'] ['http://example.com/lacie'] ['http://example.com/tillie']
 
    print(href1)
    # http://example.com/elsie http://example.com/lacie http://example.com/tillie
 
    print(text)
    # Elsie Lacie Tillie
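Two details of the loop above are easy to trip over: `text()` only returns text that sits directly inside the tag, and indexing with `[0]` fails when nothing matched. A minimal sketch of both, assuming lxml is installed; the one-line document and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical one-line document, used only for illustration.
html_doc = '<p class="title">Chapter <b>One</b></p>'
p = etree.HTML(html_doc).xpath("//p")[0]

direct = p.xpath("./text()")    # text nodes that are direct children of <p>
nested = p.xpath(".//text()")   # text nodes at any depth below <p>
joined = p.xpath("string(.)")   # all descendant text as one string

print(direct)  # ['Chapter ']
print(nested)  # ['Chapter ', 'One']
print(joined)  # Chapter One

# A query with no match returns an empty list, so [0] would raise IndexError.
ids = p.xpath("./@id")
first_id = ids[0] if ids else None
print(first_id)  # None
```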

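(2) Predicates

A predicate finds a specific node, or a node that contains a specific value.
Predicates are written inside square brackets.
The table below lists some path expressions with predicates, together with their results:

Path expression	Result
/ul/li[1]	Selects the first li element that is a child of ul.
/ul/li[last()]	Selects the last li element that is a child of ul.
/ul/li[last()-1]	Selects the second-to-last li element that is a child of ul.
//ul/li[position()<4]	Selects the first three li elements that are children of a ul element.
//a[@title]	Selects all a elements that have a title attribute.
//a[@title='xx']	Selects all a elements whose title attribute has the value xx.
//a[@title>10]	Selects all a elements whose title attribute has a numeric value greater than 10. The operators `> < >= <= !=` can be used the same way.
/body/div[@price>35.00]	Selects the div nodes under body whose price attribute is greater than 35.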
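The predicates in the table can be checked against a small document. This is a sketch under stated assumptions: lxml is installed, and the markup (including the non-standard `price` attribute) is invented for the example. The queries start with `//` rather than `/` because `etree.HTML` wraps the content in html and body tags.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_doc = """
<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
  <li>four</li>
</ul>
<div price="20">cheap</div>
<div price="40">expensive</div>
<a href="/a" title="xx">first</a>
<a href="/b">second</a>
"""
root = etree.HTML(html_doc)

# Positional predicates (XPath positions start at 1, not 0).
print(root.xpath("//ul/li[1]/text()"))             # ['one']
print(root.xpath("//ul/li[last()]/text()"))        # ['four']
print(root.xpath("//ul/li[last()-1]/text()"))      # ['three']
print(root.xpath("//ul/li[position()<4]/text()"))  # ['one', 'two', 'three']

# Attribute predicates: presence, exact value, numeric comparison.
print(root.xpath("//a[@title]/text()"))            # ['first']
print(root.xpath("//a[@title='xx']/@href"))        # ['/a']
print(root.xpath("//div[@price>35.00]/text()"))    # ['expensive']
```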

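(3) Selecting unknown nodes

[1] Syntax

XPath wildcards select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.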

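[2] Examples

The table below lists some path expressions, together with their results:

Path expression	Result
/ul/*	Selects all child elements of the ul element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.
//node()	Selects all nodes in the document.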
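The wildcards can be compared on one small element tree. A minimal sketch, assuming lxml is installed; the markup and variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_doc = '<ul id="menu"><li class="item">Hello <b>world</b>!</li><li>two</li></ul>'
root = etree.HTML(html_doc)
ul = root.xpath("//ul")[0]

# "*" matches child elements whatever their tag name.
child_tags = [el.tag for el in ul.xpath("./*")]
print(child_tags)  # ['li', 'li']

# "//*" matches every element, including the html and body wrappers.
all_tags = [el.tag for el in root.xpath("//*")]
print(all_tags)

# "@*" matches any attribute: as a value selector, or as an "has attributes" test.
print(ul.xpath("./@*"))              # ['menu']
with_attrs = root.xpath("//li[@*]")
print(len(with_attrs))               # 1

# "node()" also returns text nodes, not only elements.
nodes = with_attrs[0].xpath("./node()")
print(len(nodes))                    # 3: 'Hello ', the b element, '!'
```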

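(4) Fuzzy matching

contains() matches when its first argument contains the second as a substring. The expression below selects every div whose id attribute contains "he":

 //div[contains(@id, "he")]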
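To see which elements a substring match picks up, the expression can be run on a few sample ids. This is a minimal sketch, assuming lxml is installed; the markup is made up, and `starts-with()` is a related XPath 1.0 function shown only for contrast, not something the text above covers.

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
html_doc = """
<div id="header">top</div>
<div id="the-content">middle</div>
<div id="footer">bottom</div>
"""
root = etree.HTML(html_doc)

# contains() matches "he" anywhere in the id value.
matches = root.xpath('//div[contains(@id, "he")]')
ids = [div.get("id") for div in matches]
print(ids)  # ['header', 'the-content']

# starts-with() only matches when the id begins with "he".
starts = root.xpath('//div[starts-with(@id, "he")]/@id')
print(starts)  # ['header']
```

Because the match is a plain substring test, "the-content" is selected as well, which may or may not be intended.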

posted @ 2024-03-31 17:04 ssrheart
