Parsing HTML with XPath

XPath

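XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

In web scraping, it is mainly used to parse HTML.

The HTML to parse: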

from lxml import etree

# The HTML markup to parse (the sample text is left in Chinese; e.g. 第一个a标签 means "the first a tag")
html_str = """
<li data_group="server" class="content"> 
    <a href="/commands.html" class="index" name="a1">第一个a标签</a>
    <a href="/commands.html" class="index2" name="a2">第二个a标签</a>
    <a href="/commands/flushdb.html">
        <span class="first">
            这是第一个span标签
            <span class="second">
            这是第二个span标签,第一个下的子span标签
            </span>
        </span>
        <span class="third">这是第三个span标签</span>
        <h3>这是一个h3</h3>
    </a></li>
"""
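The string above can also be parsed directly, without saving it to a file first. The following is a minimal sketch, assuming a shortened copy of the sample markup; `etree.HTML()` is lxml's helper for parsing an HTML string with its lenient HTML parser.

```python
from lxml import etree

# Shortened copy of the sample markup (assumption: same structure, fewer tags)
html_str = """
<li data_group="server" class="content">
    <a href="/commands.html" class="index" name="a1">第一个a标签</a>
    <a href="/commands.html" class="index2" name="a2">第二个a标签</a>
</li>
"""

# etree.HTML() parses the string and returns the root <html> element,
# adding the missing <html> and <body> wrappers around the fragment
root = etree.HTML(html_str)
print(root.tag)   # html

hrefs = root.xpath("//a/@href")
print(hrefs)      # ['/commands.html', '/commands.html']
```

Note that `etree.HTML()` returns an element rather than an `_ElementTree`, but both expose the same `xpath()` method.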

1. Reading and parsing a file

# Parse the xpath.html file (it contains the markup shown above)
html = etree.parse('xpath.html')
print(html, type(html))  # <lxml.etree._ElementTree object at 0x00000141445A08C8> <class 'lxml.etree._ElementTree'>
a = html.xpath("//a")
print(a, type(a))  # [<Element a at 0x141445a0808>, <Element a at 0x141445a0908>, <Element a at 0x141445a0948>] <class 'list'>
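By default `etree.parse()` uses the XML parser, which works here only because the sample markup happens to be well-formed. Real pages often are not. The sketch below, which assumes a hypothetical page with an unclosed `<br>` tag written to a temporary file, passes an `etree.HTMLParser()` so that such markup is tolerated.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page: the unclosed <br> is valid HTML but not well-formed XML
page = '<html><body><p>line one<br>line two</p><a href="/x.html">link</a></body></html>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # Passing an HTMLParser switches etree.parse() to the lenient HTML parser
    tree = etree.parse(path, etree.HTMLParser())
    links = tree.xpath("//a/@href")

print(links)  # ['/x.html']
```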

2. Getting a tag's attributes and text

# Find the href and the text of every a tag
a = html.xpath("//a")
a_href = html.xpath("//a/@href")
a_text = html.xpath("//a/text()")
print(a, type(a))   # [<Element a at 0x191c1691888>, <Element a at 0x191c1691848>, <Element a at 0x191c1691948>] <class 'list'>
print(a_href, type(a_href))  # ['/commands.html', '/commands.html', '/commands/flushdb.html'] <class 'list'>
print(a_text, type(a_text), len(a_text))
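Querying `@href` and `text()` separately gives two lists that do not necessarily line up, because one a tag can contribute several text nodes. One way to keep each link together with its text is to loop over the elements and query each one individually. This is a sketch on a condensed copy of the sample markup, not the only possible approach.

```python
from lxml import etree

# Condensed copy of the sample markup; it is well-formed, so the XML parser accepts it
root = etree.fromstring("""
<li>
    <a href="/commands.html" name="a1">第一个a标签</a>
    <a href="/commands.html" name="a2">第二个a标签</a>
    <a href="/commands/flushdb.html">
        <span class="third">这是第三个span标签</span>
    </a>
</li>
""")

pairs = []
for a in root.xpath("//a"):
    href = a.get("href")
    # string(.) joins all descendant text; normalize-space() trims and collapses whitespace
    text = a.xpath("normalize-space(string(.))")
    pairs.append((href, text))

for href, text in pairs:
    print(href, text)
```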

3. Finding specific tags

# Find the span tag with class="first"
span_first = html.xpath("//span[@class='first']")
span_first_text = html.xpath("//span[@class='first']/text()")
print(span_first, type(span_first))   # [<Element span at 0x...>] <class 'list'>
print(span_first_text, type(span_first_text))  # ['这是第一个span标签\n\t\t', '\n\t'] <class 'list'>
# Find the second a tag (Python list indexes start at 0)
a_second = html.xpath("//a")[1]
# print(a_second, type(a_second))    # <Element a at 0x23844950808> <class 'lxml.etree._Element'>
a_second_text = a_second.text
# Note: lxml elements have no get_text() method (that is BeautifulSoup's API); use the .text attribute as above
print(a_second_text, type(a_second_text))   # 第二个a标签 <class 'str'>
a_second_href = a_second.get("href")
print(a_second_href)  #  /commands.html
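Picking a tag by list index breaks as soon as the page order changes. Predicates on attributes are often more robust. The sketch below uses a condensed copy of the sample markup and the standard XPath 1.0 functions `starts-with()` and `not()`.

```python
from lxml import etree

# Condensed copy of the sample markup; it is well-formed, so the XML parser accepts it
root = etree.fromstring("""
<li data_group="server" class="content">
    <a href="/commands.html" class="index" name="a1">第一个a标签</a>
    <a href="/commands.html" class="index2" name="a2">第二个a标签</a>
    <a href="/commands/flushdb.html"><span class="third">这是第三个span标签</span></a>
</li>
""")

# Exact match on an attribute value
by_name = root.xpath("//a[@name='a2']/text()")
print(by_name)     # ['第二个a标签']

# Prefix match: class is "index" or "index2"
by_prefix = root.xpath("//a[starts-with(@class, 'index')]/@name")
print(by_prefix)   # ['a1', 'a2']

# a tags that have no class attribute at all
no_class = root.xpath("//a[not(@class)]/@href")
print(no_class)    # ['/commands/flushdb.html']
```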

4. Child tags and descendant tags

# Find every span tag that is a descendant (at any depth) of an a tag under li
span_all = html.xpath("//li/a//span")
print(span_all, type(span_all), len(span_all))
# [<Element span at 0x2d9dcd18888>, <Element span at 0x2d9dcd18988>, <Element span at 0x2d9dcd189c8>] <class 'list'> 3
# Find only the span tags that are direct children of an a tag under li
span = html.xpath("//li/a/span")
print(span, type(span), len(span))
# [<Element span at 0x188548118c8>, <Element span at 0x18854811a08>] <class 'list'> 2
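The `/` and `//` steps are abbreviations for the `child::` and `descendant::` axes of XPath 1.0. Writing the axes out, as in this sketch on a condensed copy of the sample markup, makes the difference between the two queries explicit.

```python
from lxml import etree

# Condensed copy of the sample markup; it is well-formed, so the XML parser accepts it
root = etree.fromstring("""
<li data_group="server" class="content">
    <a href="/commands/flushdb.html">
        <span class="first">
            <span class="second">nested</span>
        </span>
        <span class="third">sibling</span>
    </a>
</li>
""")

# descendant:: reaches spans at any depth below the a tag (same result as //li/a//span)
descendants = root.xpath("//li/a/descendant::span/@class")
print(descendants)  # ['first', 'second', 'third']

# child:: reaches only the spans directly below the a tag (same result as //li/a/span)
children = root.xpath("//li/a/child::span/@class")
print(children)     # ['first', 'third']
```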

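Path expressions

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects nodes in the document, starting from the current node, that match the selection, regardless of where they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.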
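The `.`, `..` and `@` expressions are most useful when `xpath()` is called on an element rather than on the whole document. The sketch below uses a simplified fragment with an extra, hypothetical `outside` span to show that a path starting with `//` still searches the whole document, while `.//` stays below the current element.

```python
from lxml import etree

# Simplified fragment; the "outside" span is an assumption added for illustration
root = etree.fromstring("""
<li class="content">
    <a href="/commands/flushdb.html">
        <span class="first">outer<span class="second">inner</span></span>
        <span class="third">third</span>
    </a>
    <span class="outside">not inside the a tag</span>
</li>
""")

inner = root.xpath("//span[@class='second']")[0]
parent_class = inner.xpath("../@class")         # .. is the parent span
grandparent_tag = inner.xpath("../..")[0].tag   # two levels up is the a tag
print(parent_class, grandparent_tag)            # ['first'] a

a = root.xpath("//a")[0]
own_href = a.xpath("./@href")          # attribute of the current node
relative = a.xpath(".//span/@class")   # only spans below this a tag
absolute = a.xpath("//span/@class")    # every span in the document
print(own_href)   # ['/commands/flushdb.html']
print(relative)   # ['first', 'second', 'third']
print(absolute)   # ['first', 'second', 'third', 'outside']
```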

Wildcards

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.
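A short sketch of the three wildcards, assuming a one-line variant of the sample markup. In lxml, element nodes come back as `_Element` objects, while attribute and text nodes come back as strings.

```python
from lxml import etree

# One-line variant of the sample markup (assumption: trimmed for readability)
root = etree.fromstring(
    '<li data_group="server" class="content">'
    '<a href="/commands/flushdb.html">text<span class="third">s</span><h3>h</h3></a>'
    '</li>'
)

# * matches every child element of the a tag, whatever its name
child_tags = [el.tag for el in root.xpath("//a/*")]
print(child_tags)        # ['span', 'h3']

# @* matches every attribute of the li tag
li_attrs = root.xpath("/li/@*")
print(sorted(li_attrs))  # ['content', 'server']

# node() matches text nodes as well as elements
nodes = root.xpath("//a/node()")
kinds = [n if isinstance(n, str) else n.tag for n in nodes]
print(kinds)             # ['text', 'span', 'h3']
```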

XPath operators

Operator	Description	Example	Result
|	Union of two node-sets	//book | //cd	Returns a node-set containing all book and cd elements
+	Addition	6 + 4	10
-	Subtraction	6 - 4	2
*	Multiplication	6 * 4	24
div	Division	8 div 4	2
=	Equal	price=9.80	true if price is 9.80; false if price is 9.90.
!=	Not equal	price!=9.80	true if price is 9.90; false if price is 9.80.
<	Less than	price<9.80	true if price is 9.00; false if price is 9.90.
<=	Less than or equal to	price<=9.80	true if price is 9.00; false if price is 9.90.
>	Greater than	price>9.80	true if price is 9.90; false if price is 9.80.
>=	Greater than or equal to	price>=9.80	true if price is 9.90; false if price is 9.70.
or	Logical or	price=9.80 or price=9.70	true if price is 9.80; false if price is 9.50.
and	Logical and	price>9.00 and price<9.90	true if price is 9.80; false if price is 8.50.
mod	Modulus (remainder of a division)	5 mod 2	1
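The operators can be tried out directly with lxml. The sketch below assumes a small hypothetical `bookstore` document modelled on the book, cd and price names used in the table. Note that lxml returns numeric results as Python floats and comparison results as booleans, not as lists.

```python
from lxml import etree

# Hypothetical data, made up to match the element names used in the table
root = etree.fromstring(
    "<bookstore>"
    "<book><price>9.80</price></book>"
    "<book><price>9.90</price></book>"
    "<cd><price>9.00</price></cd>"
    "</bookstore>"
)

# | merges two node-sets; the result is in document order
union = [el.tag for el in root.xpath("//book | //cd")]
print(union)        # ['book', 'book', 'cd']

# Arithmetic operators return floats
total = root.xpath("6 + 4")
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")
print(total, quotient, remainder)   # 10.0 2.0 1.0

# Comparison and logical operators are typically used inside predicates
cheap = root.xpath("//price[. < 9.85]/text()")
in_range = root.xpath("//price[. > 9.00 and . < 9.90]/text()")
print(cheap)        # ['9.80', '9.00']
print(in_range)     # ['9.80']

# A comparison at the top level returns a boolean
two_books = root.xpath("count(//book) = 2")
print(two_books)    # True
```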


Question: what is the difference between html.xpath("//li/a/text()")[1] and html.xpath("//li/a[1]/text()")?

a_second_2 = html.xpath("//li/a/text()")[1]
a_second_1 = html.xpath("//li/a[1]/text()")
print(a_second_2, a_second_1)   # 第二个a标签 ['第一个a标签']

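"""
a_second_2 prints 第二个a标签, a plain string.
a_second_1 prints ['第一个a标签'], a list.
For expressions that select nodes, xpath() returns a list.
a_second_1: XPath positions start at 1, so a[1] selects the first a tag under li; the result is the list of its text nodes.
a_second_2: the expression collects the text nodes of every a tag under li, and the Python index [1] (0-based) then picks the second item of that list.
"""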
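A further difference shows up once there is more than one li tag: the predicate `a[1]` is evaluated per parent, so it selects the first a tag of every li. The sketch below assumes a hypothetical list with two li tags; wrapping the path in parentheses applies the position to the combined result instead.

```python
from lxml import etree

# Hypothetical markup with two li tags, made up for illustration
root = etree.fromstring(
    "<ul>"
    "<li><a>one</a><a>two</a></li>"
    "<li><a>three</a><a>four</a></li>"
    "</ul>"
)

# a[1] is the first a tag within each li
per_parent = root.xpath("//li/a[1]/text()")
print(per_parent)     # ['one', 'three']

# (...)[1] is the first node of the whole result
overall = root.xpath("(//li/a)[1]/text()")
print(overall)        # ['one']

# The Python index is 0-based and applies to the returned list
python_index = root.xpath("//li/a/text()")[1]
print(python_index)   # two

# last() selects the last a tag within each li
last_per_parent = root.xpath("//li/a[last()]/text()")
print(last_per_parent)  # ['two', 'four']
```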

"""
Text of each a tag (the list is empty when nothing matches):
html.xpath("//li/a[1]/text()")  ->  ['第一个a标签']
html.xpath("//li/a[2]/text()")  ->  ['第二个a标签']
html.xpath("//li/a[3]/text()")  ->  ['\n\t', '\n\t', '\n\t', '\n\t']
html.xpath("//li/a/text()")     ->  ['第一个a标签', '第二个a标签', '\n\t', '\n\t', '\n\t', '\n\t']
When an a tag contains other tags, the whitespace between those tags (here \n\t) is also returned as text nodes.
"""
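The whitespace-only entries can be filtered out either inside the XPath expression or afterwards in Python. The following is a sketch on a tab-indented fragment modelled on the sample markup (the exact indentation is an assumption), using the standard XPath 1.0 function `normalize-space()`.

```python
from lxml import etree

# Tab-indented fragment modelled on the sample markup (indentation is an assumption)
root = etree.fromstring(
    "<li>\n\t<a>第一个a标签</a>\n\t<a>\n\t\t<span>这是第三个span标签</span>"
    "\n\t\t<h3>这是一个h3</h3>\n\t</a>\n</li>"
)

# Direct text nodes of the second a tag are whitespace only
raw = root.xpath("/li/a[2]/text()")
print(raw)        # ['\n\t\t', '\n\t\t', '\n\t']

# The predicate drops text nodes that are empty after whitespace normalization
non_blank = root.xpath("/li/a[2]/text()[normalize-space()]")
print(non_blank)  # []

# //text() also descends into child tags; strip and filter in Python
all_text = [t.strip() for t in root.xpath("/li/a[2]//text()") if t.strip()]
print(all_text)   # ['这是第三个span标签', '这是一个h3']

# normalize-space() on the element returns one cleaned-up string
joined = root.xpath("normalize-space(/li/a[2])")
print(joined)     # 这是第三个span标签 这是一个h3
```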

posted @ 2019-04-26 12:45 KIV
