Python modules: lxml, a library for parsing HTML/XML and querying it with XPath

Purpose:

　　lxml parses HTML and XML documents and lets you select nodes from them with XPath expressions.

Required setup:

Installation:

pip install lxml
or, on legacy setups (easy_install is deprecated in current setuptools):
easy_install lxml

Importing the package:

from lxml import etree

Viewing the built-in help:

help(etree)

Methods (functions):

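1. Parsing a local (offline) HTML file:

# Parse the file with the forgiving HTML parser; 'xx.html' is a placeholder path
html = etree.parse('xx.html', etree.HTMLParser())
# xpath() returns a list of matches (here, href attribute values)
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)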
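To try the file-based call without a real page on disk, here is a minimal sketch. The sample markup, the `news` id, and the temporary file name are assumptions made for illustration; only `etree.parse`, `etree.HTMLParser`, and `xpath` come from the text above.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample page, written to a temporary file for the demo.
SAMPLE = (
    '<html><body><div id="news">'
    '<h2><a href="https://example.com/a">A</a></h2>'
    '</div></body></html>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)

    # Same call shape as above: a file path plus an HTMLParser instance.
    tree = etree.parse(path, etree.HTMLParser())
    hrefs = tree.xpath('//*[@id="news"]/h2/a/@href')

print(hrefs)
```

`etree.parse` returns a tree object rather than a single element, but it offers the same `xpath` method.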

2. Parsing an online page:

from lxml import etree
import requests

# Fetch the page, then build an element tree from the response body
rep = requests.get('https://www.baidu.com')
html = etree.HTML(rep.text)
aa = html.xpath('//*[@id="s_xmancard_news"]/div/div[2]/div/div[1]/h2/a[1]/@href')
print(aa)
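An XPath copied from the browser may match nothing in the raw response, for example when the page builds that part of the DOM with JavaScript. In that case `xpath` returns an empty list rather than raising an error. The sketch below shows one way to guard against this. It uses an inline string instead of a network request, and the markup and ids are made up for the example.

```python
from lxml import etree

# Stand-in for rep.text; a real response body would go here.
page_source = '<html><body><a id="home" href="/index.html">Home</a></body></html>'
root = etree.HTML(page_source)

found = root.xpath('//a[@id="home"]/@href')
missing = root.xpath('//a[@id="no-such-id"]/@href')

# xpath() always returns a list for node selections, so check before indexing.
first_href = found[0] if found else None
print(first_href, missing)
```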

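XPath rules:

# Build an element tree from the string `text` so it can be queried with XPath; etree repairs malformed HTML automatically
html = etree.HTML(text)

# Parse an HTML file directly from disk
html = etree.parse('./ex.html', etree.HTMLParser())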
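The automatic repair mentioned above is easy to see on a small fragment. This is a sketch with a made-up snippet that omits the closing `</li>` tags and the `<html>`/`<body>` wrapper.

```python
from lxml import etree

# Deliberately malformed: no </li> tags, no <html>/<body> wrapper.
broken = '<ul><li>one<li>two</ul>'
root = etree.HTML(broken)

# The parser wraps the fragment in html/body and closes the li elements.
print(root.tag)
print(etree.tostring(root, encoding="unicode"))

items = root.xpath('//li/text()')
print(items)
```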

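# Select all element nodes
result = html.xpath('//*')

# Select all li nodes
result = html.xpath('//li')

# Select a nodes that are direct children of an li node
result = html.xpath('//li/a')

# Select a nodes that are descendants (at any depth) of an li node
result = html.xpath('//li//a')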

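# Get the class attribute of the parent of every a node whose href is link.html
result = html.xpath('//a[@href="link.html"]/../@class')

# Select all li nodes whose class attribute is exactly "ni"
result = html.xpath('//li[@class="ni"]')

# Get the text directly inside each li node (text inside child elements is not included)
result = html.xpath('//li/text()')

# Get the href attribute of a nodes that are direct children of an li node
result = html.xpath('//li/a/@href')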
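The difference between `/` and `//`, and between `text()` on an `li` and on its children, is easier to see on concrete markup. This is a small sketch; the sample document and the variable names are assumptions made for the demo.

```python
from lxml import etree

# Hypothetical sample document.
DOC = '''
<div>
  <ul>
    <li class="item"><a href="link1.html">first</a></li>
    <li class="item"><span><a href="link2.html">second</a></span></li>
    <li class="ni">plain text</li>
  </ul>
</div>
'''
root = etree.HTML(DOC)

# '/' only looks at direct children, '//' at any depth.
direct = root.xpath('//li/a/@href')
nested = root.xpath('//li//a/@href')

# '..' steps up to the parent element.
parent_class = root.xpath('//a[@href="link1.html"]/../@class')

# text() on li returns only text that sits directly inside the li.
li_text = root.xpath('//li/text()')
a_text = root.xpath('//li//a/text()')

print(direct, nested, parent_class, li_text, a_text)
```

Here only the third `li` has text of its own. The text of the first two lives inside their `a` elements, so `//li/text()` does not see it.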

# When an li's class attribute holds several values, match it with the contains() function
result = html.xpath('//li[contains(@class,"li")]/a/text()')
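Note that `contains()` does plain substring matching, so `"li"` also matches a class such as `list`. The sketch below compares an exact match, `contains()`, and a stricter whole-token match built from the standard XPath 1.0 functions `concat` and `normalize-space`. The markup is made up for the example.

```python
from lxml import etree

# Hypothetical markup: one li with two class tokens, one with a look-alike class.
DOC = (
    '<ul>'
    '<li class="li li-first"><a href="a.html">first</a></li>'
    '<li class="list"><a href="b.html">second</a></li>'
    '</ul>'
)
root = etree.HTML(DOC)

# Exact comparison fails because the attribute value is "li li-first".
exact = root.xpath('//li[@class="li"]/a/text()')

# Substring match: catches "li li-first" but also "list".
loose = root.xpath('//li[contains(@class,"li")]/a/text()')

# Whole-token match: pad with spaces and look for " li ".
token = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(exact, loose, token)
```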

# Match on several attributes at once
result = html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')

# Select by position; the square brackets hold XPath functions (indexing is 1-based)
result = html.xpath('//li[1]/a/text()')
result = html.xpath('//li[last()]/a/text()')
result = html.xpath('//li[position()<3]/a/text()')
result = html.xpath('//li[last()-2]/a/text()')

# Get ancestor nodes
result = html.xpath('//li[1]/ancestor::*')
result = html.xpath('//li[1]/ancestor::div')
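The positional predicates shown earlier (`[1]`, `last()`, `position()<3`, `last()-2`) can be checked against a list of known length. This is a sketch with five generated items; the item names are assumptions made for the demo.

```python
from lxml import etree

# Hypothetical list with five items: item1 .. item5.
DOC = '<ul>' + ''.join(f'<li><a>item{i}</a></li>' for i in range(1, 6)) + '</ul>'
root = etree.HTML(DOC)

first = root.xpath('//li[1]/a/text()')                  # XPath positions start at 1
last = root.xpath('//li[last()]/a/text()')
first_two = root.xpath('//li[position()<3]/a/text()')
third_from_last = root.xpath('//li[last()-2]/a/text()')

print(first, last, first_two, third_from_last)
```

A position is counted among the siblings under each parent, so on a page with several lists `//li[1]` returns the first `li` of every list, not just one node.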

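# Get all attribute values
result = html.xpath('//li[1]/attribute::*')

# Get direct child nodes
result = html.xpath('//li[1]/child::a[@href="link1.html"]')

# Get descendant nodes (here, all span descendants)
result = html.xpath('//li[1]/descendant::span')

# Get the second of all nodes that follow the current node in document order
result = html.xpath('//li[1]/following::*[2]')

# Get all following sibling nodes
result = html.xpath('//li[1]/following-sibling::*')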
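To see what each axis returns, the queries above can be run against a small document. This is a sketch under assumptions: the markup, class names, and variable names are invented for the example, and only the axis expressions come from the list above.

```python
from lxml import etree

# Hypothetical sample document.
DOC = '''
<div id="wrap">
  <ul>
    <li class="item-0" name="item"><a href="link1.html"><span>first</span></a></li>
    <li class="item-1"><a href="link2.html">second</a></li>
    <li class="item-2"><a href="link3.html">third</a></li>
  </ul>
</div>
'''
root = etree.HTML(DOC)

# ancestor: every element that encloses the first li (html and body are added by the parser).
ancestors = [el.tag for el in root.xpath('//li[1]/ancestor::*')]
ancestor_divs = root.xpath('//li[1]/ancestor::div/@id')

# attribute: all attribute values of the first li.
attrs = root.xpath('//li[1]/attribute::*')

# child / descendant: direct children versus any depth.
child_links = root.xpath('//li[1]/child::a[@href="link1.html"]')
span_text = root.xpath('//li[1]/descendant::span/text()')

# following: everything after the first li closes; [2] picks the second such element.
second_following = root.xpath('//li[1]/following::*[2]')

# following-sibling: later nodes that share the same parent.
sibling_classes = [el.get('class') for el in root.xpath('//li[1]/following-sibling::*')]

print(ancestors, ancestor_divs, attrs)
print(len(child_links), span_text)
print(second_following[0].get('href'), sibling_classes)
```

In this document the elements following the first `li` are the second `li`, its `a`, the third `li`, and its `a`, so `following::*[2]` lands on the link to `link2.html`.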

---
Related articles:
Downloading Python installers: https://www.cnblogs.com/wutou/p/17709685.html
Configuring pip mirrors: https://www.cnblogs.com/wutou/p/17531296.html
Installing a specific module version with pip: https://www.cnblogs.com/wutou/p/17716203.html
[Index] Python modules, master table of contents: https://www.cnblogs.com/wutou/p/15610071.html

Reference:

https://zhuanlan.zhihu.com/p/356791704

Posted 2023-02-08 15:27 by 悟透
