lxml

Overview: lxml is a Python library for processing XML and HTML files. It is also commonly used to extract data when scraping web pages.

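Installation:

Run: pip install lxml

On Debian/Ubuntu Linux, the system package can be used instead: sudo apt-get install python3-lxml (python-lxml on older Python 2 systems). apt-get is not available on macOS, so use pip there.

If neither works, older environments can fall back to: easy_install lxml (easy_install is deprecated, so prefer pip where possible)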
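
A quick way to confirm the installation worked is to import the library and print its version. This is a minimal sketch; the variable name lxml_version is only illustrative, and the numbers printed depend on your environment.

```python
# Quick check that lxml is importable after installation.
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact values depend on your install.
lxml_version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in lxml_version))
```

If the import raises ModuleNotFoundError, lxml was probably installed for a different Python interpreter than the one running the script.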

Usage:

First, import the module

# Import the etree module from the lxml library
from lxml import etree

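HTML parsing:

etree.HTML() parses HTML text into an element tree (it returns the root element) and automatically repairs malformed markup

# The example below uses etree.HTML() to repair malformed HTML text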

from lxml import etree

# A piece of incomplete HTML source (no closing </body> or </html>)
htmldemo = '''
<meta charset="UTF-8"> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<html><head><title>yoyo ketang</title></head>
<body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
'''
# Parse with etree.HTML(), which repairs the markup automatically
complete = etree.HTML(htmldemo)
print(etree.tostring(complete, pretty_print=True).decode('utf-8'))
# Printed output
'''
<html>
  <head><meta charset="UTF-8"/> <!-- for HTML5 -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>yoyo ketang</title>
</head>
  <body>
<b><!--Hey, this in comment!--></b>
<p class="title"><b>yoyoketang</b></p>
</body>
</html>
'''
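
Once the text has been parsed, the repaired tree can be queried like any other element tree. The sketch below uses a shorter, made-up fragment (the variable names broken, title, bold and paragraph are illustrative) to show that the missing html and body wrappers are added and that elements can then be looked up with find().

```python
from lxml import etree

# A deliberately incomplete fragment: no <html>/<body> wrapper, unclosed <p>.
broken = '<title>demo</title><p class="title"><b>yoyoketang</b>'

# etree.HTML() returns the root <html> element of the repaired tree.
root = etree.HTML(broken)

# './/tag' searches all descendants, so the lookups do not depend on
# exactly where the parser placed each element.
title = root.find('.//title')
bold = root.find('.//b')
paragraph = root.find('.//p')

print(root.tag)
print(title.text)
print(bold.text)
print(paragraph.get('class'))
```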

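Creating HTML/XML documents

Elements can be created with etree.Element(), and child elements with etree.SubElement()

# First create the root element
root = etree.Element('html', version="5.0")

# Create child elements under the root
etree.SubElement(root, 'head')
etree.SubElement(root, 'title', bgcolor='red', fontsize='22')
etree.SubElement(root, 'body', fontsize='15')

# Create elements under a child: a p tag inside body (root[2]), then an a tag inside that p (root[2][0])
etree.SubElement(root[2], 'p', bgcolor="red")
etree.SubElement(root[2][0], 'a')

# Print the result; pretty_print=True indents the output for readability
print(etree.tostring(root, pretty_print=True).decode('utf-8'))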
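
Because etree.SubElement() returns the element it creates, the references can be kept in variables instead of indexing into the tree with root[2][0]. The following is a minimal sketch of that style; it is a variant of the listing above, not the same tree (here title is nested inside head), and all variable names are illustrative.

```python
from lxml import etree

# Build the tree while keeping a reference to each new element.
root = etree.Element('html', version='5.0')
head = etree.SubElement(root, 'head')
title = etree.SubElement(head, 'title')
title.text = 'demo'
body = etree.SubElement(root, 'body', fontsize='15')
p = etree.SubElement(body, 'p', bgcolor='red')
a = etree.SubElement(p, 'a')

# Serialize to bytes, then decode for printing.
markup = etree.tostring(root, pretty_print=True).decode('utf-8')
print(markup)
```

Holding references this way avoids index arithmetic, which is easy to get wrong once the tree grows.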

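Parsing HTML/XML documents, adding elements and attributes, and reading information

The sections above created elements and added attributes. To parse existing HTML or XML and extract its content instead, pass the markup text (or a file path) to one of the following:

# Parse a string of HTML text; returns the root element
etree.HTML(text)
# Parse a string of XML text; returns the root element
etree.XML(text)
# Parse an HTML/XML file with parse(); returns an ElementTree
etree.parse(filepath, parser)  # file name, parser object
# If parse() raises an error, it may be because no parser object was set
parser = etree.HTMLParser(encoding="utf-8")

html = etree.parse('flower.html', parser=parser)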
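
The sketch below runs both approaches end to end. Since a real flower.html file may not exist on your machine, an in-memory io.BytesIO object stands in for it (etree.parse() accepts file-like objects as well as file names). The sample markup and variable names are made up for illustration.

```python
import io
from lxml import etree

# 1. Parse XML from a string; etree.XML() returns the root element.
xml_text = '<flowers><flower name="rose"/><flower name="tulip"/></flowers>'
xml_root = etree.XML(xml_text)
names = [flower.get('name') for flower in xml_root]

# 2. Parse HTML from a file-like object with an explicit HTML parser.
#    The BytesIO object is a stand-in for a file on disk.
html_bytes = '<html><body><p>café</p></body></html>'.encode('utf-8')
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(io.BytesIO(html_bytes), parser)

# parse() returns an ElementTree; getroot() gives the root element.
html_root = tree.getroot()
p_text = html_root.find('.//p').text

print(names, html_root.tag, p_text)
```

Passing encoding='utf-8' to the parser tells it how to decode the bytes, which matters as soon as the document contains non-ASCII text.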

# Iterate over the children of the root element and print each tag
for t in root:
    # Print the tag name
    print(t.tag)

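# Add an attribute to an element
root.set('newAttribute', 'attributeValue')
# Read an element's attribute values
# Option 1: use get() to read an attribute value
root.get('newAttribute')
root[1].get('bgcolor')
# Option 2: the attrib property exposes the attributes as a dict-like object, indexed by key
attributes = root.attrib
print(attributes['newAttribute'])
# Add text content to elements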
root[0].text = "this is the head"
root[1].text = "this is the title"
root[2][0].text = "this is the subtag p the body"
root[2][0][0].text = "this is the subtag a the p"

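# Check whether a node is an element
etree.iselement(root[0])
# Get an element's parent (None for the root element)
root.getparent()
root[0].getparent()
# Get sibling elements
root.getnext()  # None: the root element has no siblings
root[0].getnext()  # returns the title element, the next sibling
root[1].getprevious()  # returns the head element, the previous sibling
# Find a child element by tag name
root.find('head')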
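
The navigation calls return elements (or None), not tag strings, so it helps to see the results spelled out. The sketch below rebuilds a three-child tree like the one used earlier and records what each call returns; the variable names are illustrative.

```python
from lxml import etree

root = etree.Element('html')
head = etree.SubElement(root, 'head')
title = etree.SubElement(root, 'title')
body = etree.SubElement(root, 'body')

is_element = etree.iselement(root[0])         # True for element nodes
not_element = etree.iselement('head')         # False for a plain string

root_parent = root.getparent()                # None: the root has no parent
head_parent_tag = head.getparent().tag        # tag of head's parent

next_tag = head.getnext().tag                 # sibling after head
previous_tag = title.getprevious().tag        # sibling before title
after_last = body.getnext()                   # None: body is the last child

found = root.find('head')                     # first matching child
not_found = root.find('table')                # None when nothing matches

print(is_element, not_element, root_parent, head_parent_tag)
print(next_tag, previous_tag, after_last, found.tag, not_found)
```

Because find() and the sibling methods return None when nothing is there, check the result before accessing .tag or .text on it.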

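Summary:

This post covered what lxml is for, how to use it to create elements and add attributes when building HTML/XML documents, and how to parse existing HTML/XML documents to extract their content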

Reference:

https://python.freelycode.com/contribution/detail/1532

posted @ 2021-04-27 17:11 _cai
