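The lxml module
lxml is mainly used to extract data from XML documents with XPath, CSS selectors, and similar tools. It handles HTML as well: although HTML is not strictly XML, lxml's lenient HTML parser turns it into the same kind of element tree.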
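- The list returned by the xpath method takes one of three forms:
  - An empty list: no element matched.
  - A list of strings: the XPath expression ends in @attribute or a function such as text(), so each result is a str (the text content or the value of an attribute).
  - A list of _Element objects: the XPath expression matched tags (such as li or span). Each _Element in the list can call xpath again to drill down further.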
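The three cases above can be seen side by side in a small sketch. The markup and variable names here are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
doc = etree.HTML('<ul><li class="a"><a href="x.html">X</a></li></ul>')

no_match = doc.xpath("//span")        # nothing matches
texts = doc.xpath("//li/a/text()")    # text() yields strings
hrefs = doc.xpath("//li/a/@href")     # @attribute yields strings
elements = doc.xpath("//li")          # a tag match yields _Element objects

print(no_match)                        # []
print(texts)                           # ['X']
print(hrefs)                           # ['x.html']
print([el.tag for el in elements])     # ['li']
# An _Element from the result can run a further, relative XPath query.
print(elements[0].xpath("./a/@href"))  # ['x.html']
```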
from lxml import etree
from lxml.etree import _Element as Ele

if __name__ == '__main__':
    text = '''
    <div>
        <ul>
            <li class="item-1">
                <a href="link1.html">first item</a>
            </li>
            <li class="item-1">
                <a href="link2.html">second item</a>
            </li>
            <li class="item-inactive">
                <a href="link3.html">third item</a>
            </li>
            <li class="item-1">
                <a href="link4.html">fourth item</a>
            </li>
            <li class="item-1">
                <a href="link5.html">fifth item</a>
        </ul>
    </div>'''
    node: Ele = etree.HTML(text)
    info = dict()
    # Use xpath to extract a list of the matching <li> elements
    for item in node.xpath("//div/ul/li[@class='item-1']"):  # type: Ele
        if item is not None:
            try:
                # [0] raises IndexError when the xpath result is an empty list
                name = item.xpath("./a/text()")[0]
                href = item.xpath("./a/@href")[0]
                info[name] = href
            except Exception as e:
                print(f"Failed to extract from element {item} ({e!r}); xpath: ./a/text(), tag: {item.tag}, text: {item.text}")
        else:
            print("item is empty")
    print(info)
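Because xpath returns an empty list when nothing matches, indexing with [0] is the step that can fail. One alternative to try/except is a small helper that falls back to a default value. The helper name first_or_default and the sample markup are hypothetical, and the sketch assumes the expression yields a list (not a number or boolean, as count() or similar functions would).

```python
from lxml import etree


def first_or_default(element, path, default=None):
    """Return the first XPath result for `path`, or `default` if the list is empty."""
    results = element.xpath(path)
    return results[0] if results else default


# Hypothetical markup: the second <li> has no <a> child.
doc = etree.HTML('<ul><li><a href="a.html">A</a></li><li>plain</li></ul>')

links = {}
for li in doc.xpath("//li"):
    name = first_or_default(li, "./a/text()")
    href = first_or_default(li, "./a/@href")
    if name is None or href is None:
        continue  # skip items that do not contain a link
    links[name] = href

print(links)  # {'A': 'a.html'}
```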
Using the etree.tostring function in the lxml module
from lxml import etree
html_str = '''
<div>
<ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div> '''
html = etree.HTML(html_str)
# tostring returns bytes by default, so decode it to get a str
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)
Output:
<html><body><div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul> </div> </body></html>
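Conclusions:
- lxml.etree.HTML(html_str) automatically completes missing tags (BeautifulSoup does this too).
- lxml.etree.tostring converts the resulting _Element object back into an HTML string.
- When a crawler extracts data with lxml, write the extraction rules against the output of lxml.etree.tostring rather than the raw source, because the repaired markup is what lxml actually queries.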
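A minimal sketch of why the repaired markup matters when writing XPath expressions. The fragment below is a shortened, hypothetical variant of the example above; encoding="unicode" is an etree.tostring option that returns str directly instead of bytes.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body> wrapper and an unclosed <li>.
raw = '<div><ul><li class="item-0"><a href="link5.html">fifth item</a></ul></div>'
root = etree.HTML(raw)

# encoding="unicode" makes tostring return str instead of bytes.
repaired = etree.tostring(root, encoding="unicode")
print(repaired)

# The parser wrapped the fragment in <html><body> and closed the <li>.
print(root.tag)                             # html
print("</li>" in raw, "</li>" in repaired)  # False True

# An absolute path written against the repaired tree matches.
with_wrappers = root.xpath("/html/body/div/ul/li/a/@href")
# The same path without the added wrappers matches nothing.
without_wrappers = root.xpath("/div/ul/li/a/@href")
print(with_wrappers)     # ['link5.html']
print(without_wrappers)  # []
```

Printing the repaired string once before writing the extraction rules is a cheap way to see the structure lxml is really working with.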
This article is from cnblogs (博客园). Author: 蕝戀. When reposting, please cite the original link: https://www.cnblogs.com/juelian/p/17559500.html