The lxml module

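lxml extracts data from XML documents, mainly through XPath and CSS selectors. HTML is not strictly XML, but etree.HTML parses it into the same kind of element tree, so the same queries work on it.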

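  • The xpath method returns a list, in one of three forms
    • An empty list: no element matched
    • A list of strings: the XPath rule ends in @attribute or a function such as text(), so each result is a str (the text content or the value of an attribute)
    • A list of _Element objects: the XPath rule matched tags (such as li or span); each _Element in the list can call xpath again to drill down further.
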
from lxml import etree
from lxml.etree import _Element as Ele

if __name__ == '__main__':
    text = ''' 
    <div> 
    <ul> 
      <li class="item-1">
        <a href="link1.html">first item</a>
      </li> 
      <li class="item-1">
        <a href="link2.html">second item</a>
      </li> 
      <li class="item-inactive">
        <a href="link3.html">third item</a>
      </li> 
      <li class="item-1">
        <a href="link4.html">fourth item</a>
      </li> 
      <li class="item-1">
        a href="link5.html">fifth item</a>
    </ul> 
    </div>'''

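    # etree.HTML() parses the string and returns the root element of the tree
    node: Ele = etree.HTML(text)

    info = dict()
    # This XPath matches tags, so it returns a list of _Element objects
    for item in node.xpath("//div/ul/li[@class='item-1']"):  # type: Ele
        try:
            name = item.xpath("./a/text()")[0]
            href = item.xpath("./a/@href")[0]
            info[name] = href
        except IndexError:
            # An empty result list means ./a matched nothing in this <li>,
            # as happens for the malformed fifth <li> above
            print(f"Failed to extract from {item} with ./a/text() or ./a/@href; "
                  f"tag: {item.tag}, text: {item.text!r}")
    print(info)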
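
The three return cases listed earlier can be checked one at a time. The following is a minimal sketch on a made-up one-item list; the variable names and the sample markup are illustrative and not part of the original example.

```python
from lxml import etree

# Illustrative markup, not taken from the example above
root = etree.HTML('<ul><li class="item-1"><a href="link1.html">first item</a></li></ul>')

# Case 1: nothing matches, so the result is an empty list
no_match = root.xpath("//li[@class='missing']")

# Case 2: the rule ends in @attribute or text(), so the result is a list of strings
hrefs = root.xpath("//li/a/@href")
texts = root.xpath("//li/a/text()")

# Case 3: the rule matches tags, so the result is a list of _Element objects,
# and each element can run a further relative XPath
items = root.xpath("//li")
first_text = items[0].xpath("./a/text()")

print(no_match, hrefs, texts, [el.tag for el in items], first_text)
```

Because every case yields a list, indexing with [0] is only safe after checking that the list is non-empty, or inside a try block as in the loop above.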

Using the etree.tostring function in the lxml module

from lxml import etree
html_str = ''' 
        <div> 
        <ul> 
          <li class="item-1"><a href="link1.html">first item</a></li> 
          <li class="item-1"><a href="link2.html">second item</a></li> 
          <li class="item-inactive"><a href="link3.html">third item</a></li> 
          <li class="item-1"><a href="link4.html">fourth item</a></li> 
          <li class="item-0"><a href="link5.html">fifth item</a> 
        </ul>
        </div> '''

html = etree.HTML(html_str)

# tostring() returns bytes, so decode it to get a str
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)


Output:
<html><body><div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> 
</li></ul> </div> </body></html>
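
etree.tostring also accepts a few keyword arguments that are handy when inspecting a page. The sketch below assumes a short made-up snippet containing non-ASCII text; it shows that the default call returns bytes with non-ASCII characters escaped as character references, while encoding="unicode" returns a str that keeps them readable.

```python
from lxml import etree

# Illustrative markup: an unclosed <li> plus some non-ASCII text
root = etree.HTML('<ul><li><a href="link5.html">第五项</a></ul>')

# Default call: bytes, non-ASCII characters written as &#...; references
raw = etree.tostring(root)

# encoding="unicode": a str, no .decode() needed, characters kept as-is
as_text = etree.tostring(root, encoding="unicode")

# pretty_print=True adds indentation; method="html" uses HTML serialization rules
pretty = etree.tostring(root, encoding="unicode", pretty_print=True, method="html")

print(type(raw), type(as_text))
print(pretty)
```

For pages with Chinese text, encoding="unicode" is usually the more convenient form to read when comparing the repaired tree with the original source.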

Conclusion

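  • lxml.etree.HTML(html_str) automatically completes missing tags (BeautifulSoup can do this too)
  • lxml.etree.tostring serializes an _Element object back into an HTML string (it returns bytes, hence the .decode() call)
  • A crawler that extracts data with lxml should write its XPath rules against the output of lxml.etree.tostring, because that is the repaired tree lxml actually queries, and it can differ from the raw page source.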
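
The last point can be seen with a tiny fragment. This is a sketch on made-up input: the parser wraps the fragment in html and body tags, so an absolute path written against the raw source finds nothing, while a path written against the repaired tree works.

```python
from lxml import etree

# Illustrative fragment: no <html>, no <body>, and an unclosed <p>
root = etree.HTML("<p>hello")

# The parser added the missing wrapper tags, so the root is <html>
print(root.tag)

# Written against the raw source: <p> is not the document root, nothing matches
absolute_miss = root.xpath("/p/text()")

# Written against the repaired tree (what tostring would show)
absolute_hit = root.xpath("/html/body/p/text()")

print(absolute_miss, absolute_hit)
```

Relative rules such as //p are less sensitive to the added wrappers, but they are still evaluated on the repaired tree, so checking the tostring output first remains the safer habit.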
posted @ 2023-07-17 11:08  蕝戀