lxml.etree.HTML(text): parsing an HTML document
0. References
http://lxml.de/tutorial.html#the-xml-function
There is also a corresponding function HTML() for HTML literals.
>>> root = etree.HTML("<p>data</p>")
>>> etree.tostring(root)
b'<html><body><p>data</p></body></html>'
1. Basic usage
from lxml import etree

# Parses an HTML document from a string constant. Returns the root node.
# r is assumed to be a response object fetched earlier (e.g. with requests).
root = etree.HTML(r.text)
# <Element html at 0x7bb8208>
1.1 Getting text and attributes with xpath and cssselect
In [83]: for item in root.xpath('//button')[:1]:
    ...:     print(item)
    ...:     print(item.text)  # get the text
    ...:     print(item.xpath('./@id'))
    ...:
<Element button at 0x84277c8>
Requests Generator
['btn_requests']

###
In [84]: for item in root.cssselect('button')[:1]:
    ...:     print(item)
    ...:     print(item.text)
    ...:     print(item.cssselect('::attr(id)'))  # pseudo-element syntax is not supported
    ...:
<Element button at 0x84277c8>
Requests Generator
ExpressionError: Pseudo-elements are not supported.

###
In [92]: for item in root.cssselect('button')[:1]:
    ...:     print(item.get('id', ''))  # get an attribute
btn_requests

###
In [93]: for item in root.cssselect('button')[:1]:
    ...:     print(item.xpath('./@id'))  # xpath nested inside a cssselect result
    ...:
['btn_requests']
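The session above depends on a page fetched earlier, so here is a minimal self-contained sketch of the same lookups. The markup string is a hypothetical stand-in for that page, and only xpath and get are used so that the cssselect package is not required.

```python
from lxml import etree

# Hypothetical markup standing in for the page used in the session above.
html = '<html><body><button id="btn_requests">Requests Generator</button></body></html>'
root = etree.HTML(html)

button = root.xpath('//button')[0]

text = button.text                     # text directly inside the element
id_via_get = button.get('id', '')      # attribute as a plain string
id_via_xpath = button.xpath('./@id')   # attribute via XPath: always a list
missing = button.get('class', '')      # absent attribute falls back to the default

print(text, id_via_get, id_via_xpath, repr(missing))
```

Note that get returns a single string (or the default), while an XPath attribute query returns a list, so the two are not interchangeable without indexing.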
1.2 Pretty printing
print(etree.tostring(root, pretty_print=True).decode('utf-8'))  # pretty print

# You can also serialise to a Unicode string without declaration by
# passing the ``unicode`` function as encoding (or ``str`` in Py3),
# or the name 'unicode'. This changes the return value from a byte
# string to an unencoded unicode string.
print(etree.tostring(root, encoding=str, pretty_print=True))      # Python 3: returns text (str)
print(etree.tostring(root, encoding=unicode, pretty_print=True))  # Python 2: returns unicode
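To make the difference in return types concrete, the following small sketch (reusing the `<p>data</p>` literal from the tutorial quoted in section 0) compares the default call with `encoding='unicode'`, the string form of the option mentioned in the comment above.

```python
from lxml import etree

root = etree.HTML('<p>data</p>')

as_bytes = etree.tostring(root)                     # default: a byte string
as_text = etree.tostring(root, encoding='unicode')  # a str, no XML declaration

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

Using the name 'unicode' works on both Python 2 and Python 3, which avoids having to switch between `str` and `unicode` in the code.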
1.3 Automatic completion of missing tags
In [109]: rt = etree.HTML('<html><p>123</p></html>')  # missing <body> is added automatically

In [110]: print(etree.tostring(rt, encoding=str, pretty_print=True))
<html>
  <body>
    <p>123</p>
  </body>
</html>
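The same completion also applies to a bare fragment with an unclosed tag. A minimal sketch, where the input string is made up for illustration:

```python
from lxml import etree

# A bare fragment: no <html>, no <body>, and <p> is never closed.
rt = etree.HTML('<p>456')

root_tag = rt.tag                           # the parser wraps the fragment in <html>
child_tags = [child.tag for child in rt]    # and inserts <body>
paragraph_text = rt.xpath('//p/text()')     # the <p> is closed and its text kept

print(root_tag, child_tags, paragraph_text)
```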
1.4 fromstring does not accept broken fragments and does not auto-complete
In [115]: rt = etree.fromstring('<html><p>456</html>')  # fromstring rejects broken fragments and does not auto-complete
XMLSyntaxError: Opening and ending tag mismatch: p line 1 and html, line 1, column 20

In [116]: rt = etree.fromstring('<html><p>456</p></html>')

In [117]: print(etree.tostring(rt, encoding=str, pretty_print=True))
<html>
  <p>456</p>
</html>
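One possible way to combine the two behaviours is to try the strict XML parser first and fall back to the HTML parser only when it fails. This is a sketch, not something the lxml tutorial prescribes, and `parse_lenient` is a hypothetical helper name.

```python
from lxml import etree


def parse_lenient(markup):
    """Hypothetical helper: strict XML parse first, HTML parser as fallback."""
    try:
        return etree.fromstring(markup)
    except etree.XMLSyntaxError:
        # Broken markup: let the HTML parser repair and complete it.
        return etree.HTML(markup)


strict = parse_lenient('<html><p>456</p></html>')   # well-formed: kept as-is
recovered = parse_lenient('<html><p>456</html>')    # broken: repaired by etree.HTML

print([child.tag for child in strict])
print([child.tag for child in recovered])
```

The two results differ in shape: the strict parse keeps `<p>` directly under `<html>`, while the fallback inserts a `<body>`, so XPath expressions that rely on absolute paths should account for that.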