Parsing an HTML file with lxml and outputting a DataFrame

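The local HTML file contains a table made up of header cell nodes <th> and data cell nodes <td>, each of which sits under a parent row node <tr>.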

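import pandas as pd
from pandas.io.parsers import TextParser
from lxml import etree

# Read the local HTML file; the with-block closes the file handle afterwards
with open("C:/Users/Administrator/Desktop/11/ho_relation_tdd-enm2.html", "r", encoding="utf-8") as f:
    htmlf = f.read()

doc = etree.HTML(htmlf)  # parse the HTML string into an element tree
rows = doc.xpath('.//tr')  # every table row in the document
header = rows[0].xpath(".//th/text()")  # the first row holds the <th> column names
# One list of <td> texts per remaining row (empty cells yield no text node)
data = [i.xpath(".//td/text()") for i in rows[1:]]
# TextParser infers column dtypes; get_chunk() returns the DataFrame
df = TextParser(data, names=header).get_chunk()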
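The same idea can be tried without the local file. The following is only a minimal sketch under some assumptions: SAMPLE_HTML, cell_text and table_to_dataframe are hypothetical names made up for illustration, and the sample table contents are invented. It collects one value per <td> element rather than per text node, so an empty cell stays in its column instead of shifting the rest of the row to the left. It builds the frame with pd.DataFrame instead of TextParser, so every column stays a string and no dtype inference happens.

```python
import pandas as pd
from lxml import etree

# Hypothetical sample markup standing in for the local HTML file
SAMPLE_HTML = """
<table>
  <tr><th>cell</th><th>neighbor</th><th>attempts</th></tr>
  <tr><td>A1</td><td>B7</td><td>120</td></tr>
  <tr><td>A2</td><td></td><td>45</td></tr>
</table>
"""


def cell_text(cell):
    """Join all text inside a cell; an empty cell gives an empty string."""
    return "".join(cell.itertext()).strip()


def table_to_dataframe(html_text):
    """Turn the first row's <th> cells into columns and the other rows into data."""
    doc = etree.HTML(html_text)
    rows = doc.xpath(".//tr")
    header = [cell_text(th) for th in rows[0].xpath("./th")]
    # One entry per <td> element, so empty cells keep their position
    data = [[cell_text(td) for td in row.xpath("./td")] for row in rows[1:]]
    return pd.DataFrame(data, columns=header)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

If numeric columns are needed, they can be converted afterwards, for example with pd.to_numeric on the relevant column.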

 

posted @ 2020-09-06 16:30  岁月饶过谁