Parsing and Extracting Data from Web Pages

Installing the lxml library:

pip install lxml

If the install fails, the package index mirror may be the cause; try specifying a mirror explicitly:

python -m pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
# Tsinghua University open source software mirror
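To confirm that the install worked, a quick check like the following can help. It is only a sketch: it imports `etree` and formats the version tuple that lxml exposes as `etree.LXML_VERSION`; the variable name `version` is chosen here for illustration.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
version = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", version)
```

If the import raises `ModuleNotFoundError`, the package was probably installed into a different Python environment than the one running the script.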

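Example: parsing a web page with lxml in preparation for XPath queries:

# Import the etree module from the lxml library
from lxml import etree

# Declare a block of HTML text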

text="""
<div class="col nav-sub">
<ul id="python">
<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/python/python-tutorial.html">Python2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
</ul>
<ul id="vue">
<li class="cat-item"><a href="/vue3/vue3-tutorial.html">Vue3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/vue/vue-tutorial.html">vue2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a>
</ul>
<ul id="bootstrap">
<li class="cat-item"><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap4/bootstrap4-tutorial.html">Bootstrap4 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap5/bootstrap5-tutorial.html">Bootstrap5 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap/bootstrap-v2-tutorial.html">Bootstrap2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a>
</ul>
</div>
"""
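html = etree.HTML(text)  # Parse the text with etree.HTML; two li nodes in text are unclosed, and the parser repairs the HTML automatically
result = etree.tostring(html)  # Serialize the repaired HTML with tostring (returns bytes)
e = result.decode("utf-8")  # Convert the bytes to str with decode

# Write the result to a text file
with open("exe.txt", "wb") as fs:
    fs.write(result)
    print("Finished writing the txt file!")

# Write the result to an HTML file
with open("exe.html", "wb") as f:
    f.write(result)
    print("Finished writing the html file!")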
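The listing above repairs and serializes the HTML but does not run an XPath query itself. The sketch below shows what such queries could look like on a tree built by `etree.HTML`. The `snippet` string is a shortened, hypothetical fragment modelled on the text above (link text translated, icons dropped, second `li` left unclosed on purpose), and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical, shortened fragment; the second <li> is deliberately unclosed.
snippet = """
<ul id="python">
<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 tutorial</a></li>
<li class="cat-item"><a href="/python/python-tutorial.html">Python2 tutorial</a>
</ul>
"""

root = etree.HTML(snippet)  # the parser closes the dangling <li>

# Select the href attribute of every link inside the <ul id="python"> list.
hrefs = root.xpath('//ul[@id="python"]/li/a/@href')

# Select the text nodes directly inside those links.
titles = root.xpath('//ul[@id="python"]/li/a/text()')

# Count the <li> elements to confirm the repair produced two items.
li_count = len(root.xpath('//li'))

print(hrefs)   # ['/python3/python3-tutorial.html', '/python/python-tutorial.html']
print(titles)
print(li_count)
```

`xpath()` returns a list in every case here: strings for `@href` and `text()`, elements for `//li`.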

If the text above is saved as a.html, the file can be parsed directly:

from lxml import etree

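html = etree.parse("./a.html", etree.HTMLParser())  # Parse the file with the HTML parser
result = etree.tostring(html)  # Serialize the tree to bytes
e = result.decode("utf-8")  # Convert the bytes to str
with open("a.txt", "wb") as fs:
    fs.write(result)
    print("Finished writing the txt file!")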

The result is as follows:


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div class="col nav-sub">&#13;
	<ul id="python">&#13;
		<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
		<li class="cat-item"><a href="/python/python-tutorial.html">Python2 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
	</ul>&#13;
	<ul id="vue">&#13;
		<li class="cat-item"><a href="/vue3/vue3-tutorial.html">Vue3 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
		<li class="cat-item"><a href="/vue/vue-tutorial.html">vue2 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
	</ul>&#13;
	<ul id="bootstrap">&#13;
		<li class="cat-item"><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap3 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
		<li class="cat-item"><a href="/bootstrap4/bootstrap4-tutorial.html">Bootstrap4 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
		<li class="cat-item"><a href="/bootstrap5/bootstrap5-tutorial.html">Bootstrap5 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
		<li class="cat-item"><a href="/bootstrap/bootstrap-v2-tutorial.html">Bootstrap2 &#230;&#149;&#153;&#231;&#168;&#139; <i class="fa fa-external-link" aria-hidden="true"/></a></li>&#13;
	</ul>&#13;
</div></body></html>
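In this output, each 教程 appears as six character references. `&#230;&#149;&#153;` are the three UTF-8 bytes of 教 (0xE6 0x95 0x99), each treated as a separate character, which suggests the parser decoded the file with a single-byte encoding, probably because a.html declares no charset. `&#13;` is a carriage return, presumably from Windows line endings in the saved file, and the self-closing `<i .../>` tags come from `tostring` serializing as XML by default. The following is a minimal sketch of one way to avoid these effects, assuming the file is UTF-8. `raw` is a hypothetical stand-in for the contents of a.html, wrapped in `io.BytesIO` so the example runs without a file on disk.

```python
import io

from lxml import etree

# Hypothetical stand-in for a.html: UTF-8 bytes with no charset declaration.
raw = (
    '<ul id="python"><li class="cat-item">'
    '<a href="/python3/python3-tutorial.html">Python3 教程 '
    '<i class="fa fa-external-link" aria-hidden="true"></i></a></ul>'
).encode("utf-8")

# State the encoding explicitly instead of leaving the parser to guess.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(io.BytesIO(raw), parser)

# encoding="unicode" returns a str with the characters intact;
# method="html" writes empty elements as <i ...></i> rather than <i .../>.
fixed = etree.tostring(tree, encoding="unicode", method="html")

# The link text is now readable when queried with XPath.
link_text = tree.xpath("//a/text()")[0].strip()

print(fixed)
print(link_text)
```

With a real file, the same parser can be passed as `etree.parse("./a.html", parser)`.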

 

Posted on 2024-09-23 20:21 by 崇山主人