Parsing and Extracting Web Page Data
Installing the lxml library:
pip install lxml
If the command fails, the package index may be the cause; installing from a mirror usually helps:
python -m pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
# Tsinghua University open-source software mirror
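To confirm that the installation succeeded, a quick check along these lines can be run. It is only a sketch: it imports etree and prints the version tuples that lxml.etree exposes as LXML_VERSION and LIBXML_VERSION. The variable names are illustrative.

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are tuples of integers
lxml_version = etree.LXML_VERSION
libxml_version = etree.LIBXML_VERSION

print("lxml:", ".".join(str(n) for n in lxml_version))
print("libxml2:", ".".join(str(n) for n in libxml_version))
```

If the import raises ModuleNotFoundError, pip most likely installed lxml into a different interpreter than the one running the script.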
Example of parsing a web page with lxml and XPath:
# Import the etree module from the lxml library
from lxml import etree
# Declare a block of HTML text
text="""
<div class="col nav-sub">
<ul id="python">
<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/python/python-tutorial.html">Python2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
</ul>
<ul id="vue">
<li class="cat-item"><a href="/vue3/vue3-tutorial.html">Vue3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/vue/vue-tutorial.html">vue2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a>
</ul>
<ul id="bootstrap">
<li class="cat-item"><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap3 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap4/bootstrap4-tutorial.html">Bootstrap4 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap5/bootstrap5-tutorial.html">Bootstrap5 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a></li>
<li class="cat-item"><a href="/bootstrap/bootstrap-v2-tutorial.html">Bootstrap2 教程 <i class="fa fa-external-link" aria-hidden="true"></i></a>
</ul>
</div>
"""
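# etree.HTML() parses the string into an element tree; two <li> nodes in text are
# left unclosed, and the parser repairs them automatically
html = etree.HTML(text)
# tostring() serializes the repaired tree to bytes; encoding="utf-8" keeps the
# non-ASCII characters instead of escaping them as character references
result = etree.tostring(html, encoding="utf-8")
# decode() converts the bytes to str
e = result.decode("utf-8")
# Write the result to a text file
with open("exe.txt", "wb") as fs:
    fs.write(result)
print("Finished writing txt!")
# Write the result to an HTML file
with open("exe.html", "wb") as f:
    f.write(result)
print("Finished writing html!")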
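The example above only parses and re-serializes the markup. A minimal sketch of the XPath step itself might look like the following. It uses a shortened copy of the sample markup with English link text, and the names snippet, tree, python_links, titles and counts are illustrative rather than taken from the original.

```python
from lxml import etree

# Shortened copy of the sample markup; the second <li> is left unclosed on purpose
snippet = """
<div class="col nav-sub">
<ul id="python">
<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 tutorial</a></li>
<li class="cat-item"><a href="/python/python-tutorial.html">Python2 tutorial</a>
</ul>
<ul id="vue">
<li class="cat-item"><a href="/vue3/vue3-tutorial.html">Vue3 tutorial</a></li>
</ul>
</div>
"""

tree = etree.HTML(snippet)

# href of every link under the <ul> whose id is "python"
python_links = tree.xpath('//ul[@id="python"]/li/a/@href')

# Text of every link inside an <li class="cat-item">
titles = [t.strip() for t in tree.xpath('//li[@class="cat-item"]/a/text()')]

# Number of <li> children for each <ul>, keyed by the list's id
counts = {ul.get("id"): len(ul.xpath("./li")) for ul in tree.xpath("//ul")}

print(python_links)  # ['/python3/python3-tutorial.html', '/python/python-tutorial.html']
print(titles)
print(counts)        # {'python': 2, 'vue': 1}
```

xpath() returns a list in each of these cases: strings for attribute and text() queries, elements for node queries. That is why the element results can be queried again with a relative path such as ./li.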
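If the text above is saved as a.html, it can be parsed directly from the file:
from lxml import etree
# Parse the file with the HTML parser; encoding="utf-8" assumes a.html was saved as UTF-8
html = etree.parse("./a.html", etree.HTMLParser(encoding="utf-8"))
# Serialize to bytes, keeping non-ASCII characters as UTF-8
result = etree.tostring(html, encoding="utf-8")
e = result.decode("utf-8")
with open("a.txt", "wb") as fs:
    fs.write(result)
print("Finished writing txt!")
The result is as follows: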
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div class="col nav-sub">
<ul id="python">
<li class="cat-item"><a href="/python3/python3-tutorial.html">Python3 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
<li class="cat-item"><a href="/python/python-tutorial.html">Python2 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
</ul>
<ul id="vue">
<li class="cat-item"><a href="/vue3/vue3-tutorial.html">Vue3 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
<li class="cat-item"><a href="/vue/vue-tutorial.html">vue2 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
</ul>
<ul id="bootstrap">
<li class="cat-item"><a href="/bootstrap/bootstrap-tutorial.html">Bootstrap3 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
<li class="cat-item"><a href="/bootstrap4/bootstrap4-tutorial.html">Bootstrap4 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
<li class="cat-item"><a href="/bootstrap5/bootstrap5-tutorial.html">Bootstrap5 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
<li class="cat-item"><a href="/bootstrap/bootstrap-v2-tutorial.html">Bootstrap2 教程 <i class="fa fa-external-link" aria-hidden="true"/></a></li>
</ul>
</div></body></html>
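In the output above, the missing </li> tags have been added and the empty <i> elements appear in self-closing form. The self-closing form comes from tostring() serializing with XML rules by default. The sketch below compares the default with method="html" on a one-item fragment. The fragment and the variable names are illustrative.

```python
from lxml import etree

# One list item with an unclosed <li>, modeled on the sample markup
fragment = (
    '<ul id="vue"><li class="cat-item">'
    '<a href="/vue/vue-tutorial.html">vue2 教程 <i class="fa"></i></a></ul>'
)
root = etree.HTML(fragment)

# Default: XML serialization rules, ASCII output with character references
as_xml = etree.tostring(root).decode("ascii")

# HTML serialization rules, UTF-8 output
as_html = etree.tostring(root, method="html", encoding="utf-8").decode("utf-8")

# encoding="unicode" returns a str directly, so no decode() is needed
as_text = etree.tostring(root, method="html", encoding="unicode")

print(as_xml)
print(as_html)
```

In both variants the parser has already closed the <li>. The two outputs differ only in how the empty <i> element and the non-ASCII characters are written.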