Converting HTML-formatted text to plain text by stripping the HTML tags
Using lxml
import lxml.etree
import lxml.html
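# The HTML can be read from a file (the path here is only an example) ...
with open('/tmp/hzh/a.html', 'r') as file:
    data = file.read()
# ... or given inline. The rest of this example uses the inline string.
html_str = '<p>hzh。<div>ddiivv</div></p> \n <p> l1</p>'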
root = lxml.html.fromstring(html_str)
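# Optionally remove nodes that browsers do not render: comments, JavaScript and the
# HEAD section. Append any other tag names you do not want to the argument list.
# with_tail=False keeps the text that follows a removed node; the default
# (with_tail=True) would delete that text as well.
lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head", with_tail=False)
# Serialize with method="text" to drop all remaining tags and get the plain text.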
result_str = lxml.html.tostring(root, method="text", encoding='unicode')
print(result_str)
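The steps above can be wrapped in a small reusable function. This is only a minimal sketch: the name `html_to_text` and the sample markup are hypothetical, and adding "style" to the list of removed tags is an assumption that goes beyond the original snippet.

```python
import lxml.etree
import lxml.html


def html_to_text(html_str):
    """Return the plain text of an HTML string (hypothetical helper)."""
    root = lxml.html.fromstring(html_str)
    # Drop comments and non-rendered elements, but keep the text that follows
    # them (with_tail=False), so surrounding content is not lost.
    lxml.etree.strip_elements(
        root, lxml.etree.Comment, "script", "style", "head", with_tail=False
    )
    # method="text" serializes only the text nodes, without any tags.
    return lxml.html.tostring(root, method="text", encoding="unicode")


# Hypothetical sample markup for illustration.
sample = '<div><p>a<!-- c -->b</p><script>x()</script><p>c</p></div>'
plain = html_to_text(sample)
print(plain)
```

Whitespace from the source markup is kept as-is, so the caller may still want to trim or collapse it.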
For finer-grained control, you can use:
html_str = '<p>hzh。<div>ddiivv</div></p> \n <p> l1</p>'
root = lxml.html.fromstring(html_str)
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode'))
# Remove <div>ddiivv</div>; strip_elements also removes the content inside the tag
lxml.etree.strip_elements(root, 'div', with_tail=False) # remaining text is: hzh。 \n l1
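# Parse again, because strip_elements modified the previous tree in place
root = lxml.html.fromstring(html_str)
# Remove the div tag but keep the content inside it
lxml.etree.strip_tags(root, 'div')
# The outermost div is there because the input has two sibling p tags at the top level,
# so lxml creates a new div to act as their root; the parsed result must have a single root element.
print(lxml.etree.tostring(root, pretty_print=True, encoding='unicode')) # result is: <div><p>hzh。</p>ddiivv \n <p> l1</p></div>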
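The difference between strip_tags and strip_elements, and the effect of with_tail, is easier to see side by side. The following is a minimal sketch; `SAMPLE` and the helper `text_after` are hypothetical names, and the markup is chosen to be well-formed so that the parser does not have to repair it.

```python
import lxml.etree
import lxml.html

# Hypothetical sample markup: a <b> element with text before and after it.
SAMPLE = '<div><p>keep <b>bold</b> tail</p></div>'


def text_after(strip, **kwargs):
    """Parse SAMPLE, apply the given strip function to <b>, return the text."""
    root = lxml.html.fromstring(SAMPLE)
    strip(root, 'b', **kwargs)
    return root.text_content()


# strip_tags removes only the tag; its content and tail text stay.
kept_content = text_after(lxml.etree.strip_tags)
# strip_elements removes the tag and its content; with_tail=False keeps the tail.
dropped_keep_tail = text_after(lxml.etree.strip_elements, with_tail=False)
# By default (with_tail=True) the text following the element is removed too.
dropped_with_tail = text_after(lxml.etree.strip_elements)

print(repr(kept_content))       # 'keep bold tail'
print(repr(dropped_keep_tail))  # 'keep  tail'
print(repr(dropped_with_tail))  # 'keep '
```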
Using XPath's string() function
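A minimal sketch of this approach, assuming the same lxml setup as above; the sample markup and variable names are made up for illustration.

```python
import lxml.html

# Hypothetical sample markup for illustration.
html_str = '<div><p>Hello <b>world</b></p><p>!</p></div>'
root = lxml.html.fromstring(html_str)

# string() with no argument returns the concatenated text of the context
# node and all of its descendants.
whole_text = root.xpath('string()')

# string(expr) uses only the first node matched by expr (here the first <p>).
first_paragraph = root.xpath('string(.//p)')

# normalize-space() also trims the text and collapses runs of whitespace.
messy = lxml.html.fromstring('<p>  a \n  b </p>')
normalized = messy.xpath('normalize-space()')

print(whole_text)       # Hello world!
print(first_paragraph)  # Hello world
print(normalized)       # a b
```

Comments are not part of the string value, but the text inside script and style elements is, so those elements may still need to be stripped first.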
With WebDriver it is even simpler
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
# Initialize the WebDriver (Chrome is used here; other browsers such as Firefox work too)
chrome_options = Options()
chrome_options.add_argument("start-maximized")
chrome_service = ChromeService(executable_path='/home/hzh/disk2/dl/chromedriver')
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
# Open the URL
driver.get("https://www.cnblogs.com/welhzh/p/17272452.html")
# Locate the element
element = driver.find_element(By.ID, "cnblogs_post_body")
# Get its plain-text content (the text as rendered, without tags)
text_content = element.text
print(text_content)
# Close the browser
driver.quit()