1.13 BeautifulSoup: stripping HTML script tags; removing tags with a specified class
Stripping script tags
Method 1:
[s.extract() for s in soup("script")]
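The one-liner above builds a list only for its side effect. A minimal sketch of the same idea as a plain loop follows; the sample HTML and the use of the built-in "html.parser" (instead of lxml) are assumptions made so the example runs without extra dependencies. extract() returns the removed tag, while decompose() destroys it, so decompose() fits when the scripts are not needed afterwards.

```python
from bs4 import BeautifulSoup

# Sample document (hypothetical, for illustration only)
html = (
    "<html><head><script>var a = 1;</script></head>"
    "<body><p>Hello</p><script>alert(1)</script></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# soup("script") is shorthand for soup.find_all("script")
for script in soup.find_all("script"):
    script.decompose()  # remove the tag and its contents from the tree

remaining = soup.find_all("script")
body_text = soup.body.get_text()
```

After the loop, no script tags remain and the body text contains only the visible content.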
Method 2:
from bs4 import BeautifulSoup

def H5_filter(self):
    '''
    Filter the crawled H5 page.
    :return: the <body> tag with scripts removed, the body text, and the title
    '''
    page = self.crawl_succ_page()
    soup = BeautifulSoup(page, 'lxml')
    # Get the article title text
    title = soup.select('.rich_media_title')[0].get_text()
    # Remove every <script> tag from the tree
    for tag in soup.find_all('script'):
        tag.decompose()
    filter_script_body = soup.find('body')  # keep only the body
    article = filter_script_body.text
    return filter_script_body, article, title
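Method 2 depends on a crawler object, so it cannot be run on its own. Here is a rough standalone sketch of the same steps, assuming the HTML is passed in as a string. The function name filter_h5, the sample markup, and the "html.parser" backend are hypothetical choices for illustration, and the checks for a missing title or body are an addition that the original method does not have.

```python
from bs4 import BeautifulSoup

def filter_h5(page):
    """Return (body tag, body text, title) with all <script> tags removed."""
    soup = BeautifulSoup(page, "html.parser")

    # select_one returns None instead of raising when nothing matches
    title_tag = soup.select_one(".rich_media_title")
    title = title_tag.get_text(strip=True) if title_tag else None

    # Remove every <script> tag before reading the text
    for script in soup.find_all("script"):
        script.decompose()

    body = soup.find("body")
    article = body.get_text() if body else ""
    return body, article, title

# Hypothetical sample page
sample = (
    '<html><body><h1 class="rich_media_title"> Demo </h1>'
    "<script>track()</script><p>Text</p></body></html>"
)
body, article, title = filter_h5(sample)
```

Removing the scripts before calling get_text() matters, because otherwise the JavaScript source would end up in the extracted article text.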
Removing tags with a specified class
for span in soup.find_all('span', {'class': 'weapp_display_element js_weapp_display_element'}):  # remove tags with the specified class
    span.decompose()
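As far as the Beautiful Soup documentation describes it, passing a string that contains several class names matches the class attribute as an exact string, so the same classes in a different order would not be found. A possible alternative is a CSS selector, which matches the classes regardless of order. The sketch below uses made-up sample HTML and the built-in "html.parser"; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the same two classes appear in both orders
html = (
    '<div>'
    '<span class="weapp_display_element js_weapp_display_element">a</span>'
    '<span class="js_weapp_display_element weapp_display_element">b</span>'
    '<span class="other">c</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# The selector requires both classes on the span, in any order
for span in soup.select("span.weapp_display_element.js_weapp_display_element"):
    span.decompose()

remaining_text = soup.get_text()
```

Only the span that lacks the two classes is left in the tree.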
To delete a div with a specific id, you can use decompose() in the same way, for example:
soup.find('div', id="main-content").decompose()
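Note that find() returns None when nothing matches, so the chained call above raises AttributeError if the div is absent. A small sketch of a guarded version follows; the sample HTML and the id "no-such-id" are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document
html = '<body><div id="main-content">x</div><div id="footer">y</div></body>'
soup = BeautifulSoup(html, "html.parser")

# Remove the div only if it exists
target = soup.find("div", id="main-content")
if target is not None:
    target.decompose()

# A lookup that finds nothing returns None; the guard skips decompose()
missing = soup.find("div", id="no-such-id")
if missing is not None:
    missing.decompose()

result = str(soup)
```

The guard keeps a scraper running when a page happens to lack the expected element.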