[Translation] Extracting text from a webpage using BeautifulSoup and Python

If you spend any time scraping the web, one task you're likely to run into is pulling the visible text content out of HTML.
If you're working in Python, we can use BeautifulSoup to accomplish this.

Setting up the extraction

First, we need to get some HTML. I'll be using Troy Hunt's recent blog post about the "Collection #1" data breach.
Here's how you download the HTML:

import requests
url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
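
Before parsing, it's worth confirming that the request actually succeeded. A minimal check (my addition, not part of the original post), using requests' built-in helper:

res.raise_for_status()  # raises requests.HTTPError if the server returned a 4xx/5xx status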

Now we have the HTML... but there's going to be a lot of clutter in there. How do we extract the information we actually want?

Creating the beautiful soup

We'll use Beautiful Soup to parse the HTML, like so:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'html.parser')
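
If you want to sanity-check what the parser produced, you can pretty-print a slice of the parse tree (optional, and not part of the original post):

print(soup.prettify()[:500])  # show the first few hundred characters of the re-indented HTML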

Finding the text

BeautifulSoup provides a simple way to find the text content (i.e. the non-HTML) of the document:

text = soup.find_all(text=True)

However, this will give us some information we don't want.
Look at the output of the following statement:

set([t.parent.name for t in text])
# {'label', 'h4', 'ol', '[document]', 'a', 'h1', 'noscript', 'span', 'header', 'ul', 'html', 'section', 'article', 'em', 'meta', 'title', 'body', 'aside', 'footer', 'div', 'form', 'nav', 'p', 'head', 'link', 'strong', 'h6', 'br', 'li', 'h3', 'h5', 'input', 'blockquote', 'main', 'script', 'figure'}

Here are a few items we probably don't want:

  • [document]
  • noscript
  • header
  • html
  • meta
  • head
  • input
  • script
For the others, you should check them and see which ones you actually want.
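
One way to check is to tally how many text nodes each parent tag contributes; a small sketch (my addition) building on the text list from above:

from collections import Counter

# Count text nodes per parent tag name to see which tags dominate the output.
parent_counts = Counter(t.parent.name for t in text)
print(parent_counts.most_common(10))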

Extracting the valuable text

Now that we can see which elements hold the valuable text, we can build our output:

output = ''
blacklist = [
	'[document]',
	'noscript',
	'header',
	'html',
	'meta',
	'head', 
	'input',
	'script',
	# there may be more elements you don't want, such as "style", etc.
]

for t in text:
	if t.parent.name not in blacklist:
		output += '{} '.format(t)

The full script

Finally, here is the full Python script for getting text from a webpage:

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
	'[document]',
	'noscript',
	'header',
	'html',
	'meta',
	'head', 
	'input',
	'script',
	# there may be more elements you don't want, such as "style", etc.
]

for t in text:
	if t.parent.name not in blacklist:
		output += '{} '.format(t)

print(output)

Improvements

If you look at output now, you'll see that we've picked up some things we don't want.

There's some text from the page header:

Home \n \n \n Workshops \n \n \n Speaking \n \n \n Media \n \n \n About \n \n \n Contact \n \n \n Sponsor \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n   \n \n \n \n Sponsored by:

And there's some text from the footer:

\n \n \n \n \n \n Weekly Update 122 \n \n \n \n \n Weekly Update 121 \n \n \n \n \n \n \n \n Subscribe  \n \n \n \n \n \n \n \n \n \n Subscribe Now! \n \n \n \n \r\n            Send new blog posts: \n   daily \n   weekly \n \n \n \n Hey, just quickly confirm you\'re not a robot: \n  Submitting... \n Got it! Check your email, click the confirmation link I just sent you and we\'re done. \n \n \n \n \n \n \n \n Copyright 2019, Troy Hunt \n This work is licensed under a  Creative Commons Attribution 4.0 International License . In other words, share generously but provide attribution. \n \n \n Disclaimer \n Opinions expressed here are my own and may not reflect those of people I work with, my mates, my wife, the kids etc. Unless I\'m quoting someone, they\'re just my own views. \n \n \n Published with Ghost \n This site runs entirely on  Ghost  and is made possible thanks to their kind support. Read more about  why I chose to use Ghost . \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n   \n \n \n \n \n '
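
Even the text we do want is littered with newlines and stray whitespace. A quick way to tidy that up (my addition, not in the original post) is to collapse runs of whitespace:

output = ' '.join(output.split())  # collapse newlines and repeated spaces into single spaces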

If you're only extracting text from a single site, you can inspect the HTML and work out a way to parse just the valuable content out of the page.
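
For example, the parent-tag set we printed earlier includes article, so on this page the post body most likely lives inside an <article> element. A rough sketch along those lines (the tag choice is an assumption; inspect the page you're scraping to confirm):

article = soup.find('article')
if article is not None:
	# get_text() concatenates all the text inside the element;
	# separator and strip tidy up the whitespace between pieces.
	print(article.get_text(separator=' ', strip=True))
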
Unfortunately, the internet is a messy place, and you'll have a hard time finding consensus on HTML semantics.
Good luck!

Original article: https://matix.io/extract-text-from-webpage-using-beautifulsoup-and-python/
