bs4

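A convenient and efficient library for parsing web pages, with support for several parsers.
The two most commonly used are html.parser from the Python standard library and the lxml parser.

# Python standard library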
BeautifulSoup(html, 'html.parser')

# lxml
BeautifulSoup(html, 'lxml')

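The built-in html.parser runs at moderate speed, and in older Python versions it tolerates malformed documents poorly.
The lxml parser is fast, but it depends on a C library that has to be installed.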
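
Because lxml is an optional third-party dependency, a script can prefer it and fall back to the standard-library parser when it is missing. The following is a minimal sketch of that idea, not a recommended pattern from the library itself. `make_soup` is a hypothetical helper name, and the sketch assumes BeautifulSoup raises `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml if it is installed, else with html.parser.

    make_soup is a hypothetical helper for illustration, not a bs4 API.
    """
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not available; the standard-library parser needs no extras.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.get_text())
```

Note that the two parsers can build slightly different trees from the same broken markup, so a fallback like this is only safe when the code does not depend on those differences.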

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')

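soup.prettify()  # returns the parsed tree as an indented string; missing closing tags were already filled in by the parser when the soup was built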
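
To make the repair step concrete, here is a small sketch using a deliberately broken fragment. The fragment is made up for illustration, and html.parser is used so that nothing beyond bs4 is required. The closing tags appear as soon as the tree is serialized; prettify() only changes the layout.

```python
from bs4 import BeautifulSoup

broken = "<p>Hello<b>world"            # neither <b> nor <p> is closed
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)                   # serialized tree, tags now closed
pretty = soup.prettify()               # same tree, one node per line, indented

print(repaired)
print(pretty)
```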

Selectors

Standard selectors
find_all(name, attrs, recursive, string, limit, **kwargs)

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")   # a string passed as the second positional argument is matched against the CSS class: <p> tags whose class is "title"
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all('div', class_='top')
# Note: class is a reserved keyword in Python, so the keyword argument is written class_ with a trailing underscore; writing class= is a syntax error.
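
The class filter can be written in more than one way. The sketch below uses a made-up three-div document (an assumption, not from the original post) to show the keyword form, the attrs dictionary form, and the positional form side by side. A tag matches if the value is any one of its CSS classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<div class="top">first</div>
<div class="top banner">second</div>
<div class="bottom">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", class_="top")            # class_ keyword
by_attrs = soup.find_all("div", attrs={"class": "top"})    # attrs dictionary
by_position = soup.find_all("div", "top")                  # second positional argument

texts = [div.get_text() for div in by_keyword]
print(texts)
```

The attrs dictionary is also the form to reach for when an attribute name is not a valid Python identifier, such as data-id.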

soup.find_all("a", limit=2)
# [<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>,
# <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>]

find(name, attrs, recursive, string, **kwargs)
Differences from find_all

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

find_all returns a list, and an empty list when nothing matches.
find returns the first match directly, and None when nothing matches.
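
The difference matters most when the target is missing: calling a method on the None that find returns raises AttributeError, while looping over an empty list is harmless. A minimal sketch, using a trimmed-down document that is assumed here for illustration:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_list = soup.find_all("title", limit=1)   # list holding one Tag
first_tag = soup.find("title")                 # the Tag itself

missing_list = soup.find_all("table")          # empty list
missing_tag = soup.find("table")               # None

# Guard against None before using the result of find().
table_text = missing_tag.get_text() if missing_tag is not None else ""
print(first_tag, missing_list, missing_tag)
```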

CSS selectors

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
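
The snippets above assume a soup built from the familiar "three sisters" sample document. The sketch below uses an abbreviated version of it (the exact markup is an assumption) so that the descendant and child combinators can be run and compared. It also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Abbreviated sample document, assumed for illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("body a")    # any <a> nested anywhere inside <body>
children = soup.select("body > a")     # only <a> that are direct children of <body>
by_class = soup.select(".sister")      # every tag with class "sister"
first = soup.select_one("#link2")      # first match, or None if there is none

ids = [a["id"] for a in descendants]
print(ids, children, first["href"])
```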

Extracting tag content

from bs4 import BeautifulSoup

html = ('<a href="http://www.baidu.com/">百度</a>'
        '<a href="http://www.163.com/">网易</a>'
        '<a href="http://www.sina.com/">新浪</a>')
links = BeautifulSoup(html, 'html.parser').find_all('a')

for i in links:
    print(i.get_text())   # get_text() returns the text inside the tag
    print(i.get('href'))  # get('attr') returns the value of an attribute
    print(i['href'])      # subscript shorthand, same result

Output:

百度
http://www.baidu.com/
http://www.baidu.com/
网易
http://www.163.com/
http://www.163.com/
新浪
http://www.sina.com/
http://www.sina.com/
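
The two ways of reading an attribute give the same result when the attribute exists, but they behave differently when it does not. A short sketch, with a title attribute chosen only because the sample link does not have one:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://www.baidu.com/">百度</a>', "html.parser")
link = soup.find("a")

href = link.get("href")                    # attribute present: its value
title = link.get("title")                  # attribute absent: None
title_or_default = link.get("title", "")   # absent, with a fallback value

try:
    link["title"]                          # subscript on a missing attribute
    raised = False
except KeyError:
    raised = True

print(href, title, raised)
```

When scraping pages whose markup is not fully under your control, get() is usually the safer choice because a missing attribute does not interrupt the loop.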

Posted 2019-09-08 09:21 by π=3.1415926
