beautifulsoup4

Official documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

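Parameters

import requests  # HTTP client used to fetch the page
from bs4 import BeautifulSoup  # parser for the fetched HTML

response = requests.get("https://www.autohome.com.cn/all/1/#liststart")  # send a GET request
response.encoding = 'gbk'  # set the encoding before reading response.text
all_soup = BeautifulSoup(response.text, "html.parser")  # parse with the built-in html.parser
ul_obj = all_soup.find(name="ul", attrs={"class": 'article'})  # first <ul class="article">, or None if absent

print(response.encoding)  # encoding used to decode response.text
print(response.status_code)  # response status code, e.g. 200
print(response.request)  # <PreparedRequest [GET]>, the request that was sent
print(response.url)  # URL of the response
print(response.text)  # page body decoded to str
print(requests.session())  # creates and prints a new Session object (it does not read session data from the response)
print(requests.cookies)  # the requests.cookies module itself: <module 'requests.cookies' from 'E:\\Python\\Python368\\lib\\site-packages\\requests\\cookies.py'>

print(response.cookies)  # <RequestsCookieJar[]>, the cookies sent back by the server
print(response.content)  # raw body as bytes, for binary data such as mp4, mp3, jpg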
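
Because find() returns None when nothing matches, it helps to keep the parsing step in its own function and guard against a missing tag. The following is a minimal sketch: the function name article_titles and the sample markup are hypothetical, and the real page structure may differ.

```python
from bs4 import BeautifulSoup


def article_titles(html):
    """Return the text of every <li> inside the first <ul class="article">."""
    soup = BeautifulSoup(html, "html.parser")
    ul_obj = soup.find(name="ul", attrs={"class": "article"})
    if ul_obj is None:  # find() returns None when no tag matches
        return []
    return [li.get_text(strip=True) for li in ul_obj.find_all("li")]


# Hypothetical markup standing in for response.text
sample_html = '<ul class="article"><li> First </li><li>Second</li></ul>'
titles = article_titles(sample_html)
missing = article_titles("<p>no list here</p>")
print(titles, missing)
```

Parsing from a string rather than from a live response also makes the function easy to test without network access.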

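Searching

Beautiful Soup is mainly used to walk the nodes of a parsed document and read their attributes. It offers many navigation methods, such as getting child, parent, and sibling nodes, but in practice these are used less often. What we mostly need is to search the document tree for the elements we are after.

Accessing a tag as an attribute, for example soup.li, only returns the first tag of that name in the document. To get every <li> tag, use find_all(). The find_all() method searches all tag descendants of the current tag and keeps those that match the filter. It accepts the following arguments: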

find_all(name, attrs, recursive, string, limit, **kwargs)
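
As a small illustration of the difference between attribute access and find_all(), here is a sketch using a made-up three-item list (the markup is an assumption, not taken from any real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three <li> tags
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")

first_li = soup.li             # attribute access: only the first <li>
all_li = soup.find_all("li")   # a list of every <li>

first_text = first_li.get_text()
all_text = [li.get_text() for li in all_li]
print(first_text, all_text)
```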

Searching by name

The name argument finds every tag whose name is name. Text strings are ignored automatically.

>>> soup.find_all('b')
[<b>The Dormouse's story</b>]
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
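
The REPL sessions in this post rely on a soup object that is never defined. The sketch below assumes it was built from a document like the "three sisters" sample in the official documentation, and also shows that name can be a list of tag names.

```python
from bs4 import BeautifulSoup

# Assumed sample document, modelled on the one used in the official docs
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

bold_text = [b.get_text() for b in soup.find_all("b")]
link_ids = [a["id"] for a in soup.find_all("a")]
# A list matches any of the listed tag names, in document order
mixed_names = [tag.name for tag in soup.find_all(["a", "b"])]
print(bold_text, link_ids, mixed_names)
```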

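Searching by id

Any keyword argument that is not one of the built-in find_all() argument names is treated as a filter on the tag attribute of the same name. Passing id, for example, makes Beautiful Soup match against each tag's id attribute: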

>>> soup.find_all(id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
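
A keyword filter can take an exact value or True, which matches any tag that has the attribute at all. A minimal sketch on made-up markup (the third link deliberately has no id):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links with an id, one without
html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="http://example.com/other">No id</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")      # exact match on the id attribute
with_any_id = soup.find_all(id=True)   # any tag that has an id attribute

by_id_text = [tag.get_text() for tag in by_id]
any_ids = [tag["id"] for tag in with_any_id]
print(by_id_text, any_ids)
```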

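Searching by attrs

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes, because a name containing a hyphen is not a valid Python identifier. Such attributes can still be searched by passing a dictionary to the attrs parameter of find_all(). An id is an attribute too, so it can be matched the same way: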

>>> soup.find_all(attrs={'id':'link1'})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
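
For the data-* case itself, a short sketch follows. The attribute name data-foo and the markup are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one data-* attribute
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div>bar</div>', "html.parser"
)

# data_soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute is passed through the attrs dictionary instead.
matches = data_soup.find_all(attrs={"data-foo": "value"})
match_text = [tag.get_text() for tag in matches]
print(match_text)
```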

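Searching by CSS class

Searching for tags by CSS class name is very useful, but class is a reserved word in Python, so using it as a keyword argument is a syntax error. For that reason, starting with Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ parameter: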

>>> soup.find_all(class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
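
A tag can carry several CSS classes, and class_ matches against each of them individually. The sketch below, on made-up markup, also shows that the attrs dictionary form gives the same result as class_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag has two classes
html = (
    '<p class="body strikeout">old</p>'
    '<p class="body">new</p>'
    '<a class="sister">Elsie</a>'
)
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="sister")
by_attrs = soup.find_all(attrs={"class": "sister"})  # equivalent dictionary form

body_text = [p.get_text() for p in soup.find_all("p", class_="body")]
strikeout_text = [t.get_text() for t in soup.find_all(class_="strikeout")]
print(body_text, strikeout_text)
```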

The string parameter

The string parameter searches the text content of the document. Like name, it accepts a string, a regular expression, a list, or True.

>>> soup.find_all('a', string='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
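
To illustrate the other accepted value types, here is a sketch with a regular expression and a list. The markup is made up. Note that string used on its own returns the matching text nodes, while combining it with a tag name returns the tags.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with three links
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

# Tag name plus string: returns the <a> tags whose text matches the pattern
l_tags = soup.find_all("a", string=re.compile("^L"))
l_ids = [tag["id"] for tag in l_tags]

# string alone: returns the matching text nodes, not the tags
listed = [str(s) for s in soup.find_all(string=["Elsie", "Tillie"])]
print(l_ids, listed)
```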

The recursive parameter

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
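
A minimal sketch of the effect, on made-up markup where one <p> is a direct child of the <div> and another is nested one level deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct <p> child and one nested <p>
html = '<div id="outer"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

all_p = [p.get_text() for p in outer.find_all("p")]                      # all descendants
direct_p = [p.get_text() for p in outer.find_all("p", recursive=False)]  # direct children only
print(all_p, direct_p)
```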

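The find() method

The only difference from find_all() is the return value: find_all() returns a list of every match (an empty list when nothing matches), while find() returns just the first match (None when nothing matches).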
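
The difference in return values can be sketched as follows (the markup is made up, and no <table> exists in it on purpose):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links and no <table>
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                   # a single Tag
limited = soup.find_all("a", limit=1)    # a list holding that same Tag
no_tag = soup.find("table")              # None when nothing matches
no_list = soup.find_all("table")         # empty list when nothing matches

print(first["id"], len(limited), no_tag, list(no_list))
```

Because find() can return None, check the result before accessing attributes or calling methods on it.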

The get_text() method

If you only want the text inside a tag, use get_text(). It returns all the text the tag contains, including the text of its descendant tags.

>>> soup.find_all('a', string='Elsie')[0].get_text()
'Elsie'
>>> soup.find_all('a', string='Elsie')[0].string
'Elsie'
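
In the example above get_text() and .string agree because the tag has a single text child. They differ once a tag has more than one child, as this sketch on made-up markup shows:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has two children, a text node and a <b> tag
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

full_text = p.get_text()         # text of the tag and all its descendants
p_string = p.string              # None, because <p> has more than one child
b_string = str(p.b.string)       # <b> has a single text child

print(repr(full_text), p_string, b_string)
```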

posted @ 2020-01-08 15:15 Na_years
