61. The BeautifulSoup Module

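【1】 Getting started

1) Introduction

Beautiful Soup is a Python library.
Its main use is extracting data from HTML or XML pages that have already been downloaded.
Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

# Install
pip install beautifulsoup4
# Import
from bs4 import BeautifulSoup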

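2) HTML parsers

A parser turns the page source into a tree of objects (similar to a DOM) that can then be searched.

Built-in parser: html.parser

# Usage (page_source is the HTML string)
soup = BeautifulSoup(page_source, 'html.parser')

Third-party parser: lxml

# Install
pip install lxml
# Usage
soup = BeautifulSoup(page_source, 'lxml')

Third-party parser: html5lib

# Install
pip install html5lib
# Usage
soup = BeautifulSoup(page_source, 'html5lib')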

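3) Parser comparison

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with bad markup	Not very lenient before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with bad markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package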
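
Since lxml and html5lib are optional installs, a script can fall back to the built-in parser when the faster one is missing. The following is a minimal sketch; the helper name make_soup and its preferred argument are made up for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: both parsers repair them
soup, used = make_soup("<p>Hello<b>world")
print(used, soup.b.get_text())
```

html.parser ships with Python, so the last entry in preferred should always succeed.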

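4) Example: creating a soup object

from bs4 import BeautifulSoup

# Read a saved page to stand in for source fetched from the web
with open("./test.html", "r", encoding="utf-8") as f:
    data = f.read()
# Build the soup object (the 'lxml' parser must be installed)
soup = BeautifulSoup(data, 'lxml')
# Extract all text content
text = soup.get_text()
print(text)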

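【2】 The four object types

1) Introduction

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object.
There are four main object types:
- BeautifulSoup
- Tag
- NavigableString
- Comment

2) The BeautifulSoup object

Represents the parsed document as a whole and is the top-level object.
It holds the entire content of the document and provides the methods and attributes for working with it.

soup = BeautifulSoup(page_source, 'html.parser')

3) The Tag object

Represents an HTML tag such as <p> or <div>.
A Tag carries the tag's name and attributes, and gives access to the tag's content and to further operations.
Tag objects are created for you when an HTML document is passed to the BeautifulSoup constructor.

0. Sample document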

<html><head>
    <title>The Dormouse's story</title>
</head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>

<p class="story">...</p>
</body>
</html>

1. Looking up Tag objects

# Print the head tag and its type
print(soup.head, type(soup.head))
# <head><title>The Dormouse's story</title></head>
# <class 'bs4.element.Tag'>

# Print the title tag and its type
print(soup.title, type(soup.title))
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>

# Print the first a tag and its type
print(soup.a, type(soup.a))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <class 'bs4.element.Tag'>

# Print the first b tag inside the first p tag
print(soup.p.b)
# <b>The Dormouse's story</b>

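2. Reading a tag's name and attributes

# Print the tag name of the first a tag
print(soup.a.name)
# a

# Print the tag name of the first b tag inside the first p tag
print(soup.p.b.name)
# b

# Print the href attribute of the first a tag
print(soup.a["href"])
# http://example.com/elsie

# Print all attributes of the first a tag as a dict
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# Print the class attribute of the first a tag (class is multi-valued, so it is a list)
print(soup.a["class"])
# ['sister']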

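3. Modifying a tag's attributes

# Change the class attribute of the first a tag (one list item per class)
soup.a["class"] = ["sister", "c1"]
# ['sister']	--->	['sister', 'c1']

# Delete the id attribute of the first a tag
del soup.a["id"]
print(soup)
# The first a tag no longer has an id attribute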
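
Indexing with tag["id"] raises KeyError once the attribute is gone, so reading optional attributes is usually safer with tag.get(). A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/elsie" class="sister big">Elsie</a>', "html.parser")
link = soup.a

href = link.get("href")             # attribute exists, returns its value
missing = link.get("id")            # attribute absent, returns None instead of raising
fallback = link.get("id", "no-id")  # an explicit default can be supplied
classes = link.get("class", [])     # multi-valued attribute comes back as a list
has_id = link.has_attr("id")

print(href, missing, fallback, classes, has_id)
```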

4. Getting a tag's text content

# Get the text inside the first p tag
print(soup.p.get_text())
# The Dormouse's story

# .strings is a generator; collect it into a list to see the values
print([i for i in soup.p.strings])
# ["The Dormouse's story"]

# Get the single string inside the tag
print(soup.p.b.string)
# The Dormouse's story
# If a tag has more than one child node, .string returns None
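
To make the None case concrete, here is a short sketch on made-up markup comparing .string with get_text():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b></p><div>two <i>three</i></div>", "html.parser")

single = soup.p.string       # <p> has exactly one child, so .string passes through to <b>
mixed = soup.div.string      # <div> has two children (text + <i>), so .string is None
joined = soup.div.get_text() # get_text() concatenates every string below the tag

print(repr(single), repr(mixed), repr(joined))
```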

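4) The NavigableString object

Represents the text inside a tag, that is, a string that is not itself a tag.
When a tag contains a single string, tag.string returns it as a NavigableString; tag.text returns the same text as a plain str.

# Get the text of the first p tag
print(soup.p.string)
# The Dormouse's story

# Iterate over every string under the first p tag
print(soup.p.strings)
# <generator object Tag._all_strings at 0x102b7b300>
for i in soup.p.strings:
    print(i)
# The Dormouse's story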

5) The Comment object

Represents a comment in the HTML document.
When the parser meets an HTML comment, it wraps the comment text in a Comment object.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# Print the comment text and its type
comment = soup.b.string
print(comment,type(comment))
# Hey, buddy. Want to buy a used parser?
# <class 'bs4.element.Comment'>
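
Because Comment is a subclass of NavigableString, comments can be picked out with isinstance and removed before text extraction. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<p>visible<!-- hidden note --></p>", "html.parser")

# Collect the comment nodes first, then modify the tree
comments = [node for node in soup.descendants if isinstance(node, Comment)]
comment_texts = [c.strip() for c in comments]

for c in comments:
    c.extract()  # detach the comment from the tree

cleaned = str(soup)
print(comment_texts, cleaned)
```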

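【3】 Working with the document tree

1) Concepts

Traversing (or navigating) the document tree means visiting the nodes of a Document Object Model (DOM) according to a fixed set of rules.
The DOM is a standard programming interface for XML and HTML documents; it represents a parsed document as a tree of nodes and objects.
While traversing, you can read the current node and its attributes, children, parent, and siblings in order to inspect or modify the document.

2) A common tree-traversal procedure

1. Choose a starting node:
- This can be the root of the whole document or any specific node.
2. Visit the current node:
- Starting from the start node, read the current node's tag name, attributes, text, and so on.
3. Process the current node:
- Do whatever work is needed, such as checking the node type or running a specific task.
4. Traverse child nodes:
- If the current node has children, recurse into them starting from the first child, repeating steps 2 and 3.
5. Traverse sibling nodes:
- If the current node has no children, or its children are done, move on to its next sibling, repeating steps 2 and 3.
6. Return to the parent node:
- Once all siblings of a node have been visited, go back to its parent and continue with the parent's next sibling.
7. Stop:
- Finish when every node in the tree has been visited or when some stop condition is met.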
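
The procedure above is a depth-first traversal. One way to sketch it with Beautiful Soup is a recursive function over .children; the function name walk and the (depth, label) output format are illustrative choices, not library features.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def walk(node, depth=0, out=None):
    """Depth-first walk: record (depth, label) for each tag and non-blank text node."""
    if out is None:
        out = []
    for child in node.children:            # siblings are visited in document order
        if isinstance(child, Tag):
            out.append((depth, child.name))  # visit / process the current node
            walk(child, depth + 1, out)      # recurse into its children
        elif isinstance(child, NavigableString) and child.strip():
            out.append((depth, child.strip()))
    return out                              # returning unwinds back to the parent


soup = BeautifulSoup("<div><p>Hello <b>world</b></p><p>Bye</p></div>", "html.parser")
visited = walk(soup)
for depth, label in visited:
    print("  " * depth + label)
```

For a flat walk without depth information, the built-in .descendants generator visits nodes in the same order.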

3) Usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

1. Getting a tag's name

# `tag.name` gives the name of the current tag
print(soup.p.name)
# p

2. Getting a tag's attributes

# `tag.attrs` gives the current tag's attributes as a dict
print(soup.p.attrs)
# {'class': ['title'], 'name': 'first_p'}

3. Getting a tag's content

tag.string

# `tag.string` gives the text inside the current tag.
# It only works when the tag contains a single string.
print(soup.p.string)
# The Dormouse's story

tag.strings

# `tag.strings` is a generator over every string inside the current tag.
print(soup.p.strings)
# <generator object Tag._all_strings at 0x120eff370>
print(list(soup.p.strings))
# ["The Dormouse's story"]

tag.text

# `tag.text` joins the text of every descendant into one string.
print(soup.p.text)
# The Dormouse's story

tag.stripped_strings

# `tag.stripped_strings` yields the same strings with surrounding whitespace removed; whitespace-only strings are skipped.
# It is called on the whole soup here, so the title and the bold heading come first.
for line in soup.stripped_strings:
    print(line)
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie
# ,
# Lacie
# and
# Tillie
# ;
# and they lived at the bottom of a well.
# ...
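
When the goal is one clean line of text rather than a list of pieces, get_text() accepts a separator and a strip flag. A short sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello <b> big </b> world </p>", "html.parser")

# Strip each string, drop empty ones, and join the rest with a single space
clean = soup.p.get_text(separator=" ", strip=True)
# The same pieces, kept separate
pieces = list(soup.p.stripped_strings)

print(repr(clean), pieces)
```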

4. Chained (nested) selection

print(soup.head.title.text)
# Output: The Dormouse's story

print(soup.body.a.text)
# Output: Elsie

5. Children and descendants

# All direct children of p, as a list
print(soup.p.contents)
# [<b>The Dormouse's story</b>]

# An iterator over the direct children of p
print(soup.p.children)
# <list_iterator object at 0x123583fa0>
for i, child in enumerate(soup.p.children, 1):
    print(i, child)
# 1 <b>The Dormouse's story</b>

# Descendants: every tag and every text node below p, at any depth
print(soup.p.descendants)
# <generator object Tag.descendants at 0x12327f300>
for i, child in enumerate(soup.p.descendants, 1):
    print(i, child)
# 1 <b>The Dormouse's story</b>
# 2 The Dormouse's story

# Walk the descendants of the second p tag
for i, child in enumerate(soup.find_all("p")[1].descendants, 1):
    print(i, child)
# 1 Once upon a time there were three little sisters; and their names were
# 2 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 3 Elsie
# 4 ,
# 5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 6 Lacie
# 7  and
# 8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 9 Tillie
# 10 ;
# and they lived at the bottom of a well.
# (item 10 is a single text node that contains a line break)

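6. Parents and ancestors

# Get the parent of the first a tag
print(soup.a.parent)

# Get the text of that parent
print(soup.a.parent.text)

# .parents is a generator over all ancestors: the parent, the parent's parent, and so on
print(soup.a.parents)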
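
Printing .parents only shows a generator object. To see the ancestor chain, iterate over it, as in this sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the a tag; the last ancestor is the BeautifulSoup object itself
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)
```

The final entry, "[document]", is the name Beautiful Soup gives to the top-level BeautifulSoup object.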

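7. Siblings

print(soup.a.next_sibling)
# Prints "," and a line break: the next sibling is a NavigableString, not a tag

print(soup.a.next_sibling.next_sibling)
# The next sibling tag (the second a tag)

print(soup.a.previous_sibling.previous_sibling)
# None here: the previous sibling is the leading text node, and nothing comes before it

print(list(soup.a.next_siblings))
# All following siblings (.next_siblings is a generator, so it is wrapped in list())

print(soup.a.previous_siblings)
# A generator over all preceding siblings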
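
Because text between tags counts as a sibling, .next_sibling often returns punctuation or whitespace. find_next_sibling() skips straight to the next matching tag. A sketch on made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser")
first = soup.a

raw_next = first.next_sibling            # the text node between the two links
next_tag = first.find_next_sibling("a")  # the next sibling that is an a tag

print(repr(raw_next), next_tag["id"])
```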

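【4】 Searching the document tree

1) Introduction

BeautifulSoup defines many search methods; this section focuses on two:
- find() and find_all()
- the other methods take similar arguments and are used the same way

Commonly used arguments:
recursive: whether to search all descendants of the current position; with recursive=False only direct children are checked
string: matches on the text inside tags and returns the matching strings, not Tag objects
kwargs: any other keyword argument is treated as an attribute filter, which is why soup.find_all('a', id='title') works: id='title' is collected by **kwargs

2) Usage

import re

from bs4 import BeautifulSoup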

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="first_p"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

1. Finding all matches with find_all()

By tag name

# Find every tag with the given name
print(soup.find_all("a"))
# Narrow the match with an attribute filter; every a tag whose class is "sister" is returned
print(soup.find_all("a", attrs={"class": "sister"}))

By regular expression

# Tags whose name starts with "b" (body and b in this document)
print(soup.find_all(name=re.compile("^b")))

By list

print(soup.find_all(name=["a", 'b']))

By function

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(name=has_class_but_no_id))
# Returns only tags that have a class attribute but no id attribute

True
- find_all(True) matches every tag and never returns string nodes.

# Print the name of every tag in the document
print([tag.name for tag in soup.find_all(name=True)])

Keyword arguments
- Keyword arguments search by attribute value

# Exact attribute value:
print(soup.find_all(href="http://example.com/elsie"))
# Returns every tag whose href equals "http://example.com/elsie".
# Regular expression on the attribute value:
soup.find_all(href=re.compile("^http://"))
# Returns every tag whose href starts with "http://".
# Several attributes at once:
soup.find_all(href=re.compile("http://"), id='link1')
# Returns tags whose href contains "http://" and whose id equals "link1".

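The string argument (called text in older versions)
- string searches by text content and accepts a string, a list, or a regular expression; text is the older name for the same argument.
- The result is a list of the matching strings (NavigableString objects), not tags.

# String:
soup.find_all(string="Elsie")
# ['Elsie']

# List:
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

# Regular expression:
soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]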
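
A text search hands back strings, so getting to the enclosing tag takes one more step through .parent. A minimal sketch on made-up markup:

```python
import re

from bs4 import BeautifulSoup

html = "<div><h1>The Dormouse's story</h1><p>A story about a <b>Dormouse</b>.</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Each match is a NavigableString; .parent is the tag that directly contains it
matches = soup.find_all(string=re.compile("Dormouse"))
parents = [s.parent.name for s in matches]

print(matches, parents)
```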

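The limit argument
- find_all() returns every match, which can be slow on a large document.
- If you do not need all results, use limit to cap how many are returned.
- It works like the LIMIT keyword in SQL.
- The search stops as soon as the number of results reaches the limit.

print(soup.find_all("a", limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument
- recursive controls whether the search descends below direct children.
- By default Beautiful Soup examines all descendants of the current tag.
- Set recursive=False to search only the tag's direct children.

print(soup.find_all("p", recursive=False))
# [] because the only tag directly under soup is <html>
print(soup.body.find_all("p", recursive=False))
# The three p tags, which are direct children of body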

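2. Finding a single match with find()

find( name , attrs , recursive , string , **kwargs )

name: the tag name to look for; a string, regular expression, list, function, or True.
attrs: a dict of attribute filters.
recursive: whether to search all descendants; defaults to True.
string: the text content to look for; a string, list, or regular expression.

3. find_all() versus find()

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

find_all() returns a list, here containing a single element; with no match it returns an empty list.
find() returns the element itself; with no match it returns None.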
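
The difference matters most when nothing matches, since calling a method on the None returned by find() raises AttributeError. A small sketch of the usual guard:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

missing_one = soup.find("h1")      # None: no h1 in the document
missing_all = soup.find_all("h1")  # empty list

# Guard before touching the result of find()
heading = soup.find("h1")
heading_text = heading.get_text() if heading is not None else "(no heading)"

print(missing_one, missing_all, heading_text)
```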

4. Chaining tag names

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

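5. Related methods

find_parents() and find_parent()
- find_parents():
  - Returns all matching ancestor tags as a list.
  - Accepts the same filter arguments as find_all().
- find_parent():
  - Returns the first matching ancestor tag.
find_next_siblings() and find_next_sibling()
- find_next_siblings():
  - Returns all matching later siblings as a list.
  - Accepts the same filter arguments as find_all().
- find_next_sibling():
  - Returns the first matching later sibling.
find_all_next() and find_next()
- find_all_next():
  - Returns all matching tags and strings that come later in the document, as a list.
  - Accepts the same filter arguments as find_all().
- find_next():
  - Returns the first matching tag or string that comes later in the document.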
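
A compact sketch of these methods on made-up markup, starting from the first li tag:

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><ul><li>one</li><li>two</li></ul><p>after</p></div>'
soup = BeautifulSoup(html, "html.parser")
first_li = soup.li

# Nearest ancestor that is a div
outer_id = first_li.find_parent("div")["id"]
# All ancestors named ul or div, nearest first
ancestor_names = [t.name for t in first_li.find_parents(["ul", "div"])]
# Next sibling that is an li
next_li = first_li.find_next_sibling("li").get_text()
# Every p tag that appears later in the document, sibling or not
later_p = [t.get_text() for t in first_li.find_all_next("p")]

print(outer_id, ancestor_names, next_li, later_p)
```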

【5】 CSS selectors

1) Official documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37

2) select

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
    <b>The Dormouse's story</b>
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <div class='panel-1'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'><h1 class='yyyy'>Foo</h1></li>
            <li class='element xxx'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
    </div>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

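# Note: the outputs shown in the ''' blocks follow the simpler sample from the official docs;
# with the markup above the printed result may differ (for example, the first link wraps its text in <span>).

# Select the title element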
soup.select("title")  
'''
[<title>The Dormouse's story</title>]
'''

# Select the third p element (the one with class "story")
soup.select("p:nth-of-type(3)")  
'''
[<p class="story">...</p>]
'''

# Select every a element anywhere under body
soup.select("body a")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# Select title elements nested under html and head
soup.select("html head title")  
'''
[<title>The Dormouse's story</title>]
'''

# Select title elements that are direct children of head
soup.select("head > title")  
'''
[<title>The Dormouse's story</title>]
'''

# All a tags that are direct children of a p tag
soup.select("p > a")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# The second a tag among the direct children of a p tag
soup.select("p > a:nth-of-type(2)")
'''
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''

# The direct child of a p tag that has id="link1"
soup.select("p > #link1")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
'''

# All a tags that are direct children of body (none in this document, so the result is empty)
soup.select("body > a")

# All later siblings of the tag with id="link1" that have class "sister"
soup.select("#link1 ~ .sister")
'''
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]
'''

# The sibling immediately after the tag with id="link1", if it has class "sister"
soup.select("#link1 + .sister")
'''
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''

# All tags with class "sister"
soup.select(".sister")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# All tags whose class attribute contains the word "sister"
soup.select("[class~=sister]")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# The tag with id="link1"
soup.select("#link1")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
'''

# The a tag with id="link2"
soup.select("a#link2")
'''
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''

# Tags with id="link1" or id="link2"
soup.select("#link1,#link2")
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''

# All a tags that have an href attribute
soup.select('a[href]')
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# a tags whose href equals "http://example.com/elsie"
soup.select('a[href="http://example.com/elsie"]')
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
'''

# a tags whose href starts with "http://example.com/"
soup.select('a[href^="http://example.com/"]')
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# a tags whose href ends with "tillie"
soup.select('a[href$="tillie"]')
'''
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
'''

# a tags whose href contains ".com/el"
soup.select('a[href*=".com/el"]')
'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
'''


multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')

# Select p tags whose lang is exactly "en" or starts with "en-"
multilingual_soup.select('p[lang|=en]')
'''
[<p lang="en">Hello</p>,
<p lang="en-us">Howdy, y'all</p>,
<p lang="en-gb">Pip-pip, old fruit</p>]
'''

3) select_one

Returns only the first element that matches the selector, or None if nothing matches.

# Get the attributes
print(soup.select_one('#list-2 h1').attrs)
# Get the text
print(soup.select_one('#list-2 h1').get_text())
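
The two methods side by side, as a sketch on a trimmed copy of the list markup (html.parser is used here so the example runs without lxml):

```python
from bs4 import BeautifulSoup

html = """
<ul class="list list-small" id="list-2">
  <li class="element"><h1 class="yyyy">Foo</h1></li>
  <li class="element xxx">Bar</li>
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, possibly empty
items = [li.get_text(strip=True) for li in soup.select("#list-2 > li.element")]

# select_one() returns the first match, or None when there is none
first = soup.select_one("#list-2 li.xxx")
missing = soup.select_one("#list-2 li.zzz")

print(items, first.get_text(), missing)
```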

【6】 Case study

# Douban Top250
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup


class SpiderTop250(object):
    def __init__(self):
        # 【1】Target URLs, one per page
        # self.target_url = "https://movie.douban.com/top250"
        # https://movie.douban.com/top250?start=25&filter=
        # https://movie.douban.com/top250?start=50&filter=

        # 【2】Request headers
        self.headers = {
            "User-Agent": UserAgent().random
        }

    def create_target_url(self):
        target_url_list = []
        # Build the URL for each of the 10 pages
        for i in range(1, 11):
            if i == 1:
                target_url = "https://movie.douban.com/top250"
                target_url_list.append(target_url)
            else:
                target_url = f"https://movie.douban.com/top250?start={25 * (i - 1)}&filter="
                target_url_list.append(target_url)

        return target_url_list
    
    # Write the page source to a file
    def save(self, page_text):
        with open("doubai.html", "w", encoding="utf-8") as fp:
            fp.write(page_text)

    # Read the page source back from the file
    def read(self):
        with open("doubai.html", "r", encoding="utf-8") as fp:
            data = fp.read()
        return data

    def main(self):
        target_url_list = self.create_target_url()
        top_data = {}
        for target_url in target_url_list:
            top_data.update(self.spider_page_text(target_url))
        return top_data

    def spider_page_text(self, target_url):
        # 【3】Send the request and get the response object
        response = requests.get(
            url=target_url,
            headers=self.headers
        )
        # 【4】Get the source text of the response
        page_text = response.text
        return self.parse_page(page_text)

    def parse_page(self, page_text):
        # 【5】Build the soup object
        soup = BeautifulSoup(page_text, "lxml")
        # 【6】Select the li tag that holds each movie's information
        li_list = soup.select("#content > div > div.article > ol > li")
        # 【7】Walk each li tag and pull out its content
        top_data = {}
        for li in li_list:
            # Extract the fields and tidy them up
            info_list = [span.text.strip() for span in li.select("div > div.info > div.hd > a > span")]
            movie_title = info_list[0]
            movie_director = info_list[1].strip("/").strip()
            movie_type = None
            if len(info_list) == 3:
                movie_type = ", ".join([i.strip() for i in info_list[2].strip("/").strip().split("/")])
            movie_actors = [i.strip().replace("\xa0", "") for i in
                            list(li.select("div > div.info > div.bd > p:nth-child(1)")[0].strings)]
            movie_star = li.select("div > div.info > div.bd > div > span.rating_num")[0].text
            movie_comment_num = li.select("div > div.info > div.bd > div > span:nth-child(4)")[0].text
			
            # Some movies have no quote; select() then returns an empty list and [0] raises IndexError
            try:
                movie_motto = li.select("div > div.info > div.bd > p.quote > span")[0].text
            except IndexError:
                movie_motto = ""

            top_data[movie_title] = {
                "movie_title": movie_title,
                "movie_director": movie_director,
                "movie_type": movie_type,
                "movie_actors": movie_actors,
                "movie_star": movie_star,
                "movie_comment_num": movie_comment_num,
                "movie_motto": movie_motto
            }
        return top_data

if __name__ == '__main__':
    s = SpiderTop250()
    top_data = s.main()
    print(len(top_data))
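
The parsing step can be tried without any network access. The sketch below runs on a simplified, hypothetical fragment (SAMPLE is not Douban's real markup) and uses select_one() with a default in place of try/except for the optional quote; text_or_default and parse_movies are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for offline testing only
SAMPLE = """
<ol>
  <li><div class="info">
    <span class="title">Movie A</span>
    <span class="rating_num">9.7</span>
    <p class="quote"><span>A short tagline.</span></p>
  </div></li>
  <li><div class="info">
    <span class="title">Movie B</span>
    <span class="rating_num">9.6</span>
  </div></li>
</ol>
"""


def text_or_default(parent, selector, default=""):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


def parse_movies(page_text):
    soup = BeautifulSoup(page_text, "html.parser")
    movies = []
    for li in soup.select("ol > li"):
        movies.append({
            "title": text_or_default(li, "span.title"),
            "rating": text_or_default(li, "span.rating_num"),
            "motto": text_or_default(li, "p.quote > span"),  # optional field
        })
    return movies


movies = parse_movies(SAMPLE)
for movie in movies:
    print(movie)
```

Keeping parsing separate from fetching in this way lets the selectors be checked against a saved page before any requests are sent.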

posted on 2024-08-01 10:36 晓雾-Mist
