Python Web Scraping Study Notes (6)
BS4:
Reference documentation:
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
Test1 (basic usage):
Sample document:
"""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Test code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
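# 1. Convert the HTML string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed and emits a warning
# So name the parser explicitly (here: lxml)
soup = BeautifulSoup(html_doc, 'lxml')
# 2. Pretty-print the tree (missing closing tags are filled in)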
result = soup.prettify()
print(result)
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
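The parser is just the second argument to BeautifulSoup(), so it can be swapped. A minimal sketch, assuming lxml might not be installed: it uses Python's built-in 'html.parser' instead (a substitution made here for portability, not the parser used in these notes) on a shortened copy of the sample document.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document; closing tags are left out on purpose
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
"""

# 'html.parser' ships with Python, so nothing extra has to be installed.
# Naming any parser explicitly also avoids the "no parser specified" warning.
soup = BeautifulSoup(html_doc, 'html.parser')

title_text = soup.title.string
bold_text = soup.b.string
print(title_text)
print(soup.prettify())
```

Different parsers can build slightly different trees from broken HTML, so it is worth sticking to one parser within a project.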
Test2 (reading content):
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
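# 1. Convert the HTML string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed and emits a warning
# So name the parser explicitly (here: lxml)
soup = BeautifulSoup(html_doc, 'lxml')
# 2. Parse the data
result1 = soup.head
result2 = soup.p
result3 = soup.a
print(result1)
print(result2)
print(result3)
# 3. Read the tag's text content
result4 = soup.a.string
print(result4)
# 4. Read an attribute value
result5 = soup.a['href']
print(result5)
Output: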
<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
http://example.com/elsie
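Note:
As the output shows, accessing a tag as an attribute (soup.p, soup.a) returns only the first matching tag in the document.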
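To reach the tags after the first one, find_all() can be used instead of attribute access. A minimal sketch on a shortened copy of the sample document, using the built-in 'html.parser' (an assumption made here so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.a              # attribute access: only the first <a>
all_links = soup.find_all('a')   # every <a> in document order

hrefs = [a['href'] for a in all_links]
names = [a.get_text() for a in all_links]

print(first_link['id'])
print(hrefs)
print(names)
```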
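Test3 (the four object types):
The four object types:
Beautiful Soup turns a complex HTML document into a tree,
in which every node is a Python object.
All of these objects fall into four types:
Tag, NavigableString, BeautifulSoup, Comment.
Code: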
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
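# 1. Convert the HTML string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed and emits a warning
# So name the parser explicitly (here: lxml)
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
# 2. Parse the data
# Tag object => bs4.element.Tag
result1 = soup.head
# The first <p> holds only a comment, so .string returns a Comment
result2 = soup.p.string
print(result2)
result3 = soup.a
print(type(result1))
# Type of the comment content => bs4.element.Comment
print(type(result2))
print(type(result3))
# 3. Read the text content => NavigableString
result4 = soup.a.string
print(type(result4))
# 4. Read an attribute value => str
result5 = soup.a['href']
print(type(result5))
print(type(soup))
Output: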
<class 'bs4.BeautifulSoup'>
s1mpL3...
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>
<class 'bs4.BeautifulSoup'>
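One practical consequence of these types: Comment is a subclass of NavigableString, so comments show up whenever text nodes are collected. A minimal sketch of telling them apart with isinstance(), again on a shortened sample and with the built-in 'html.parser' (an assumption made here for portability):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html_doc = (
    '<p class="story"><!--s1mpL3...--></p>'
    '<p class="title"><b>The Dormouse\'s story</b></p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

comment_node = soup.p.string   # the first <p> contains only a comment
text_node = soup.b.string      # ordinary text inside <b>

# Comment inherits from NavigableString, so test for Comment first
is_comment = isinstance(comment_node, Comment)
comment_is_string = isinstance(comment_node, NavigableString)
text_is_comment = isinstance(text_node, Comment)

# Collect every text node in the document, skipping comments
visible = [
    str(node)
    for node in soup.find_all(string=True)
    if not isinstance(node, Comment)
]

print(is_comment, comment_is_string, text_is_comment)
print(visible)
```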
Test4 (general-purpose methods - find()):
Overview:
find() returns the first node that matches the query, or None if nothing matches
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
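# 1. Convert the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# find() returns the first node that matches the query
result1 = soup.find(name="p")
result2 = soup.find(attrs={"class": "title"})
# string= matches on text content; before bs4 4.4.0 it was called text=
result3 = soup.find(string="Tillie")
result4 = soup.find(
    name="p",
    attrs={"class": "title"},
)
print(result1)
print(result2)
print(result3)
print(result4)
Output: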
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
Tillie
<p class="title"><b>The Dormouse's story</b></p>
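Because find() returns None when nothing matches, chaining a method call straight onto the result can raise AttributeError. A minimal sketch of guarding against that; the `<table>` lookup is a made-up miss for illustration, and 'html.parser' is used so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

missing = soup.find(name="table")                       # no <table> in this document
found = soup.find(name="p", attrs={"class": "title"})   # matches the only <p>

# Check for None before calling methods on the result
table_text = missing.get_text() if missing is not None else ""
title_text = found.get_text() if found is not None else ""

print(repr(table_text))
print(title_text)
```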
Test5 (general-purpose methods - find_all()):
Overview:
find_all() returns every matching tag as a list (a ResultSet)
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
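# 1. Convert the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# find_all() returns every matching tag as a list
result1 = soup.find_all('a')
# find() is essentially this call, except that it returns None when nothing matches
result2 = soup.find_all("a", limit=1)[0]
result3 = soup.find_all(attrs={"class": "sister"})
print(result1)
print(result2)
print(result3)
Output: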
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
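find_all() accepts a few other filter forms beyond those shown. A minimal sketch of three of them (the class_ keyword, an id keyword, and a list of tag names), on a shortened sample with the built-in 'html.parser' (an assumption made here for portability):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# class_ has a trailing underscore because "class" is a Python keyword
first_two = soup.find_all('a', class_='sister', limit=2)
first_two_ids = [a['id'] for a in first_two]

# Any other keyword argument is treated as an attribute filter
by_id = soup.find_all(id='link3')

# A list of names matches any of them, in document order
tag_names = [tag.name for tag in soup.find_all(['title', 'b'])]

print(first_two_ids)
print(by_id)
print(tag_names)
```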
Test6 (general-purpose methods - select_one()):
Overview:
select_one() takes a CSS selector and returns the first matching tag
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
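# 1. Convert the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# select_one() takes a CSS selector and returns the first matching tag
# Reading its source shows that it applies a limit, i.e. limit=1
result1 = soup.select_one('.sister')
print(result1)
Output: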
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
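The relationship between select_one() and select() can be checked directly. A minimal sketch, assuming a bs4 version whose select() accepts a limit argument; the '.brother' selector is a made-up miss for illustration, and 'html.parser' is used so the snippet runs without lxml:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.select_one('.sister')        # a single Tag
limited = soup.select('.sister', limit=1)  # a list holding at most one Tag
no_match = soup.select_one('.brother')     # nothing has this class

print(first['id'])
print(len(limited), limited[0]['id'])
print(no_match)
```

As with find(), a miss gives None rather than an empty list, so the result needs a check before use.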
Test7 (general-purpose methods - select()):
Overview:
select() takes a CSS selector and returns every matching tag as a list
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
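# 1. Convert the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# select() takes a CSS selector and returns a list of matching tags
result1 = soup.select('.sister')         # by class
result2 = soup.select('#one')            # by id
result3 = soup.select('head title')      # descendant: <title> inside <head>
result4 = soup.select('title, .title')   # group: matches either selector
result5 = soup.select('a[id="link3"]')   # by attribute value
print(result1)
print(result2)
print(result3)
print(result4)
print(result5)
Output: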
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
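A few more selector forms may be useful alongside the ones above. A minimal sketch of the child combinator, attribute prefix and suffix matches, and :nth-of-type, assuming a bs4 installation with its usual soupsieve dependency and using 'html.parser' for portability:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Child combinator: <a> tags that are direct children of <p class="story">
children = [a['id'] for a in soup.select('p.story > a')]

# Attribute prefix (^=) and suffix ($=) matches on href
prefix = [a['id'] for a in soup.select('a[href^="http://example.com/l"]')]
suffix = [a['id'] for a in soup.select('a[href$="tillie"]')]

# Positional: the second <a> among its siblings
second = [a['id'] for a in soup.select('a:nth-of-type(2)')]

print(children)
print(prefix)
print(suffix)
print(second)
```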
Test8 (general-purpose methods - get_text()):
Code:
# coding=gbk
from bs4 import BeautifulSoup
html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
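# 1. Convert the HTML string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# Text wrapped by the tag; select() returns a list, so index into it first
result1 = soup.select('b')[0].get_text()
# An attribute of the tag
result2 = soup.select('#link1')[0].get('href')
print(result1)
print(result2)
Output: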
The Dormouse's story
http://example.com/elsie
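Two details of these calls are easy to miss: get_text() joins the text of all nested tags, and get() returns None for a missing attribute where square brackets would raise KeyError. A minimal sketch on a compact two-link fragment (written for this example), using 'html.parser' for portability:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, 'html.parser')
paragraph = soup.select('p')[0]
link = soup.select('#link1')[0]

# get_text() concatenates every nested string; separator= controls the glue
joined = paragraph.get_text()
separated = paragraph.get_text(separator=', ')

# get() is the forgiving way to read an attribute
missing_attr = link.get('title')   # the tag has no title attribute
classes = link.get('class')        # class is multi-valued, so this is a list

try:
    link['title']
    raised = False
except KeyError:
    raised = True

print(joined)
print(separated)
print(missing_attr, classes, raised)
```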
XML:
Data interchange format:
A format for exchanging data between the front end, mobile clients, and the back end
Parameters:
server, [ ], dict = {}
key = value
<key>value</key>
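The same key/value data can be written either as a dict (serialized to JSON) or as XML elements of the form <key>value</key>. A minimal sketch with Python's standard library; the record contents and the "record" root element name are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record that a server might send to a client
record = {"name": "Elsie", "url": "http://example.com/elsie"}

# dict form: key = value, serialized as JSON
as_json = json.dumps(record)

# XML form: one <key>value</key> element per entry
root = ET.Element("record")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = value
as_xml = ET.tostring(root, encoding="unicode")

# Reading the XML back gives the original dict
parsed = {child.tag: child.text for child in ET.fromstring(as_xml)}

print(as_json)
print(as_xml)
print(parsed)
```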