BeautifulSoup
Fetching the page source
import requests
from bs4 import BeautifulSoup

kv = {'user-agent': 'Mozilla/5.0'}
url = "https://python123.io/ws/demo.html"
r = requests.get(url, headers=kv)
print(r.status_code)
demo = r.text
print(demo)
soup = BeautifulSoup(demo, "html.parser")  # parse the HTML
print(soup.prettify())
200
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
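The fetch-and-parse steps above can be wrapped in a small helper. This is only a sketch: the function name fetch_soup and the timeout value are assumptions added for illustration, not part of the original notes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a parsed BeautifulSoup object."""
    headers = {"user-agent": "Mozilla/5.0"}
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()              # raise an exception on 4xx/5xx responses
    r.encoding = r.apparent_encoding  # let requests guess the text encoding
    return BeautifulSoup(r.text, "html.parser")


# Example call (needs network access):
# soup = fetch_soup("https://python123.io/ws/demo.html")
# print(soup.title)
```

Calling raise_for_status() turns a failed request into an exception rather than silently parsing an error page.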
Using BeautifulSoup
BeautifulSoup parsers
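The second argument to BeautifulSoup selects the parser. A minimal sketch, assuming only the built-in "html.parser" is guaranteed to be available; "lxml" is an optional third-party parser, so the example falls back if it is not installed.

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>Hello</b></p>"

# "html.parser" ships with Python, so it needs no extra install.
soup = BeautifulSoup(html, "html.parser")
print(soup.p.b.string)

# "lxml" only works if the lxml package is installed (pip install lxml).
try:
    soup_lxml = BeautifulSoup(html, "lxml")
except Exception:  # bs4 raises FeatureNotFound when the parser is missing
    soup_lxml = soup
print(soup_lxml.b.string)
```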
Basic elements of the BeautifulSoup class
https://python123.io/ws/demo.html
import requests
from bs4 import BeautifulSoup

kv = {'user-agent': 'Mozilla/5.0'}
url = "https://python123.io/ws/demo.html"
r = requests.get(url, headers=kv)
print(r.status_code)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.title)
tag = soup.a  # soup.a returns only the first <a> tag
print(tag)
200
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)
a
p
body
print(tag.attrs)
print(tag.attrs['href'])
print(type(tag.attrs))  # attrs is a dict
print(type(tag))
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
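Because attrs is a plain dict, a missing key raises KeyError. The sketch below, which uses a single inline <a> tag copied from the demo page, shows the shorthand tag["href"] together with get() and has_attr() for attributes that may be absent.

```python
from bs4 import BeautifulSoup

html = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]          # shorthand for tag.attrs["href"]; KeyError if absent
classes = tag.get("class")  # class is multi-valued, so it comes back as a list
missing = tag.get("title")  # get() returns None instead of raising
has_id = tag.has_attr("id")

print(href)
print(classes)
print(missing)
print(has_id)
```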
tag = soup.a
print(tag)
print(tag.string)
tag1 = soup.p
print(tag1)
print(tag1.string)
tag2 = soup.b
print(tag2)
print(tag2.string)
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
print(type(tag2.string))
<class 'bs4.element.NavigableString'>
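One caveat worth illustrating: .string works above because each tag has a single child. When a tag has several children, .string is None and get_text() can be used instead. A small sketch with a shortened, made-up version of the course paragraph:

```python
from bs4 import BeautifulSoup

html = '<p class="course">Courses: <a>Basic Python</a> and <a>Advanced Python</a>.</p>'
p = BeautifulSoup(html, "html.parser").p

single = p.a.string     # <a> has exactly one child, so .string is its text
multiple = p.string     # <p> has several children, so .string is None
full_text = p.get_text()  # concatenates every string inside the tag

print(single)
print(multiple)
print(full_text)
```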
soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(soup.b.string)
print(type(soup.b.string))
print(soup.p.string)
print(type(soup.p.string))
This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>
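Since printing a Comment looks the same as printing ordinary text, the type has to be checked explicitly. A minimal sketch; the helper name visible_text is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_text(tag):
    """Return tag.string as str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


b_text = visible_text(soup.b)
p_text = visible_text(soup.p)
print(b_text)
print(p_text)
```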
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
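The tag tree of the HTML above is as follows:
html
    head
        title
    body
        p (class="title")
            b
        p (class="course")
            a (id="link1")
            a (id="link2")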
Three ways to traverse the tree
Downward traversal
soup = BeautifulSoup(demo, "html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)  # .contents returns a list
print(len(soup.body.contents))
print(soup.body.contents[1])
<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5
<p class="title"><b>The demo python introduces several python courses.</b></p>
for child in soup.body.children:
    print(child)  # iterate over the direct children
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
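The children of body include the '\n' strings between tags, which is why len(soup.body.contents) is 5 rather than 2. A sketch of keeping only the Tag children, using a shortened inline body instead of the downloaded page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<body>\n<p class="title">A</p>\n<p class="course">B</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.body.children)  # tags plus the '\n' strings
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children))                   # 5
print([c["class"] for c in tag_children])  # [['title'], ['course']]
```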
for child in soup.body.descendants:
    print(child)  # iterate over all descendants
<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.
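The descendants iterator mixes tags and strings, as the output above shows. A sketch that separates the two, assuming a shortened, made-up version of the course paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="course">Python courses:\n'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Tags found anywhere below <p>, in document order.
tag_names = [n.name for n in soup.p.descendants if isinstance(n, Tag)]

# Non-empty text nodes below <p>, with surrounding whitespace trimmed.
texts = [
    n.strip()
    for n in soup.p.descendants
    if not isinstance(n, Tag) and n.strip()
]

print(tag_names)
print(texts)
```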
Upward traversal
for parent in soup.a.parents:  # iterate over the ancestors of the first <a> tag
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
soup = BeautifulSoup(demo, "html.parser")
tag = soup.a
print(tag)
print(tag.parent)
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Note: the parent of soup.html is the soup object itself (its name is [document]), and soup.parent is None.
for parent in soup.a.parents:  # iterate over the ancestors of the first <a> tag
    if parent is None:
        print(parent)
    else:
        print(parent.name)
p
body
html
[document]
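The top of the tree can be checked directly. The sketch below uses a minimal inline document; as far as I can tell the parents iterator stops at the BeautifulSoup object and does not yield None, so the None check in the loop above acts as a defensive guard.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestors = [parent.name for parent in soup.a.parents]
html_parent_is_soup = soup.html.parent is soup
top_parent = soup.parent

print(ancestors)            # ['p', 'body', 'html', '[document]']
print(html_parent_is_soup)  # True
print(top_parent)           # None
```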
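Sibling traversal
Sibling traversal happens among nodes that share the same parent.
A NavigableString between tags is also a node in the tag tree, so a node's children or siblings may be of type NavigableString rather than Tag.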
soup = BeautifulSoup(demo, "html.parser")
tag = soup.a
print(tag.next_sibling)
print(tag.next_sibling.next_sibling)
print(tag.previous_sibling)
print(tag.previous_sibling.previous_sibling)
print(tag.parent)
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
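Because text nodes count as siblings, stepping with next_sibling often lands on a string. A sketch, again on a shortened made-up paragraph, of iterating all following siblings and of find_next_sibling(), which skips straight to the next matching tag:

```python
from bs4 import BeautifulSoup

html = (
    '<p>Intro text:\n'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling                # a text node, not a tag
second = first.find_next_sibling("a")         # next sibling that is an <a> tag
following = [str(s) for s in first.next_siblings]
before_first = first.previous_sibling.previous_sibling  # nothing precedes the intro text

print(repr(next_node))
print(second["id"])
print(following)
print(before_first)
```

When a sibling does not exist the attribute is None, so chained lookups such as previous_sibling.previous_sibling need a None check in real code.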
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())  # prettify() puts each tag and string on its own line for readable output
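bs4 converts every document it reads to Unicode and uses UTF-8 as its default output encoding, so non-ASCII text such as Chinese needs no special handling.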
soup = BeautifulSoup("<p>你好</p>","html.parser")
print(soup.p.string)
print(soup.p.prettify())
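To make the encoding behavior concrete, the sketch below compares the Unicode string held in the tree with the bytes produced by encode(), whose default encoding is UTF-8 (passed explicitly here for clarity).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>你好</p>", "html.parser")

text = soup.p.string          # held as a Unicode string inside the tree
raw = soup.p.encode("utf-8")  # serialize the tag to UTF-8 bytes

print(text)
print(raw)
print(raw.decode("utf-8"))
```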