Beautiful Soup

Beautiful Soup: parsing HTML pages and extracting marked-up information

 

Fetching the page source

import requests
from bs4 import BeautifulSoup

kv = {'user-agent':'Mozilla/5.0'}
url = "https://python123.io/ws/demo.html"
r = requests.get(url,headers = kv)
print(r.status_code)
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse the page source
print(soup.prettify())

 

200
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

 

 

 

<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
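To reproduce the later examples without a network request, the page source can be kept in a string. This is a minimal sketch: the markup is copied from the response printed above, and the variable names demo and soup follow the ones used in these notes.

```python
from bs4 import BeautifulSoup

# Page source copied from the response printed above, so no request is needed.
demo = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")
print(soup.title.string)  # text of the <title> tag
print(soup.a["href"])     # href of the first <a> tag
```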

 

 

 

Using BeautifulSoup

 

 

 

 

 

 

Parsers supported by BeautifulSoup
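The parser is chosen by the second argument to BeautifulSoup. "html.parser" ships with Python, while "lxml" and "html5lib" must be installed separately. The sketch below tries the parsers in a preferred order and falls back when one is missing; the helper name parse_with_fallback and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello<b>world"  # deliberately unclosed tags


def parse_with_fallback(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return (parser_name, soup) for the first parser that is installed."""
    for name in parsers:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is unavailable.
            continue
    raise RuntimeError("no usable parser found")


name, soup = parse_with_fallback(html)
print(name)           # whichever parser was available first
print(soup.b.string)  # every parser closes the <b> tag around "world"
```

Different parsers can build different trees from broken markup (lxml and html5lib wrap the fragment in html and body tags, html.parser does not), so naming the parser explicitly keeps results reproducible.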

 

 

Basic elements of the BeautifulSoup class

 

 

 

 

 

 

https://python123.io/ws/demo.html

 

import requests
from bs4 import BeautifulSoup
kv = {'user-agent':'Mozilla/5.0'}
url = "https://python123.io/ws/demo.html"
r = requests.get(url,headers = kv)
print(r.status_code)
demo = r.text
soup = BeautifulSoup(demo,"html.parser")
print(soup.title)
tag = soup.a  # returns only the first <a> tag
print(tag)

200
<title>This is a python demo page</title>
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

 

print(soup.a.name)
print(soup.a.parent.name)
print(soup.a.parent.parent.name)

a
p
body

 

print(tag.attrs)
print(tag.attrs['href'])
print(type(tag.attrs))  # attrs is a dict
print(type(tag))

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
http://www.icourse163.org/course/BIT-268001
<class 'dict'>
<class 'bs4.element.Tag'>
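Attributes can also be read directly from the tag, which behaves like a dict. A short sketch, assuming a single link copied from the demo page with a hypothetical second class name ("featured") added to show that class is returned as a list:

```python
from bs4 import BeautifulSoup

# "featured" is a made-up class, added only to show the multi-valued attribute.
html = ('<a class="py1 featured" '
        'href="http://www.icourse163.org/course/BIT-268001" '
        'id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]                   # same as tag.attrs["href"]
classes = tag["class"]               # multi-valued attribute, returned as a list
missing = tag.get("title")           # .get() returns None instead of raising KeyError
fallback = tag.get("title", "n/a")   # or a default of your choice

print(href, classes, missing, fallback, tag.has_attr("id"))
```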



tag = soup.a
print(tag)
print(tag.string)
tag1 = soup.p
print(tag1)
print(tag1.string)
tag2 = soup.b
print(tag2)
print(tag2.string)

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
<p class="title"><b>The demo python introduces several python courses.</b></p>
The demo python introduces several python courses.
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.

print(type(tag2.string))

<class 'bs4.element.NavigableString'>
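The output above works because each of those tags has exactly one child, so .string can follow the chain down to the text. When a tag has several children, .string is None, and get_text() is the usual way to collect the text. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first <p> has one child, the second has two.
html = "<p><b>only child</b></p><p>text and <b>a tag</b></p>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

print(first.string)       # only child  (single child, so .string descends into <b>)
print(second.string)      # None        (two children: a string and a <b> tag)
print(second.get_text())  # text and a tag
```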

 

soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
print(soup.b.string)
print(type(soup.b.string))
print(soup.p.string)
print(type(soup.p.string))

This is a comment
<class 'bs4.element.Comment'>
This is not a comment
<class 'bs4.element.NavigableString'>
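Printing a Comment looks the same as printing ordinary text, so code that must skip comments has to check the type. A minimal sketch using the same markup as above; the helper name visible_text is an assumption for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_text(tag):
    """Return the tag's single string, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_text(soup.b))  # None
print(visible_text(soup.p))  # This is not a comment
# Comment is a subclass of NavigableString, so test for Comment first.
print(isinstance(soup.b.string, NavigableString))  # True
```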

 

Traversing HTML content with bs4

<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>

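The markup above forms a tag tree: html contains head and body, head contains title, and body contains two p tags, the first holding a b tag and the second holding two a tags.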
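The tree can be printed from the parsed document. The sketch below walks .children recursively and indents each tag name by depth; the function name tag_tree_lines is an assumption, and the markup is a shortened stand-in for the demo page with the same tag structure.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page: same tags, abbreviated text.
html = (
    "<html><head><title>This is a python demo page</title></head>\n"
    '<body>\n<p class="title"><b>Title text</b></p>\n'
    '<p class="course">Courses:\n'
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>\n'
    "</body></html>"
)


def tag_tree_lines(node, depth=0):
    """Return one indented line per tag under node, skipping text nodes."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(tag_tree_lines(child, depth + 1))
    return lines


soup = BeautifulSoup(html, "html.parser")
tree = tag_tree_lines(soup)
print("\n".join(tree))
```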

Three ways to traverse the tree

Downward traversal

 

 

 

soup = BeautifulSoup(demo,"html.parser")
print(soup.head)
print(soup.head.contents)
print(soup.body.contents)  # returns a list
print(len(soup.body.contents))
print(soup.body.contents[1])

<head><title>This is a python demo page</title></head>
[<title>This is a python demo page</title>]
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
5
<p class="title"><b>The demo python introduces several python courses.</b></p>
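The length is 5 rather than 2 because the newlines between tags are NavigableString children too. When only the tags matter, the strings can be filtered out by type. A sketch on shortened stand-in markup with the same body layout:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in markup: two <p> tags separated by newlines, as in the demo page body.
html = '<body>\n<p class="title"><b>Title</b></p>\n<p class="course">Courses</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

all_children = soup.body.contents
tag_children = [c for c in soup.body.children if isinstance(c, Tag)]

print(len(all_children))                      # 5: '\n', <p>, '\n', <p>, '\n'
print([t["class"][0] for t in tag_children])  # ['title', 'course']
```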

 

for child in soup.body.children:
    print(child)  # iterate over the direct children


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

for child in soup.body.descendants:
    print(child)  # iterate over all descendants

 


<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
.


Upward traversal

 


 

for parent in soup.a.parents:  # iterate over the ancestors of the first <a> tag
   if parent is None:
       print(parent)
   else:
       print(parent.name)

 

p
body
html
[document]

soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag)
print(tag.parent)

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

Note: the parent of soup.html is the BeautifulSoup object itself (soup, whose name is [document]), and soup.parent is None.
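Both ends of the ancestor chain can be checked directly. A minimal sketch on a stripped-down stand-in for the demo page:

```python
from bs4 import BeautifulSoup

# Stand-in markup with the same nesting as the demo page: html > body > p > a.
html = "<html><body><p><a href='#'>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)                 # ['p', 'body', 'html', '[document]']
print(soup.html.parent is soup)  # True: the top-level tag hangs off the soup object
print(soup.parent)               # None: the soup object has no parent
```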


 

 

Sibling traversal

 

 

Sibling traversal only happens among nodes that share the same parent.

 

 

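The NavigableString text between tags also forms nodes in the tag tree, so a node's children or siblings may be NavigableString objects rather than tags. A parent is always a tag or the BeautifulSoup object, because strings cannot contain other nodes.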

soup = BeautifulSoup(demo,"html.parser")
tag = soup.a
print(tag.next_sibling)
print(tag.next_sibling.next_sibling)
print(tag.previous_sibling)
print(tag.previous_sibling.previous_sibling)

 

 

print(tag.parent)

<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

 

 

 

 

and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.

 

 

 

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
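The whole run of siblings can be collected with .next_siblings and .previous_siblings, and a type check separates tags from the text between them. A sketch on a shortened stand-in for the course paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Stand-in for the "course" paragraph: text, <a>, text, <a>, text.
html = ('<p class="course">Courses: '
        '<a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

following = list(tag.next_siblings)      # ' and ', <a id="link2">, '.'
preceding = list(tag.previous_siblings)  # 'Courses: '
tag_siblings = [s for s in following if isinstance(s, Tag)]

for sibling in following:
    print(type(sibling).__name__, repr(str(sibling)))
# The text before the first <a> is the first child, so nothing precedes it.
print(tag.previous_sibling.previous_sibling)  # None
```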

 

Formatted HTML output with bs4
soup = BeautifulSoup(demo,"html.parser")
print(soup.prettify())  # puts each tag and string on its own line for readable output

 

 

bs4 converts every document it reads to Unicode strings and uses UTF-8 as its default output encoding
soup = BeautifulSoup("<p>你好</p>","html.parser")
print(soup.p.string)
print(soup.p.prettify())
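Encoding only matters at the edges: when bytes come in and when bytes go out. The sketch below assumes a GBK-encoded input, passes the from_encoding argument to name it, and then encodes the parsed tag back to bytes in two encodings.

```python
from bs4 import BeautifulSoup

# Assumed input: the same paragraph, but as GBK-encoded bytes.
raw = "<p>你好</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")  # bytes out, in UTF-8
gbk_bytes = soup.p.encode("gbk")     # or in any other encoding

print(text)
print(utf8_bytes)
print(gbk_bytes)
```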

 

 



Posted on 2020-05-17 22:37 by cltt
