爬虫学习第三天

BeautifulSoup

使用BeautifulSoup来解析html页面

import requests

url = "https://python123.io/ws/demo.html"
# 1. 使用requests库
r = requests.get(url)
demo = r.text

#2. 使用BeautifulSoup库
from bs4 import BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())

BeautifulSoup库的基本元素


import requests
from bs4 import BeautifulSoup

url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")
# 打印title
print(soup.title)
# 打印<a>标签的内容
print(soup.a)
# 打印<a>标签的属性
print(soup.a.attrs)
# 获取<a>标签class属性值
print(soup.a.attrs['class'])  # ['py1']
# 获取<a>标签中间的内容
print(soup.a.string)
# 获取<p>标签中间的内容
print(soup.p.string)
# 获取注释内容
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p></b>", "html.parser")
print(newsoup.b.string)

使用bs4遍历HTML内容

HTML的结构,一个树形结构的文本信息,,或者

文本也算结点,可能出现在遍历中。

  1. 下行遍历

  2. 上行遍历

  3. 平行遍历
    平行遍历必须发生在同一个父亲结点下。

基于bs4的HTML格式化输出

使用prettify()方法。

import requests
from bs4 import BeautifulSoup

url = "https://python123.io/ws/demo.html"
r = requests.get(url)
demo = r.text

soup = BeautifulSoup(demo, "html.parser")

# 格式化输出HTML文件
print(soup.prettify())

posted @ 2021-02-19 15:00  sxhyyq  阅读(39)  评论(0编辑  收藏  举报