Python Web Crawling and Information Extraction (2): The BeautifulSoup Library

BeautifulSoup is a library for parsing, traversing, and maintaining HTML and XML documents.

① Installing BeautifulSoup:

At the cmd command line, run: pip install beautifulsoup4

② Importing BeautifulSoup:

from bs4 import BeautifulSoup

The BeautifulSoup library is also known as beautifulsoup4 or bs4.

③ Checking that BeautifulSoup is installed correctly, and using it to parse a page:

The core code for the whole parsing process:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
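A quick way to confirm the install worked is to parse the same fragment and look at what comes back. This is only a minimal sketch; the printed checks are my own addition, not part of the original notes.

```python
from bs4 import BeautifulSoup

# Parse a tiny HTML fragment with the parser that ships with Python.
soup = BeautifulSoup("<p>data</p>", "html.parser")

# If bs4 is installed correctly, the <p> tag and its text are reachable.
print(soup.p)           # the whole <p> tag
print(soup.p.string)    # the text inside the tag
print(soup.prettify())  # the parsed document, re-indented
```

If the import fails with ModuleNotFoundError, the package is not installed in the interpreter being used.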

④ The four parsers available to BeautifulSoup:
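The list itself is missing from these notes. The parser names usually given for bs4 are "html.parser" (built into Python), "lxml" and "xml" (both need the lxml package), and "html5lib" (needs the html5lib package); treat that list as my assumption. The sketch below tries each one and records which are usable on the current machine.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>data</p>"

# Parser names assumed here; every one except "html.parser" needs an extra package.
parsers = ["html.parser", "lxml", "xml", "html5lib"]

available = {}
for name in parsers:
    try:
        BeautifulSoup(markup, name)
        available[name] = True
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed.
        available[name] = False

print(available)
```

Missing parsers can be added with pip install lxml or pip install html5lib.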

⑤ The basic elements of the BeautifulSoup class and how to use them:

At the DOS command prompt:

C:\Users\Administrator>python

>>>import requests

>>>r=requests.get("http://python123.io/ws/demo.html")

>>>r.text

>>>demo=r.text

>>>from bs4 import BeautifulSoup

>>>soup=BeautifulSoup(demo,"html.parser")

>>>print(soup.prettify())

>>>soup.title

>>>tag=soup.a

>>>tag
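The session above needs network access. To try the basic elements (tag, name, attributes, string) offline, the sketch below uses a small inline HTML string as a stand-in. The markup and the example.com links are assumptions made for illustration, not the real content of demo.html.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption): a simplified page, not the actual demo.html.
demo = """<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>Heading text</b></p>
<p class="course">Courses: <a href="http://example.com/basic" class="py1" id="link1">Basic Python</a> and <a href="http://example.com/advanced" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")
tag = soup.a  # soup.<name> returns the first tag with that name

print(tag.name)           # the tag's name
print(tag.parent.name)    # the name of the enclosing tag
print(tag.attrs)          # the tag's attributes as a dict
print(tag.attrs["href"])  # one attribute, looked up by key
print(tag.string)         # the text inside the tag (a NavigableString)
print(type(tag.string))
```

Note that class is a multi-valued attribute, so tag.attrs["class"] is a list rather than a plain string.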

Using Comment:
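The notes give no example here, so the following is a sketch of my own. The .string attribute returns the text of a comment with the comment markers removed, so a comment and ordinary text look the same when printed; checking the type is one way to tell them apart.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup (assumption): one tag holding a comment, one holding text.
soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

b_text = soup.b.string
p_text = soup.p.string

print(b_text, type(b_text))  # the comment text, typed as Comment
print(p_text, type(p_text))  # ordinary text, typed as NavigableString

# Filter on the type when comments should be skipped.
if isinstance(b_text, Comment):
    print("soup.b contains a comment")
```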

⑥ Traversing HTML content with bs4

Downward traversal of the tag tree:

Iterating over child nodes ==> for child in soup.body.children:

    print(child)

Iterating over all descendant nodes ==> for child in soup.body.descendants:

    print(child)
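To make the difference between the two loops concrete, here is a small sketch on inline markup (the HTML string is an assumption chosen for illustration). .children yields only the direct children, while .descendants walks every nested node, text nodes included.

```python
from bs4 import BeautifulSoup

# Illustrative markup (assumption): <body> with two direct children.
html = "<body><p><b>bold</b> text</p><a href='#'>link</a></body>"
soup = BeautifulSoup(html, "html.parser")

children = list(soup.body.children)        # direct children only
descendants = list(soup.body.descendants)  # every node below <body>

print(len(children), [c.name for c in children])
print(len(descendants))
for node in descendants:
    print(repr(node))
```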

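Upward traversal of the tag tree:

Attribute .parent: the parent tag of the node

Attribute .parents: an iterator over the node's ancestor tags, used to loop over the ancestors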
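A short sketch of upward traversal, using inline markup that is assumed for illustration. The last ancestor is the BeautifulSoup object itself, whose name is "[document]", and its own .parent is None.

```python
from bs4 import BeautifulSoup

# Illustrative markup (assumption): a link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)  # the immediate parent of the first <a>

# Walk from the <a> tag up to the top of the tree.
parent_names = []
for parent in soup.a.parents:
    parent_names.append(parent.name)

print(parent_names)
print(soup.parent)  # the document object has no parent
```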

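Sibling (parallel) traversal of the tag tree:

Sibling traversal only happens among nodes that share the same parent node

1) Iterating over the following siblings

for sibling in soup.a.next_siblings:

    print(sibling)

2) Iterating over the preceding siblings

for sibling in soup.a.previous_siblings:

    print(sibling)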
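One thing worth seeing in practice: siblings are not always tags. Text between tags is a node too, so it shows up in the iteration. The sketch below uses inline markup that is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup (assumption): two links separated by plain text.
html = "<p>Intro <a id='link1'>one</a> and <a id='link2'>two</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a  # the first <a> tag

next_sibs = list(first.next_siblings)      # nodes after it, same parent
prev_sibs = list(first.previous_siblings)  # nodes before it, same parent

for sibling in next_sibs:
    print(repr(sibling))
for sibling in prev_sibs:
    print(repr(sibling))
```

To keep only tags, check sibling.name: it is None for text nodes.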

posted @ 2019-07-12 15:02 yyer
