[python3 - package] BeautifulSoup
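1. Installation - pip install beautifulsoup4 (the PyPI package is named beautifulsoup4 and is imported as bs4)
2. Official documentation - https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. Common object types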
    from bs4 import BeautifulSoup

    bsObj = BeautifulSoup('<p style="float:left">Chapter 1</p>', 'html.parser')  # BeautifulSoup object
    tagObj = bsObj.p           # Tag object
    navStrObj = tagObj.string  # NavigableString object
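The three object types can be checked directly. The following is a minimal sketch, assuming bs4 is installed; the variable names soup, tag and text are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p style="float:left">Chapter 1</p>', 'html.parser')
tag = soup.p        # first <p> in the document
text = tag.string   # the single text node inside <p>

print(type(soup).__name__)  # BeautifulSoup
print(type(tag).__name__)   # Tag
print(type(text).__name__)  # NavigableString

# A Tag exposes its name and attributes.
print(tag.name)      # p
print(tag['style'])  # float:left

# NavigableString subclasses str, so it compares equal to a plain string.
print(isinstance(text, str))  # True
print(text == 'Chapter 1')    # True
```

Because NavigableString keeps a reference to the parse tree, converting it with str(text) is a reasonable habit when the value is stored outside the soup.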
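4. Common API
- bsObj.findAll(tag, attrs, recursive, string, limit, **kwargs)
  - Normally pass at least one filter: tag, attrs, string, or a keyword argument (a keyword names an attribute of the tag, e.g. href, id). Called with no arguments, it returns every tag in the document.
  - The tag filter can be given positionally as a bare tag name; every other filter has to be written as name=value, e.g. attrs={...}, string=..., id=...
  - Return - a list (ResultSet) of Tag objects (a list of NavigableString objects if only string is given)
  - findAll is the older spelling of find_all; the two behave the same
  - Difference from bsObj.find()
    - find(...) is equivalent in meaning to findAll(..., limit=1)
    - but the return type differs: find() returns a single Tag object rather than a list, or None if nothing matches
    - The result is the first matching element in the HTML source, which is not necessarily the first one seen on the rendered page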
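- bsObj.tag
  - Return - the first occurrence of that tag (e.g. bsObj.p), or None if the document has no such tag
- bsObj.tag.get_text()
  - Return - the text contained in the tag, as a plain str
  - If the tag contains other tags, the returned string also includes the text of those child tags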
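- bsObj.tag.children / bsObj.tag.descendants
  - Return - an iterator rather than a list; wrap it in list() to index it. It yields NavigableString text nodes as well as Tag objects.
  - These are attributes of a Tag object, so they can be used on the result of find()
  - children yields only the direct children; descendants yields all nested nodes at every depth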
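- bsObj.tag.next_sibling(s) / bsObj.tag.previous_sibling(s) / bsObj.tag.parent(s)
  - Plural - Return a generator over all such nodes, nearest first
  - Singular - Return the nearest single node, or None if there is none. A sibling may be a NavigableString (such as the whitespace between tags) rather than a Tag.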
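The return types listed above can be sketched in one short script. This is a minimal illustration, not taken from the notes: the HTML snippet and variable names are made up for the example, and the markup is written without whitespace between tags so that siblings are tags rather than text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div id="box"><p>One <b>bold</b></p><p>Two</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns one Tag (or None); find_all(limit=1) returns a one-element list.
first = soup.find('p')
limited = soup.find_all('p', limit=1)
missing = soup.find('table')
print(limited[0] is first)  # True
print(missing)              # None

# get_text() includes the text of nested tags.
print(first.get_text())     # One bold

# children: direct children only; descendants: every nested node, text included.
div = soup.div
children = list(div.children)
descendants = list(div.descendants)
tag_descendants = [node for node in descendants if isinstance(node, Tag)]
print(len(children))         # 2
print(len(descendants))      # 6
print(len(tag_descendants))  # 3

# Singular navigation gives one node or None; plural gives a generator.
print(first.next_sibling)      # <p>Two</p>
print(first.previous_sibling)  # None
parent_names = [parent.name for parent in soup.b.parents]
print(parent_names)            # ['p', 'div', '[document]']
```

The last entry in parent_names is the BeautifulSoup object itself, which sits at the top of the tree and reports its name as [document].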
5. Sample
HTML
    <html>
    <body>
    <span class="red yellow">Story1</span>
    <span class="green">Story2</span>
    <span class="red">Story3</span>
    <span class="green" id="four">Story4</span>
    </body>
    </html>
Python
    from bs4 import BeautifulSoup

    # Assumes html_doc is a str holding the HTML above.
    bsObj = BeautifulSoup(html_doc, 'html.parser')

    data = bsObj.findAll('span')
    # [<span class="red yellow">Story1</span>, <span class="green">Story2</span>,
    #  <span class="red">Story3</span>, <span class="green" id="four">Story4</span>]

    # Match two attributes at once
    data = bsObj.findAll(attrs={'id': 'four', 'class': 'green'})
    # [<span class="green" id="four">Story4</span>]

    # Match several values of one attribute as an exact string; the order must match too
    data = bsObj.findAll(attrs={'class': 'red yellow'})
    # [<span class="red yellow">Story1</span>]
    data = bsObj.findAll(attrs={'class': 'yellow red'})
    # []

    # With only the string argument, a list of NavigableString objects is returned
    data = bsObj.findAll(string='Story1')
    # ['Story1']
    data = bsObj.findAll(string=['Story1', 'Story2'])
    # ['Story1', 'Story2']

    # class is a reserved word in Python, so the keyword is written class_
    data = bsObj.findAll(class_='green')
    # [<span class="green">Story2</span>, <span class="green" id="four">Story4</span>]

    # Match tags that have a given attribute at all
    data = bsObj.findAll(id=True)
    # [<span class="green" id="four">Story4</span>]

    # Text contained in the tag
    data = bsObj.findAll(id=True)[0].get_text()
    # 'Story4' (a plain str here, not a NavigableString)
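Since find() returns None when nothing matches, calling get_text() on its result without a check can raise AttributeError. The sketch below reuses the sample HTML; first_text is a hypothetical helper introduced only for illustration and is not part of bs4.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="red yellow">Story1</span>
<span class="green">Story2</span>
<span class="red">Story3</span>
<span class="green" id="four">Story4</span>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')


def first_text(soup, **filters):
    """Hypothetical helper: text of the first match, or None if nothing matches."""
    tag = soup.find(**filters)
    return tag.get_text() if tag is not None else None


print(first_text(soup, class_='green'))  # Story2
print(first_text(soup, id='missing'))    # None

# A single class name matches any tag that carries it among its classes,
# so both Story1 ("red yellow") and Story3 ("red") are found.
red_texts = [tag.get_text() for tag in soup.find_all(class_='red')]
print(red_texts)  # ['Story1', 'Story3']
```

Returning None from the helper mirrors what find() itself does, so callers can keep one style of check.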