python beautifulsoup

beautifulsoup

1. Installation

pip install beautifulsoup4

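If pip cannot install it, download and install it manually:

Download: https://www.crummy.com/software/BeautifulSoup/bs4/download/4.5/
Unpack the archive and run python setup.py install
Copy C:\Program Files\python\Tools\scripts\2to3.py from the Python installation directory into the directory where beautifulsoup was unpacked
Run python 2to3.py -w bs4
Start python in cmd and check that bs4 imports successfully: import bs4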
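
The final check can also be run as a short script. This is a minimal sketch, assuming the package installed correctly and that the built-in "html.parser" backend is used; the sample markup is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# If the import above succeeds, the package is installed.
print(bs4.__version__)

# Parse a tiny, made-up document to confirm the parser works end to end.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```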

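2. Usage

The full documentation for the beautifulsoup4 parser is at: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

beautifulsoup parses HTML into Python objects. You can then traverse the tags, read tag attributes, and search for tags by id, class and similar information.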
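
As a rough illustration of those three tasks, the sketch below parses a small document and then navigates, reads attributes, and searches. The HTML, the id "main" and the class names are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample document.
html = """
<html><head><title>Demo</title></head>
<body>
<div id="main" class="box big"><p class="intro">Hello</p><a href="https://example.com">link</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string            # navigate by tag name
main = soup.find(id="main")               # search by id
main_classes = main["class"]              # class is multi-valued, so this is a list
intro = soup.find("p", class_="intro")    # search by tag name and class
link_href = soup.a["href"]                # read an attribute like a dict entry

print(title_text, main_classes, intro.string, link_href)
```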

2.1 Objects

beautifulsoup has 4 kinds of objects: Tag, NavigableString, BeautifulSoup and Comment. Each kind has its own attributes.

Tag
- name
- attributes
  
  A tag can carry any number of attributes; id and class are the most common ones
```
tag["class"] # value of the class attribute (a list, because class is multi-valued)
tag["id"]    # value of the id attribute
tag.attrs    # dict of all attributes
```
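- string
  If the tag has exactly one child and that child is a string, returns that string (if the only child is itself a tag, that tag's .string is used); if the tag has several children, returns None
- contents
  tag.contents returns a list of the direct children only; it does not recurse into their children.
- children
  tag.children returns an iterator over the same direct children without building a list, which saves memory
- descendants
  tag.descendants returns a generator over all descendants, recursively: the children, their children, and so on
- strings
  tag.strings is a generator over every string inside the tag, with the tags themselves left out
- stripped_strings
  tag.stripped_strings is like strings, but strips leading and trailing whitespace such as newlines and skips strings that contain only whitespace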
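- parent
  Returns the direct parent node
- parents
  Returns all ancestors: the parent, the parent's parent, and so on.
- next_sibling
  Returns the next sibling node
- previous_sibling
  Returns the previous sibling node
- next_siblings
  Returns all following sibling nodes
- previous_siblings
  Returns all preceding sibling nodes
- next_element
  Returns the next node in parse order. This is often the first child rather than the next sibling; a for loop over the whole document follows this order
- previous_element
  Returns the previous node in parse order, which may or may not be the previous sibling; iterating backwards follows this order
- next_elements
  Returns a generator over all nodes that come later in the document. Called on the BeautifulSoup object it yields every node, the same as descendants
- previous_elements
  Returns a generator over all nodes that come earlier in the document
- has_attr("class")/has_attr("id")
  Checks whether the tag has the given attribute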
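NavigableString
tag.string is a NavigableString
Comment
A comment in the document; Comment is a special kind of NavigableString
BeautifulSoup
The object created by BeautifulSoup("...", "html.parser"). It represents the whole document and can mostly be treated as a Tag object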
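
To make the four object types concrete, here is a small sketch. The markup is invented, and "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up markup: one tag holding a string and a comment.
soup = BeautifulSoup('<p id="x" class="a b">text<!-- note --></p>', "html.parser")
p = soup.p

tag_name = p.name                 # "p"
tag_attrs = p.attrs               # {"id": "x", "class": ["a", "b"]}
text_node, comment_node = p.contents

print(type(soup).__name__)          # the whole document
print(type(p).__name__)             # a tag
print(type(text_node).__name__)     # the string inside the tag
print(type(comment_node).__name__)  # the comment inside the tag

# p has two children, so .string cannot pick a single string.
print(p.string)
```

Note that BeautifulSoup is a subclass of Tag and Comment is a subclass of NavigableString, which is why a document object accepts the same navigation and search calls as a tag.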

2.2 Traversal

Use the children, contents, next_elements and next_sibling attributes of Tag/BeautifulSoup to traverse all tags
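
The difference between these attributes is easiest to see side by side. The following is a sketch on a made-up fragment, written without whitespace between tags so that no whitespace-only strings appear among the siblings.

```python
from bs4 import BeautifulSoup, Tag

# Made-up fragment.
html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Direct children only.
direct = [child.name for child in div.children]

# All descendants, recursively; strings are filtered out here.
all_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

# Every string under the tag, with surrounding whitespace stripped.
texts = list(div.stripped_strings)

first_p = div.p
sibling_name = first_p.next_sibling.name      # the second <p>
next_node = first_p.next_element              # the string "one ", not the sibling
ancestors = [t.name for t in soup.b.parents]  # from the nearest parent outwards

print(direct, all_tags, texts, sibling_name, repr(next_node), ancestors)
```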

2.3 Selectors

BeautifulSoup has two kinds of selectors, find and select

find

find_all() returns every matching node; find() returns only the first match, or None if nothing matches

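Filters

Tag name

tag.find_all('div') # returns all div tags

Regular expression on the tag name

tag.find_all(re.compile('t')) # returns all tags whose name contains t (title, html)

List

tag.find_all(['p', 'div'])  # returns all p and div tags

True

tag.find_all(True)  # returns all tags

Function

tag.find_all(lambda tag : tag.has_attr("class") and "xx" in tag["class"])  # class is a list, so test membership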
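
The five filter types can be tried together on one document. This is an illustrative sketch; the HTML and the class names "xx", "yy", "zz" are invented.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample document.
html = """<html><head><title>T</title></head><body>
<div class="xx">a</div><div class="yy">b</div><p class="xx zz">c</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("div")              # tag name
by_regex = soup.find_all(re.compile("t"))   # names containing "t"
by_list = soup.find_all(["p", "div"])       # any name in the list
every_tag = soup.find_all(True)             # all tags, no strings

# A function receives each tag and returns True to keep it.
has_xx = soup.find_all(lambda t: t.has_attr("class") and "xx" in t["class"])

print([t.name for t in by_regex])
print([t.name for t in has_xx])
```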

Parameters

find( name , attrs , recursive , string , **kwargs ) # returns the first matching node
find_all( name , attrs , recursive , string , limit , **kwargs ) # returns a list of tags
find_next( name , attrs , string , **kwargs ) # returns the first matching node after this one
find_all_next( name , attrs , string , limit , **kwargs ) # returns all matching nodes after this one
find_previous( name , attrs , string , **kwargs ) # returns the first matching node before this one
find_all_previous( name , attrs , string , limit , **kwargs ) # returns all matching nodes before this one
find_parent( name , attrs , **kwargs ) # returns the first matching ancestor
find_parents( name , attrs , limit , **kwargs ) # returns all matching ancestors
find_next_sibling( name , attrs , string , **kwargs ) # returns the first matching following sibling
find_next_siblings( name , attrs , string , limit , **kwargs ) # returns all matching following siblings
find_previous_sibling( name , attrs , string , **kwargs ) # returns the first matching preceding sibling
find_previous_siblings( name , attrs , string , limit , **kwargs ) # returns all matching preceding siblings
findChild() # legacy alias of find()
findChildren() # legacy alias of find_all()

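name: any of the filters above (matched against the tag name)

keyword: id, href, attrs, class_

tag.find_all(id="xx", href=re.compile("baidu"), attrs={"data-foo": "value"}, class_=lambda c: c is not None and c.startswith("x"))

recursive: defaults to True; False searches only the direct children

string: matches the string wrapped by a tag, i.e. a NavigableString

tag.find_all(string="abc")
tag.find_all(string=["abc", "xyz"])
tag.find_all(string=re.compile("abc"))      # matches abc and also abcd
tag.find_all(string=re.compile("^abc$"))    # matches only abc
tag.find_all(string=lambda s: s is not None and "abc" in s)

attrs: a dict of attribute filters, useful for attribute names that cannot be written as keyword arguments (for example data-foo)
limit: the maximum number of results to return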
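
The parameters above can be exercised as follows. This is a sketch on invented markup: the ids, classes, the data-role attribute and the URLs are placeholders, not taken from any real page.

```python
import re
from bs4 import BeautifulSoup

# Made-up navigation block with three links, one of them nested in a span.
html = """<div id="nav">
<a id="home" class="link ext" href="https://www.baidu.com/">abc</a>
<a class="link" href="/local" data-role="menu">abcd</a>
<span><a class="link" href="/deep">xyz</a></span>
</div>"""
soup = BeautifulSoup(html, "html.parser")
nav = soup.find(id="nav")

by_href = nav.find_all("a", href=re.compile("baidu"))   # keyword filter with a regex
by_class = nav.find_all("a", class_="link")             # matches any one of the classes
by_attrs = nav.find_all(attrs={"data-role": "menu"})    # name not usable as a keyword
direct_only = nav.find_all("a", recursive=False)        # skips the link inside <span>
first_two = nav.find_all("a", limit=2)                  # stop after two results

exact = nav.find_all(string="abc")                      # exact string match
partial = nav.find_all(string=re.compile("abc"))        # regex search inside the string
anchored = nav.find_all(string=re.compile("^abc$"))     # regex anchored at both ends

print(len(by_class), len(direct_only), partial)
```

With string= and no tag name, the results are the matching strings themselves rather than the tags that contain them.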

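select

BeautifulSoup supports most CSS selectors, https://www.w3.org/TR/CSS22/selector.html
In many cases select is more convenient than the find functions: all constraints go into a single string, and the hierarchy is easy to read

select_one() relates to select() the same way find() relates to find_all().

tag.select(string)

The string is built from tags and tag modifiers joined by combinators.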

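Tags and tag modifiers include the following

tag
#id
.class
[attr=xxx]
:func()
For example, :not(.interfaces) matches elements that do not have the class interfaces; :first-child matches an element that is the first child of its parent and is often combined with >.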

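Combinators

constraint a  combinator  constraint b

No combinator (written together): the element satisfies both a and b
Space: all descendants matching b under an element matching a
>: all direct children matching b under an element matching a
~: all later siblings of an element matching a that match b
+: the sibling immediately after an element matching a, if it matches b
",": elements matching a or b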

tag.select("a#link.sister") # returns tags like <a id="link" class="sister">
# selects <div id="content"><hr><h3 class="interface_toggle"></h3></hr>
# selects <div id="content"><hr><div><h3 class="interface_toggle"></h3></div></hr>
all_group = bs.select('div#content>hr>h3.interface_toggle,div#content>hr>div:not(.interfaces)>h3.interface_toggle')
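
Each combinator can be checked on a small document. The sketch below uses invented markup and sticks to basic selectors; pseudo-classes such as :not() are left out because their support depends on the installed bs4 version.

```python
from bs4 import BeautifulSoup

# Made-up sample document.
html = """<div id="content">
<h3 class="t">first</h3>
<p>intro</p>
<p class="sister">one</p>
<a id="link" class="sister" href="#">two</a>
<ul><li><a class="sister">nested</a></li></ul>
</div>"""
soup = BeautifulSoup(html, "html.parser")

combined = soup.select("a#link.sister")       # tag, id and class on one element
descendants = soup.select("div#content a")    # space: any depth
children = soup.select("div#content > a")     # >: direct children only
later_siblings = soup.select("h3 ~ p")        # ~: all later siblings
adjacent = soup.select("h3 + p")              # +: the immediately following sibling
either = soup.select("h3, ul")                # comma: a or b
by_attr = soup.select('a[href="#"]')          # attribute selector
first_p = soup.select_one("p")                # first match only

print(len(descendants), len(children), len(later_siblings), first_p.string)
```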

find/select

tag.select cannot reach the next sibling starting from tag; tag.find_next_sibling() can

<h2>title</h2>
<div>
</div>

When tag is the h2, select cannot find the div, because select only searches among the descendants of tag.
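
A short sketch of this difference follows. The wrapping section element is an assumption added so that the h2 has a parent to select from.

```python
from bs4 import BeautifulSoup

# Made-up fragment; <section> is added only to give h2 a parent.
soup = BeautifulSoup("<section><h2>title</h2><div>body</div></section>", "html.parser")
h2 = soup.h2

# find_next_sibling starts at h2 and moves sideways.
via_find = h2.find_next_sibling("div")

# select only looks below h2, so the sibling div is out of reach.
from_h2 = h2.select("div")

# One workaround: run select one level up and use the + combinator.
via_parent = h2.parent.select_one("h2 + div")

print(via_find.string, from_h2, via_parent.string)
```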

Examples

Get the tr rows of a table while skipping the tr rows of nested tables, table->tr->td->table->tr

tag.find_all(lambda t : t.name == 'tr' and t.find_parent('tr') is None)

Get the tables while skipping tables nested inside another table

tag.find_all(lambda t : t.name == 'table' and t.find_parent('table') is None)
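
Both filters can be tried on a small nested table. This is a sketch with invented ids "outer" and "inner"; it assumes "html.parser", which does not insert tbody elements.

```python
from bs4 import BeautifulSoup

# Made-up table with a second table nested inside a cell.
html = """<table id="outer">
<tr><td>a</td></tr>
<tr><td><table id="inner"><tr><td>b</td></tr></table></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Rows with no tr ancestor, i.e. rows of the outermost table only.
outer_rows = soup.find_all(lambda t: t.name == "tr" and t.find_parent("tr") is None)

# Tables with no table ancestor, i.e. top-level tables only.
outer_tables = soup.find_all(lambda t: t.name == "table" and t.find_parent("table") is None)

print(len(outer_rows), [t["id"] for t in outer_tables])
```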

posted @ 2023-10-12 14:04 下夕阳


Alex