bs

lxml: install it directly from a .whl file

Fast, with good tolerance for malformed documents

html5lib
The most tolerant of malformed markup
Parses the document the way a browser does and produces a valid HTML5 document; slow
soup = BeautifulSoup(html_content, "html5lib")

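The BeautifulSoup constructor accepts either a string or an open file handle.
After parsing with Beautiful Soup, the whole document has been converted to Unicode.
Beautiful Soup uses an encoding auto-detection sub-library to identify the document's encoding and convert it to Unicode. The .original_encoding attribute of the BeautifulSoup object records the encoding that was detected.

We can pass the encoding through the from_encoding parameter so that decoding is faster and more reliable.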

soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup(html_content, "lxml")
soup = BeautifulSoup(content, "lxml", from_encoding='utf-8')
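A minimal sketch of from_encoding and .original_encoding, assuming the built-in "html.parser" so that nothing besides bs4 has to be installed. The byte string and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a page stored as Latin-1 bytes, not UTF-8.
html_bytes = "<html><body><p>Sacré bleu!</p></body></html>".encode("latin-1")

# from_encoding tells Beautiful Soup which encoding to try first,
# so it does not have to guess.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="latin-1")

paragraph_text = soup.p.string           # already a Unicode str
used_encoding = soup.original_encoding   # the encoding that was used to decode

print(paragraph_text)
print(used_encoding)
```

from_encoding only matters for byte input; a str is already Unicode and needs no decoding.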

When Beautiful Soup writes a document out, the output is UTF-8 regardless of the encoding of the input document.
To specify another encoding: print(soup.prettify("latin-1"))
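A short sketch of the output side, again assuming "html.parser" and a made-up one-line document. Without an encoding argument prettify() returns a str; with one, prettify() and encode() return bytes in that encoding.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")

as_text = soup.prettify()                 # str, no encoding applied
as_utf8 = soup.p.encode("utf-8")          # bytes, UTF-8
as_latin1 = soup.p.encode("latin-1")      # bytes, Latin-1
pretty_latin1 = soup.prettify("latin-1")  # bytes, indented output

print(as_utf8)
print(as_latin1)
```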

Parsing only part of a document: the parse_only parameter and the SoupStrainer object
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)
Only the content belonging to <a> tags is produced.
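The lines above can be turned into a small self-contained run. The html_doc content here is invented for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document with both <p> and <a> tags.
html_doc = """
<html><body>
<p>Intro text</p>
<a href="http://example.com/one">one</a>
<p>More text <a href="http://example.com/two">two</a></p>
</body></html>
"""

only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

links = [a["href"] for a in soup.find_all("a")]
paragraph = soup.find("p")  # <p> tags were never added to the tree

print(links)
print(paragraph)
```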

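Improving performance
1. Use lxml directly
2. Use the lxml parser
3. Install cchardet
4. Parsing only part of a document does not save much parsing time, but it saves a lot of memory and makes searching faster.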

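The soup object can be treated as a tag most of the time, but it has no attributes of its own.

attrs: the attributes of this tag
get('id'): returns the id attribute

An attribute can also be read directly: soup.a['href']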
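A small sketch of the three ways to read attributes, using an invented <a> tag. The difference worth noting is that get() returns None for a missing attribute, while the square-bracket form raises KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<a id="link1" class="sister big" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

all_attrs = a.attrs          # dict of every attribute on the tag
link_id = a.get("id")        # single attribute, None if absent
href = a["href"]             # single attribute, KeyError if absent
missing = a.get("title")     # attribute that does not exist
classes = a["class"]         # class is multi-valued, so it comes back as a list

print(all_attrs)
```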

contents: a list of the tag's child nodes
contents[0]: the first child node
children: an iterator over the tag's direct children
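As a quick illustration of contents versus children, here is a sketch on an invented list with no whitespace between tags (whitespace would show up as extra text nodes among the children).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

child_count = len(ul.contents)      # contents is a plain list
first_text = ul.contents[0].string  # first child node
child_names = [child.name for child in ul.children]  # children is an iterator

print(child_count, first_text, child_names)
```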

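find: returns the first match
findAll, find_all: return all matches

get_text, getText, text: return all of the text under the tag

string: returns the text at this level; the tag must have exactly one child node, otherwise None is returned
strings: returns all of the text as an iterator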
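The difference between string, strings and get_text is easiest to see on a tag with mixed children. A minimal sketch on an invented paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

p_string = p.string            # <p> has two children, so this is None
b_string = p.b.string          # <b> has exactly one child, a string
all_strings = list(p.strings)  # every text node under <p>
full_text = p.get_text()       # all text joined together

print(p_string, b_string, all_strings, full_text)
```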

parent: the direct parent node
parents: an iterator that walks upward through all ancestors, one level at a time
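A short sketch of walking upward from a nested tag, on an invented document. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

direct_parent = b.parent.name
ancestors = [ancestor.name for ancestor in b.parents]

print(direct_parent)
print(ancestors)
```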

select

CSS selectors
Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags with CSS selector syntax:
soup.select("title")

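p:nth-child(2) selects every <p> element that is the second child of its parent.
p:nth-of-type(2) selects every <p> element that is the second <p> element of its parent.
bs4 supports only nth-of-type (this holds for versions before 4.7.0; from 4.7.0 on, select() is backed by the Soup Sieve package, which also handles nth-child)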
soup.select("ul li:nth-of-type(3)")

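Search by tag name level by level; intermediate levels may be skipped (descendant selector)
soup.select("body a")

Find the direct children of a tag (child selector)
soup.select("head > title")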

.class #id

Search by CSS class name:
soup.select(".sister")

Search by tag id:
soup.select("#link1")
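The selectors listed above can be tried together on one small document. This is a sketch with invented markup, assuming "html.parser" and a bs4 installation whose select() is available.

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration.
html_doc = """
<html><head><title>Demo</title></head>
<body>
<p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p>
<ul><li>one</li><li>two</li><li>three</li></ul>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.select("title")                   # by tag name
descendants = soup.select("body a")             # any depth below <body>
not_direct = soup.select("body > a")            # <a> is inside <p>, so no match
direct = soup.select("head > title")            # direct child only
by_class = soup.select(".sister")               # by class
by_id = soup.select("#link1")                   # by id
third_li = soup.select("ul li:nth-of-type(3)")  # third <li> in its parent

print(third_li[0].string)
```

select() always returns a list, even when the selector can match at most one element.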

http://www.w3school.com.cn/cssref/css_selectors.ASP

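Beautiful Soup first converts the document to Unicode. If a byte string is passed in, Beautiful Soup assumes it is UTF-8 encoded; pass in a Unicode string instead to avoid decoding errors.

tag.body.li: child tags can be reached directly by using the tag name as an attribute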
tag.next_sibling
tag.previous_sibling
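A minimal sketch of name-based navigation and sibling access. The markup is invented and deliberately has no whitespace between tags, because whitespace between tags is itself a sibling node.

```python
from bs4 import BeautifulSoup

html = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_li = soup.body.li             # first <li> found under <body>
second_li = first_li.next_sibling   # the next node at the same level
before_first = first_li.previous_sibling  # nothing precedes the first <li>
back_again = second_li.previous_sibling

print(first_li.string, second_li.string)
```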

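Kinds of filter accepted by find_all:
1. A string
soup.find_all('a')
2. A regular expression
soup.find_all(re.compile("^b"))
3. A list
soup.find_all(["a", "b"])  # returns every tag that is either <a> or <b>
4. A function
def func(tag):
    # return True to keep the tag, False to skip it
    return tag.has_attr("id")
soup.find_all(func)
5. Attributes
soup.find_all(href=re.compile("elsie"), id='link1')
# both filters must match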

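6. Searching by CSS class
soup.find_all("a", attrs={"class": "sister"})
css_soup.find_all("p", class_="body strikeout")  # the class order must match the document

7. Searching by text (newer versions of bs4 call this argument string; text is the older name)
soup.find_all(text="Elsie")
Combined with a tag name:
soup.find_all("a", text="Elsie")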

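limit: cap the number of results (here, 2)
soup.find_all("a", limit=2)

recursive: controls whether the search descends below the direct children
soup.html.find_all("title", recursive=False)  # searches only the direct children of <html>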
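The find_all filters and options above can be exercised in one run. This sketch uses a cut-down "three sisters" style document made up for illustration, and the string argument, which assumes bs4 4.4.0 or later (older versions use text).

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def has_id(tag):
    # Keep only tags that carry an id attribute.
    return tag.has_attr("id")

by_name = soup.find_all("a")
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = soup.find_all(["a", "b"])
by_func = soup.find_all(has_id)
by_attrs = soup.find_all(href=re.compile("elsie"), id="link1")
by_class = soup.find_all("a", class_="sister")
by_string = soup.find_all("a", string="Elsie")
limited = soup.find_all("a", limit=2)
shallow = soup.html.find_all("title", recursive=False)  # <title> is a grandchild
deep = soup.html.find_all("title")

print(by_regex)
```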

Automatic encoding detection
from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")  # bytes literal: detection only applies to byte input
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

posted @ 2017-01-19 22:42 十年闷油瓶
