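Python BeautifulSoup

A flexible and convenient web page parsing library that is efficient and supports multiple parsers. With it you can extract information from web pages without writing regular expressions.

Installing BeautifulSoup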

pip3 install beautifulsoup4

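Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python, moderate speed, tolerant of malformed documents	Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup,"lxml")	Fast, tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup,"xml")	Fast, the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup,"html5lib")	Most tolerant, parses documents the way a browser does, produces valid HTML5	Slow, requires an external Python package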
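
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The following is a minimal sketch of that idea; the helper name make_soup and the sample markup are hypothetical and not part of the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: both parsers repair them.
soup = make_soup("<p>Hello<b>world")
print(soup.p.get_text())
```

Different parsers may build slightly different trees from broken markup, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.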

Basic usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

Tag selectors

Selecting elements

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Getting the tag name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

Getting attributes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
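
Two details of attribute access are easy to miss: multi-valued attributes such as class come back as a list, and indexing a missing attribute raises KeyError while get() returns None. A small sketch with a hypothetical one-line snippet, using html.parser so that lxml is not required:

```python
from bs4 import BeautifulSoup

snippet = '<p class="title lead" name="dromouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name_value = p["name"]    # a plain string
class_value = p["class"]  # a list, because class may hold several values
missing = p.get("id")     # None instead of a KeyError

print(name_value, class_value, missing)
```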

Getting content

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)
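
The example above works because the first p tag wraps a single piece of text. When a tag has several children, .string is None, and get_text() is the usual way to collect all the text. A minimal sketch with a hypothetical shortened snippet:

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Once <a id="link2">Lacle</a> and <a id="link3">Title</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
story = soup.p

single = story.a.string      # the <a> tag has exactly one text child
ambiguous = story.string     # None: <p> has more than one child
all_text = story.get_text()  # concatenates every text node under <p>

print(single, ambiguous, all_text)
```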

Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

Child and descendant nodes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
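
To make the difference between the three properties concrete, here is a small sketch on a hypothetical nested snippet: contents is a list of direct children, children iterates over the same nodes lazily, and descendants walks the whole subtree, text nodes included.

```python
from bs4 import BeautifulSoup, Tag

snippet = "<p>one <b>two <i>three</i></b></p>"
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

direct = p.contents                # list: the text 'one ' and the <b> tag
via_iterator = list(p.children)    # same nodes, produced by an iterator
everything = list(p.descendants)   # 'one ', <b>, 'two ', <i>, 'three'

# Keep only Tag objects to skip the text nodes.
tag_names = [node.name for node in everything if isinstance(node, Tag)]
print(len(direct), len(everything), tag_names)
```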

Parent and ancestor nodes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.parent)

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.p.parents)))
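
Printing whole ancestors produces a lot of output, so it can help to look at tag names only. A minimal sketch on a hypothetical snippet: parent is the single enclosing tag, while parents walks upward to the document root.

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p><a id='link1'>x</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")
a = soup.a

parent_name = a.parent.name
# The last ancestor is the BeautifulSoup object itself, named '[document]'.
ancestor_names = [node.name for node in a.parents]
print(parent_name, ancestor_names)
```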

Sibling nodes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Searches the document by tag name, attributes, or text content.

name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('a'))
print(type(soup.find_all('a')[0]))

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie --></a>,
<a href="http://example.com/lacle" class="sister" id="link2">Lacle</a> and
<a href="http://example.com/title" class="sister" id="link3">Title</a>;
and they lived at the bottom of a well</p>
<p class = "story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for p in soup.find_all('p'):
    print(p.find_all('b'))

attrs

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

text

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
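
In Beautiful Soup 4.4.0 and later the same filter is spelled string; text is the older name for it. The sketch below assumes such a version and uses a hypothetical list snippet to show an exact match, a regular-expression match, and the combination with a tag name, which returns tags instead of text nodes.

```python
import re

from bs4 import BeautifulSoup

snippet = "<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

# Exact match: returns the matching text nodes, not the tags around them.
exact = soup.find_all(string="Foo")
# A compiled pattern is applied as a search against each text node.
by_pattern = soup.find_all(string=re.compile("^Foo"))
# With a tag name, the tags whose .string matches are returned.
tags = soup.find_all("li", string="Bar")

print(exact, by_pattern, tags)
```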

find(name,attrs,recursive,text,**kwargs)

find returns a single element (the first match); find_all returns all matching elements.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
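
As the last print shows, find returns None when nothing matches, so chaining a call onto the result raises AttributeError. A minimal sketch of guarding against that, using a hypothetical snippet and placeholder text:

```python
from bs4 import BeautifulSoup

snippet = '<ul id="list-1"><li>Foo</li></ul>'
soup = BeautifulSoup(snippet, "html.parser")

first_ul = soup.find("ul")
missing = soup.find("page")  # no such tag, so this is None

# Check for None before touching attributes or text.
label = missing.get_text() if missing is not None else "(not found)"
print(first_ul["id"], label)
```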

find_parents() find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
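
Both methods accept the same filters as find_all, which is what distinguishes them from the plain parent and parents properties. A small sketch on a hypothetical nested snippet:

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li>Foo</li></ul>'
    "</div></div>"
)
soup = BeautifulSoup(snippet, "html.parser")
li = soup.find("li")

# Nearest enclosing <ul>.
nearest_ul_id = li.find_parent("ul")["id"]
# Every enclosing <div>, closest first.
div_classes = [div["class"] for div in li.find_parents("div")]
print(nearest_ul_id, div_classes)
```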

find_next_siblings() find_next_sibling()

find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
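
The sibling methods can be sketched on a short hypothetical list. Passing a tag name skips any whitespace text nodes that sit between the tags in real pages.

```python
from bs4 import BeautifulSoup

snippet = "<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

bar = soup.find("li", string="Bar")
after_bar = bar.find_next_sibling("li").get_text()
before_bar = bar.find_previous_sibling("li").get_text()

first = soup.find("li")
rest = [li.get_text() for li in first.find_next_siblings("li")]
print(before_bar, after_bar, rest)
```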

find_all_next() find_next()

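find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node.

find_all_previous() find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node.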
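
Unlike the sibling methods, these four search in document order and are not limited to nodes that share a parent. A minimal sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup

snippet = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul><p>end</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

h4 = soup.find("h4")
first_li_after = h4.find_next("li").get_text()
all_li_after = [li.get_text() for li in h4.find_all_next("li")]

p = soup.find("p")
# Searching backwards returns the closest match first.
first_li_before = p.find_previous("li").get_text()
all_li_before = [li.get_text() for li in p.find_all_previous("li")]
print(all_li_after, all_li_before)
```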

CSS selectors

Pass a CSS selector directly to select() to make the selection.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
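
select() always returns a list. When only the first match is wanted, select_one() plays the role that find() plays for find_all(). The sketch below uses a hypothetical trimmed-down version of the list markup and assumes a Beautiful Soup release in which CSS selection is available (recent versions install the soupsieve package for this).

```python
from bs4 import BeautifulSoup

snippet = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

first_in_list2 = soup.select_one("#list-2 .element").get_text()
# Child combinator: only <li> tags directly under ul.list-small.
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]
nothing = soup.select_one("#no-such-id")  # None when nothing matches
print(first_in_list2, small_items, nothing)
```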

Getting attributes

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Getting content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    <div>
    <div class="pannel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
<div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

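Summary

The lxml parser is recommended; fall back to html.parser when necessary.
Tag selectors offer only weak filtering, but they are fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().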
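
As a rough side-by-side of the three approaches, the sketch below pulls the same list items from a hypothetical snippet with a tag selector, with find_all(), and with select(). It illustrates the trade-off described above and is not a benchmark.

```python
from bs4 import BeautifulSoup

snippet = (
    '<ul id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Tag selector: terse, but only ever reaches the first matching tag.
by_tag = soup.ul.li.get_text()
# find_all: filter by tag name and attributes.
by_find = [li.get_text() for li in soup.find_all("li", class_="element")]
# select: the same query written as a CSS selector.
by_css = [li.get_text() for li in soup.select("#list-1 .element")]
print(by_tag, by_find, by_css)
```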

posted @ 2018-10-18 14:43 蒲群柱
