BeautifulSoup 的简单使用

Beautiful Soup初了解

  1. 解析工具Beautiful Soup,借助网页的结构和属性等特性来解析网页(简单的说就是python的一个HTML或XML的解析库)

  2. Beautiful Soup支持的解析器有很多:Python标准库、lxml HTML解析器、lxmlXML解析器、html5lib

实例引入:

复制
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
# 输出:
Hello

BeautifulSoup 的基本用法

实例引入:

复制
from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify(), soup.title.string, sep='\n\n')
# 初始化BeautifulSoup时,自动更正了不标准的HTML
# prettify()方法可以把要解析的字符串以标准的缩进格式输出
# soup.title 可以选出HTML中的title节点,再调用string属性就可以得到里面的文本了
复制
# 输出:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
The Dormouse's story

结点选择器

  1. 选择元素

    复制
    from bs4 import BeautifulSoup
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title) # 打印输出title节点的选择结果
    print(type(soup.title)) # 输出soup.title类型
    print(soup.title.string) # 输出title节点的内容
    print(soup.head) # 打印输出head节点的选择结果
    print(soup.p) # 打印输出p节点的选择结果
    # 输出:
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
  2. 提取信息

    复制
    说明:
    调用string属性获取文本的值
    利用那么属性获取节点的名称
    调用attrs获取所有HTML节点属性
    复制
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name) # 选取title节点,然后调用name属性获得节点名称
    # 输出:title
    print(soup.title.string) # 调用string属性,获取title节点的文本值
    # 输出:The Dormouse's story
    print(soup.p.attrs) # 调用attrs,获取p节点的所有属性
    # 输出:{'class': ['title'], 'name': 'dromouse'}
    print(soup.p.attrs['name']) # 获取name属性
    # 输出:dromouse
    print(soup.p['name']) # 获取name属性
    # 输出:dromouse
  3. 关联选择

    1. 子节点和子孙节点

      1. contents属性获取直接子结点(生的的是列表)

        复制
        from bs4 import BeautifulSoup
        html = """
        <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
        </a>
        ,
        <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        ;
        and they lived at the bottom of a well.
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        """
        soup = BeautifulSoup(html, 'lxml')
        # 选取节点元素之后,可以调用contents属性获取它的直接子节点
        print(soup.p.contents)
        # 输出:
        ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
        </a>, '\n ,\n ', <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>, '\n ;\nand they lived at the bottom of a well.\n ']
        # 返回结果是一个列表,列表中的元素是所选节点的直接子节点(不包括孙节点)
      2. children属性,返回结果是生成器类型。与contents属性一样,只是返回结果类型不同。

        复制
        from bs4 import BeautifulSoup
        html = """
        <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        ,
        <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        ;
        and they lived at the bottom of a well.
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        """
        soup = BeautifulSoup(html, 'lxml')
        print(soup.p.children) # 输出:<list_iterator object at 0x1159b7668>
        for i, child in enumerate(soup.p.children):
        print(i, child)
        # for 循环的输出结果:
        0
        Once upon a time there were three little sisters; and their names were
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        2
        ,
        3 <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        4
        and
        5 <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        6
        ;
        and they lived at the bottom of a well.
      3. descendants属性会递归查询所有子节点,得到所有子孙节点。

        复制
        from bs4 import BeautifulSoup
        html = """
        <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        ,
        <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        ;
        and they lived at the bottom of a well.
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        """
        soup = BeautifulSoup(html, 'lxml')
        print(soup.p.descendants) # 输出:<generator object Tag.descendants at 0x1131d0048>
        for i, child in enumerate(soup.p.descendants):
        print(i, child)
        # for 循环输出结果:
        0
        Once upon a time there were three little sisters; and their names were
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        2
        3 <span>Elsie</span>
        4 Elsie
        5
        6
        ,
        7 <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        8
        Lacie
        9
        and
        10 <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        11
        Tillie
        12
        ;
        and they lived at the bottom of a well.
    2. 父节点和祖先节点

      1. parent获取某个节点的一个父结点

        复制
        from bs4 import BeautifulSoup
        html = """
        <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        """
        soup = BeautifulSoup(html, 'lxml')
        print(soup.a.parent)
        # 输出:
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
      2. parent获取所有祖先结点

        复制
        from bs4 import BeautifulSoup
        3 html = """
        <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        """
        soup = BeautifulSoup(html, 'lxml')
        print(soup.a.parents, type(soup.a.parents), list(enumerate(soup.a.parents)), sep='\n\n')
        # 输出:
        <generator object PageElement.parents at 0x11c76e048>
        <class 'generator'>
        [(0, <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>), (1, <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
        ...
        </p>
        </body>), (2, <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>), (3, <html>
        <head>
        <title>
        The Dormouse's story
        </title>
        </head>
        <body>
        <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
        ...
        </p>
        </body>
        </html>
        )]
    3. 兄弟节点

      复制
      from bs4 import BeautifulSoup
      html = """
      <html>
      <head>
      <title>
      The Dormouse's story
      </title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsie</span>
      </a>
      ,
      <a class="sister" href="http://example.com/lacie" id="link2">
      Lacie
      </a>
      and
      <a class="sister" href="http://example.com/tillie" id="link3">
      Tillie
      </a>
      ;
      and they lived at the bottom of a well.
      </p>
      <p class="story">
      ...
      </p>
      </body>
      </html>
      """
      soup = BeautifulSoup(html, 'lxml')
      print(
      # 获取下一个兄弟元素
      {'Next Sibling': soup.a.next_sibling},
      # 获取上一个兄弟元素
      {'Previous Sibling': soup.a.previous_sibling},
      # 返回后面的兄弟元素
      {'Next Siblings': list(enumerate(soup.a.next_siblings))},
      # 返回前面的兄弟元素
      {'Previous Siblings': list(enumerate(soup.a.previous_siblings))},
      sep='\n\n'
      )
      # 输出:
      {'Next Sibling': '\n ,\n '}
      {'Previous Sibling': '\n Once upon a time there were three little sisters; and their names were\n '}
      {'Next Siblings': [(0, '\n ,\n '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
      Lacie
      </a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
      Tillie
      </a>), (4, '\n ;\nand they lived at the bottom of a well.\n ')]}
      {'Previous Siblings': [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]}
    4. 提取信息

      复制
      from bs4 import BeautifulSoup
      html = """
      <html>
      <body>
      <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      </p>
      </body>
      </html>
      """
      soup = BeautifulSoup(html, 'lxml')
      print(
      'Next Sibling:',
      [soup.a.next_sibling], # 获取上一个兄弟节点
      # \n
      type(soup.a.next_sibling), # 上一个兄弟节点的类型
      # <class 'bs4.element.NavigableString'>
      [soup.a.next_sibling.string], # 获取上一个兄弟节点的内容
      # \n
      sep='\n'
      )
      print(
      'Parent:',
      [type(soup.a.parents)], # 获取所有的祖先节点
      # <class 'generator'>
      [list(soup.a.parents)[0]], # 获取第一个祖先节点
      # <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      </p>
      [list(soup.a.parents)[0].attrs['class']], # 获取第一个祖先节点的"class属性"的值
      # ['story']
      sep='\n'
      )
      # 为了输出返回的结果,均以列表形式
      # 输出:
      Next Sibling:
      ['\n']
      <class 'bs4.element.NavigableString'>
      ['\n']
      Parent:
      [<class 'generator'>]
      [<p class="story">
      Once upon a time there were three little sisters; and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      </p>]
      [['story']]
  4. 嵌套选择

    复制
    from bs4 import BeautifulSoup
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)
    # 输出:
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story

方法选择器

复制
find_all(name=None, attrs={}, recursive=True, text=None, limit=None)
  1. 查询所有符合条件的元素

    复制
    from bs4 import BeautifulSoup
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='li'),
    type(soup.find_all(name='li')[0]),
    sep='\n\n')
    # 输出:
    [<li class="item-O"><a href="linkl.html">first item</a></li>, <li class="item-1"><a href="link2.html">second item</a></li>, <li class="item-inactive"><a href="link3.html">third item</a></li>, <li class="item-1"><a href="link4.html">fourth item</a></li>, <li class="item-0"><a href="link5.html">fifth item</a>
    </li>]
    <class 'bs4.element.Tag'>
    # 返回值是一个列表,列表的元素是名为"li"的节点,每个元素都是bs4.element.Tag类型
    # 遍历每个a节点
    from bs4 import BeautifulSoup
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    soup = BeautifulSoup(html, 'lxml')
    li = soup.find_all(name='li')
    for a in li:
    print(a.find_all(name='a'))
    # 输出:
    [<a href="linkl.html">first item</a>]
    [<a href="link2.html">second item</a>]
    [<a href="link3.html">third item</a>]
    [<a href="link4.html">fourth item</a>]
    [<a href="link5.html">fifth item</a>]
  2. attires 参数

    复制
    from bs4 import BeautifulSoup
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'class': 'item-0'}))
    print(soup.find_all(attrs={'href': 'link5.html'}))
    # 输出:
    [<li class="item-0"><a href="link5.html">fifth item</a>
    </li>]
    [<a href="link5.html">fifth item</a>]
    # 可以通过attrs参数传入一些属性来进行查询,即通过特定的属性来查询
    # find_all(attrs={'属性名': '属性值', ......})
  3. text 参数

    复制
    from bs4 import BeautifulSoup
    import re
    html = """
    <div class="panel">
    <div class="panel-body">
    <a>Hello, this is a link</a>
    <a>Hello, this is a link, too</a>
    <div/>
    <div/>
    """
    soup = BeautifulSoup(html, 'lxml')
    # 正则表达式规则对象
    regular = re.compile('link')
    # text参数课用来匹配节点的文本,传入的形式可以是字符串,也可以是正则表达式对象
    print(soup.find_all(text=regular))
    # 正则匹配输出
    print(re.findall(regular, html))
    # 输出:
    ['Hello, this is a link', 'Hello, this is a link, too']
    ['link', 'link']

说明:

复制
find(name=None, attrs={}, recursive=True, text=None)
# 仅返回与给定条件匹配标记的第一个元素

CSS选择器

  1. Beautiful Soup 提供了CSS选择器,调用select()方法即可

  2. css选择器用法:http://www.w3school.com.cn/cssref/css_selectors.asp

  3. 方法

    复制
    select(selector, namespaces=None, limit=None)
  4. 简单实例

    复制
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    ul_all = soup.select('ul')
    print(ul_all)
    for ul in ul_all:
    print()
    print(
    ul['id'],
    ul.select('li'),
    sep='\n'
    )
    # 输出:
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    list-1
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    list-2
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
  5. 获取属性

    复制
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    ul_all = soup.select('ul')
    print(ul_all)
    for ul in ul_all:
    print()
    print(
    ul['id'],
    ul.attrs['id'],
    sep='\n'
    )
    # 直接传入中括号和属性名 或者 通过attrs属性获取属性值 都可以成功获得属性值
    # 输出:
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    list-1
    list-1
    list-2
    list-2
  6. 获取文本

    复制
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    ul_all = soup.select('li')
    print(ul_all)
    for li in ul_all:
    print()
    print(
    'get_text()方法获取文本:'+li.get_text(),
    'string属性获取文本:'+li.string,
    sep='\n'
    )
    # 输出:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    get_text()方法获取文本:Foo
    string属性获取文本:Foo
    get_text()方法获取文本:Bar
    string属性获取文本:Bar
    get_text()方法获取文本:Jay
    string属性获取文本:Jay
    get_text()方法获取文本:Foo
    string属性获取文本:Foo
    get_text()方法获取文本:Bar
    string属性获取文本:Bar
posted @   LeeHua  阅读(285)  评论(0编辑  收藏  举报
编辑推荐:
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
· 没有源码,如何修改代码逻辑?
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)
点击右上角即可分享
微信分享提示

目录导航