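A Simple Guide to Using BeautifulSoup

A First Look at Beautiful Soup

  1. Beautiful Soup is a parsing tool that uses the structure and attributes of a web page to parse it (put simply, it is a Python library for parsing HTML and XML).

  2. Beautiful Soup supports several parsers: the Python standard library parser (html.parser), the lxml HTML parser, the lxml XML parser, and html5lib.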
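As a rough illustration of how a parser is chosen, the sketch below passes the parser name as the second argument to the constructor. It uses 'html.parser', the parser from the Python standard library, so no extra package is needed. The sample markup is made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with an unclosed <p> tag.
markup = '<p>Hello'

# 'html.parser' selects the parser from the Python standard library.
# 'lxml', 'xml' (lxml's XML parser) and 'html5lib' are chosen the same way,
# but each of them needs its package installed first.
soup = BeautifulSoup(markup, 'html.parser')

print(soup.p.string)  # Hello
print(soup)           # <p>Hello</p>
```

Different parsers may repair broken markup differently, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.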

Introductory example:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)

# Output:
Hello

Basic Usage of BeautifulSoup

Example:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify(), soup.title.string, sep='\n\n')
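# When BeautifulSoup is initialized, the non-standard HTML is corrected automatically (here the missing </body> and </html> tags are added)
# prettify() outputs the parsed document with standard indentation
# soup.title selects the title node of the HTML; its string attribute then gives the text inside it
# Output: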
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

The Dormouse's story

Node Selectors

  1. Selecting elements

    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    
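    print(soup.title)               # print the selected title node
    print(type(soup.title))         # print the type of soup.title
    print(soup.title.string)        # print the text inside the title node
    print(soup.head)                # print the selected head node
    print(soup.p)                   # print the selected p node (only the first match is returned)
    
    
    # Output: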
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
    <head><title>The Dormouse's story</title></head>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    
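  2. Extracting information

    Notes:
    Use the string attribute to get a node's text.
    Use the name attribute to get a node's name.
    Use attrs to get all of a node's HTML attributes.
    
    from bs4 import BeautifulSoup
    
    html = """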
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    
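    print(soup.title.name)          # select the title node, then read its name attribute
    # Output: title
    print(soup.title.string)        # read the string attribute to get the text of the title node
    # Output: The Dormouse's story
    
    print(soup.p.attrs)             # read attrs to get all attributes of the p node
    # Output: {'class': ['title'], 'name': 'dromouse'}
    
    print(soup.p.attrs['name'])         # get the name attribute through attrs
    # Output: dromouse
    print(soup.p['name'])               # get the name attribute by indexing the node directly
    # Output: dromouse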
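One detail the attribute lookups above do not show is what happens when an attribute is missing. The following is a minimal sketch, assuming a made-up one-line document and the built-in 'html.parser'. It contrasts square-bracket access with the dictionary-style get() method of a Tag.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, parsed with the built-in parser.
soup = BeautifulSoup('<p class="title big" name="dromouse">Hi</p>', 'html.parser')
p = soup.p

classes = p['class']             # class can hold several values, so it is a list
name = p['name']                 # single-valued attributes come back as str
missing = p.get('id')            # get() returns None when the attribute is absent
fallback = p.get('id', 'n/a')    # or a default value, like dict.get()

print(classes)   # ['title', 'big']
print(name)      # dromouse
print(missing)   # None
print(fallback)  # n/a
```

Indexing with p['id'] would raise KeyError here, so get() is the more forgiving choice when an attribute may not exist.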
    
  3. Navigating to related nodes

    1. Children and descendants

      1. The contents attribute returns the direct children (as a list)

        from bs4 import BeautifulSoup
        
        html = """
        <html>
         <head>
          <title>
           The Dormouse's story
          </title>
         </head>
         <body>
          <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
            <!-- Elsie -->
           </a>
           ,
           <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>
           and
           <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
           </a>
           ;
        and they lived at the bottom of a well.
          </p>
          <p class="story">
           ...
          </p>
         </body>
        </html>
        """
        
        soup = BeautifulSoup(html, 'lxml')
        # After selecting a node, read its contents attribute to get its direct children
        print(soup.p.contents)
        
        # Output:
        ['\n   Once upon a time there were three little sisters; and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
        </a>, '\n   ,\n   ', <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>, '\n   and\n   ', <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
           </a>, '\n   ;\nand they lived at the bottom of a well.\n  ']
        # The result is a list whose elements are the direct children of the selected node (grandchildren are not listed separately)
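Because contents mixes text nodes with element nodes, it is often useful to keep only the elements. The sketch below is one possible way to do that, assuming a shortened, made-up version of the sample paragraph and the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened version of the sample paragraph.
html = '<p>one <a id="link1">Elsie</a>, <a id="link2">Lacie</a> end</p>'
soup = BeautifulSoup(html, 'html.parser')

# contents holds NavigableString objects (text) and Tag objects (elements).
children = soup.p.contents
print(len(children))  # 5

# Keep only the element children.
tags = [child for child in children if isinstance(child, Tag)]
ids = [tag['id'] for tag in tags]
print(ids)  # ['link1', 'link2']
```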
        
      2. The children attribute returns the same direct children as contents, but as an iterator rather than a list.

        from bs4 import BeautifulSoup
        
        html = """
        <html>
         <head>
          <title>
           The Dormouse's story
          </title>
         </head>
         <body>
          <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
           </a>
           ,
           <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>
           and
           <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
          </a>
           ;
        and they lived at the bottom of a well.
          </p>
          <p class="story">
           ...
          </p>
         </body>
        </html>
        """
        
        soup = BeautifulSoup(html, 'lxml')
        print(soup.p.children)                          # Output: <list_iterator object at 0x1159b7668>
        for i, child in enumerate(soup.p.children):
            print(i, child)
        
        
        # Output of the for loop:
        0 
           Once upon a time there were three little sisters; and their names were
           
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        2 
           ,
           
        3 <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>
        4 
           and
           
        5 <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
           </a>
        6 
           ;
        and they lived at the bottom of a well.
        
      3. The descendants attribute walks the children recursively and yields all descendant nodes.

        from bs4 import BeautifulSoup
        
        html = """
         <html>
         <head>
          <title>
           The Dormouse's story
          </title>
         </head>
         <body>
          <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
           </a>
           ,
           <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>
           and
           <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
           </a>
           ;
        and they lived at the bottom of a well.
          </p>
          <p class="story">
           ...
          </p>
         </body>
        </html>
        """
        
        soup = BeautifulSoup(html, 'lxml')
        print(soup.p.descendants)                          # Output: <generator object Tag.descendants at 0x1131d0048>
        for i, child in enumerate(soup.p.descendants):
            print(i, child)
        
        
        
        # Output of the for loop:
        0 
           Once upon a time there were three little sisters; and their names were
           
        1 <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        2 
        
        3 <span>Elsie</span>
        4 Elsie
        5 
        
        6 
           ,
           
        7 <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
           </a>
        8 
            Lacie
           
        9 
           and
           
        10 <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
           </a>
        11 
            Tillie
           
        12 
           ;
        and they lived at the bottom of a well.
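The difference between children and descendants is easier to see on a tiny document. This is a minimal sketch with made-up markup, parsed with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: p has a text child and a <b> child that has its own text.
soup = BeautifulSoup('<p>a<b>c</b></p>', 'html.parser')

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(children)     # ['a', <b>c</b>]
print(descendants)  # ['a', <b>c</b>, 'c']
```

The text 'c' appears only among the descendants, because it is a child of b, not of p.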
        
    2. Parent and ancestor nodes

      1. parent returns the direct parent of a node

        from bs4 import BeautifulSoup
        
        html = """
        <html>
         <head>
          <title>
           The Dormouse's story
          </title>
         </head>
         <body>
          <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
           </a>
          </p>
          <p class="story">
           ...
          </p>
         </body>
        </html>
        """
        
        soup = BeautifulSoup(html, 'lxml')
        print(soup.a.parent)
        
        
        # Output:
        <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        
      2. parents returns all ancestor nodes

        from bs4 import BeautifulSoup
        
        html = """
        <html>
         <head>
          <title>
           The Dormouse's story
          </title>
         </head>
         <body>
          <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
            <span>Elsie</span>
           </a>
          </p>
          <p class="story">
           ...
          </p>
         </body>
        </html>
        """
        
        soup = BeautifulSoup(html, 'lxml')
        print(soup.a.parents, type(soup.a.parents), list(enumerate(soup.a.parents)), sep='\n\n')
        
        
        # Output:
        <generator object PageElement.parents at 0x11c76e048>
        
        <class 'generator'>
        
        [(0, <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>), (1, <body>
        <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
           ...
          </p>
        </body>), (2, <html>
        <head>
        <title>
           The Dormouse's story
          </title>
        </head>
        <body>
        <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
           ...
          </p>
        </body>
        </html>), (3, <html>
        <head>
        <title>
           The Dormouse's story
          </title>
        </head>
        <body>
        <p class="story">
           Once upon a time there were three little sisters; and their names were
           <a class="sister" href="http://example.com/elsie" id="link1">
        <span>Elsie</span>
        </a>
        </p>
        <p class="story">
           ...
          </p>
        </body>
        </html>
        )]
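Printing every ancestor in full, as above, gets long quickly. A shorter way to inspect the chain is to print only the node names. The sketch below assumes a made-up minimal document and the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = '<html><body><p><a>Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# parents walks upward from the node to the BeautifulSoup object itself.
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object, which is why the full output above shows the whole document twice.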
        
    3. Sibling nodes

      from bs4 import BeautifulSoup
      
      html = """
      <html>
       <head>
        <title>
         The Dormouse's story
        </title>
       </head>
       <body>
        <p class="story">
         Once upon a time there were three little sisters; and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">
          <span>Elsie</span>
         </a>
         ,
         <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
         </a>
         and
         <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
         </a>
         ;
      and they lived at the bottom of a well.
        </p>
        <p class="story">
         ...
        </p>
       </body>
      </html>
      """
      
      soup = BeautifulSoup(html, 'lxml')
      print(
          # the next sibling node
          {'Next Sibling': soup.a.next_sibling},
          # the previous sibling node
          {'Previous Sibling': soup.a.previous_sibling},
          # all following siblings
          {'Next Siblings': list(enumerate(soup.a.next_siblings))},
          # all preceding siblings
          {'Previous Siblings': list(enumerate(soup.a.previous_siblings))},
      
          sep='\n\n'
      )
      
      
      # Output:
      {'Next Sibling': '\n   ,\n   '}
      
      {'Previous Sibling': '\n   Once upon a time there were three little sisters; and their names were\n   '}
      
      {'Next Siblings': [(0, '\n   ,\n   '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
         </a>), (2, '\n   and\n   '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
         </a>), (4, '\n   ;\nand they lived at the bottom of a well.\n  ')]}
      
      {'Previous Siblings': [(0, '\n   Once upon a time there were three little sisters; and their names were\n   ')]}
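As the output shows, next_sibling often lands on a text node rather than on the next element. A minimal sketch of one way around this, assuming made-up markup and the built-in 'html.parser', is the find_next_sibling() method, which skips to the next sibling that matches a tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links separated by a comma and a newline.
html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
text_sibling = first.next_sibling            # the text node between the links
next_link = first.find_next_sibling('a')     # the next <a> element

print(repr(text_sibling))  # ',\n'
print(next_link['id'])     # link2
```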
      
    4. Extracting information

      from bs4 import BeautifulSoup
      
      html = """
      <html>
       <body>
        <p class="story">
         Once upon a time there were three little sisters; and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
         <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        </p>
       </body>
      </html>
      """
      
      soup = BeautifulSoup(html, 'lxml')
      print(
          'Next Sibling:',
      
          [soup.a.next_sibling],        # the next sibling node
          # \n
          type(soup.a.next_sibling),      # the type of the next sibling node
          # <class 'bs4.element.NavigableString'>
          [soup.a.next_sibling.string],     # the content of the next sibling node
          # \n
          sep='\n'
      )
      
      print(
          'Parent:',
      
          [type(soup.a.parents)],      # parents yields all ancestor nodes
          # <class 'generator'>
          [list(soup.a.parents)[0]],           # the first ancestor node
          # <p class="story"> ... </p>
          [list(soup.a.parents)[0].attrs['class']],        # the value of the first ancestor's class attribute
          # ['story']
          sep='\n'
      )
      
      # Each result is wrapped in a list so that its repr is printed
      
      
      # Output:
      Next Sibling:
      ['\n']
      <class 'bs4.element.NavigableString'>
      ['\n']
      Parent:
      [<class 'generator'>]
      [<p class="story">
         Once upon a time there were three little sisters; and their names were
         <a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
      <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
      </p>]
      [['story']]
      
  4. Nested selection

    from bs4 import BeautifulSoup
    
    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)
    
    # Output:
    <title>The Dormouse's story</title>
    <class 'bs4.element.Tag'>
    The Dormouse's story
    

Method Selectors

find_all(name=None, attrs={}, recursive=True, text=None, limit=None)
  1. Finding all elements that match the given criteria

    from bs4 import BeautifulSoup
    
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='li'),
          type(soup.find_all(name='li')[0]),
          sep='\n\n')
    
    
    # Output:
    [<li class="item-O"><a href="linkl.html">first item</a></li>, <li class="item-1"><a href="link2.html">second item</a></li>, <li class="item-inactive"><a href="link3.html">third item</a></li>, <li class="item-1"><a href="link4.html">fourth item</a></li>, <li class="item-0"><a href="link5.html">fifth item</a>
    </li>]
    
    <class 'bs4.element.Tag'>
    
    
    # The return value is a list of the nodes named "li"; each element is of type bs4.element.Tag
    
    
    # Iterate over the li nodes and find the a nodes inside each one
    from bs4 import BeautifulSoup
    
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    li_nodes = soup.find_all(name='li')
    
    for li in li_nodes:
        print(li.find_all(name='a'))
    
    # Output:
    [<a href="linkl.html">first item</a>]
    [<a href="link2.html">second item</a>]
    [<a href="link3.html">third item</a>]
    [<a href="link4.html">fourth item</a>]
    [<a href="link5.html">fifth item</a>]
    
  2. The attrs parameter

    from bs4 import BeautifulSoup
    
    html = """
    <div>
    <ul>
    <li class="item-O"><a href="linkl.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
    </div>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    
    print(soup.find_all(attrs={'class': 'item-0'}))
    print(soup.find_all(attrs={'href': 'link5.html'}))
    
    
    # Output:
    [<li class="item-0"><a href="link5.html">fifth item</a>
    </li>]
    [<a href="link5.html">fifth item</a>]
    
    # The attrs parameter takes a dict of attributes, so nodes can be looked up by specific attribute values
    # find_all(attrs={'attribute name': 'attribute value', ......})
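Besides the attrs dict, find_all() also accepts common attributes as keyword arguments. The following is a minimal sketch, assuming a made-up list and the built-in 'html.parser'. It shows the class_ and id keywords together with the limit parameter from the signature above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample list.
html = '''
<ul>
<li class="item-0" id="first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0"><a href="link3.html">third item</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# class is a reserved word in Python, so the keyword is spelled class_.
by_class = soup.find_all(class_='item-0')
by_id = soup.find_all(id='first')
# limit caps the number of results that are returned.
limited = soup.find_all('li', limit=2)

print(len(by_class), len(by_id), len(limited))  # 2 1 2
```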
    
  3. The text parameter

    from bs4 import BeautifulSoup
    import re
    
    html = """
    <div class="panel">
    <div class="panel-body">
    <a>Hello, this is a link</a>
    <a>Hello, this is a link, too</a>
    <div/>
    <div/>
    """
    
    soup = BeautifulSoup(html, 'lxml')
    
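    # compiled regular expression object
    regular = re.compile('link')
    
    # The text parameter matches the text of nodes; it accepts either a string or a compiled regular expression
    print(soup.find_all(text=regular))
    
    # Plain regex matching on the raw HTML, for comparison
    print(re.findall(regular, html))
    
    
    # Output: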
    ['Hello, this is a link', 'Hello, this is a link, too']
    ['link', 'link']
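In Beautiful Soup 4.4.0 and later, the same argument is also available under the name string, and text is kept as the older spelling. The sketch below assumes such a version, made-up markup, and the built-in 'html.parser'. It also shows that combining a tag name with string returns the tags instead of the text nodes.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with one matching and one non-matching link.
html = '<div><a>Hello, this is a link</a><a>Goodbye</a></div>'
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile('link')

# Without a tag name, the matching text nodes themselves are returned.
texts = soup.find_all(string=pattern)
print(texts)  # ['Hello, this is a link']

# With a tag name, the tags whose text matches are returned.
tags = soup.find_all('a', string=pattern)
print(tags)   # [<a>Hello, this is a link</a>]
```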
    

Note:

find(name=None, attrs={}, recursive=True, text=None)
# Returns only the first element that matches the given criteria
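A point worth keeping in mind is that find() returns None when nothing matches, whereas find_all() returns an empty list. A minimal sketch, assuming made-up markup and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Hypothetical sample list.
html = '<ul><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')   # the first match only
missing = soup.find('table') # no <table> in the document

print(first_li)  # <li>Foo</li>
print(missing)   # None

# Guard before chaining, since None has no attributes such as .string.
text = missing.string if missing is not None else ''
```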

CSS Selectors

  1. Beautiful Soup also provides CSS selectors; just call the select() method

  2. CSS selector reference: http://www.w3school.com.cn/cssref/css_selectors.asp

  3. Method

    select(selector, namespaces=None, limit=None)
    
  4. A simple example

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    
    ul_all = soup.select('ul')
    print(ul_all)
    
    for ul in ul_all:
        print()
        print(
            ul['id'],
    
            ul.select('li'),
    
            sep='\n'
        )
    
    
    # Output:
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    
    list-1
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    
    list-2
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
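The example above only selects by tag name. The sketch below tries a few other common selector forms on a shortened, made-up version of the same markup, parsed with the built-in 'html.parser'. It also uses select_one(), which returns just the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample markup.
html = '''
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Jay</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# '#list-1 .element': nodes with class "element" inside the node with id "list-1".
in_first = [li.get_text() for li in soup.select('#list-1 .element')]
print(in_first)  # ['Foo', 'Bar']

# 'ul.list-small li': li nodes inside a ul that has the class "list-small".
in_small = [li.get_text() for li in soup.select('ul.list-small li')]
print(in_small)  # ['Jay']

# select_one returns only the first match, or None if nothing matches.
first_text = soup.select_one('li').get_text()
print(first_text)  # Foo
```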
    
  5. Getting attributes

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    
    ul_all = soup.select('ul')
    print(ul_all)
    
    for ul in ul_all:
        print()
        print(
            ul['id'],
    
            ul.attrs['id'],
    
            sep='\n'
        )
    
    # Both indexing with square brackets and going through attrs return the attribute value
    
    # Output:
    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    
    list-1
    list-1
    
    list-2
    list-2
    
  6. Getting text

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    
    li_all = soup.select('li')
    print(li_all)
    
    for li in li_all:
        print()
        print(
            'Text via get_text(): ' + li.get_text(),
    
            'Text via string: ' + li.string,
    
            sep='\n'
        )
    
    
    # Output:
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    
    Text via get_text(): Foo
    Text via string: Foo
    
    Text via get_text(): Bar
    Text via string: Bar
    
    Text via get_text(): Jay
    Text via string: Jay
    
    Text via get_text(): Foo
    Text via string: Foo
    
    Text via get_text(): Bar
    Text via string: Bar
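The two approaches agree here only because every li holds a single piece of text. As a minimal sketch of where they differ, assuming made-up markup and the built-in 'html.parser', consider a node with more than one child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: li has a text child and a <b> child.
html = '<li class="element">Foo <b>Bar</b></li>'
soup = BeautifulSoup(html, 'html.parser')
li = soup.li

full_text = li.get_text()  # concatenates the text of all descendants
only_string = li.string    # None, because li has more than one child
inner = li.b.string        # b has a single text child, so this works

print(full_text)    # Foo Bar
print(only_string)  # None
print(inner)        # Bar
```

In such a case, string concatenation with li.string, as in the loop above, would raise a TypeError, so get_text() is the safer choice when the node structure is not known in advance.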
    
posted @ 2019-07-10 12:35  LeeHua