Python爬虫〇六———数据解析之beautifulsoup的使用

我们在上一章讲了最直接的索引方法——正则，今天今天讲一个稍微好用一点的数据解析的方法：beautifulsoup4。bs4是在python中独有的一种解析方式，而前面所讲的正则的解析方法，顾名思义，是基于正则表达式的，所以是不限制编程语言的。

通过bs4进行数据解析的流程

按照前面讲过的数据解析原理，就是定位标签和获取便签或者是标签属性中存储的数据值，按照这个思路，bs4的数据解析的流程是这样的：

实例化一个BeautifulSoup对象，并且将页面的源码的数据加载到该对象中。
通过调用BeautifulSoup对象中相关属性和方法进行标签定位和数据提取

bs4环境安装

bs4的安装可以使用pip直接安装，安装后还需要安装一个lxml解析器

pip install bs4 
pip install lxml

在安装过程中可以用-i指定国内的源。

BeautifulSoup支持Python标准库中的HTML解析器，还支持了几种第三方解析器。下面的表格讲的是各种第三方解析气的特点

解析器	使用方法	优势	劣势
Python标准库	`BeautifulSoup(markup, "html.parser")`	Python的内置标准库执行速度适中文档容错能力强	Python 2.7.3 or 3.2.2)前的版本中文档容错能力差
lxml HTML 解析器	`BeautifulSoup(markup, "lxml")`	速度快文档容错能力强	需要安装C语言库
lxml XML 解析器	`BeautifulSoup(markup, ["lxml-xml"])` `BeautifulSoup(markup, "xml")`	速度快唯一支持XML的解析器	需要安装C语言库
html5lib	`BeautifulSoup(markup, "html5lib")`	最好的容错性以浏览器的方式解析文档生成HTML5格式的文档	速度慢不依赖外部扩展

BeautifulSoup的实例化

BeautifulSoup的实例化有两种情况，一个是加载本地的html文档数据，还有一种是加载爬取网上数据。

加载本地html文件

先写一个简单的html文件供后面的案例使用（文件名test.html）

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

实例化本地文件袋方法有两种

方式1

直接使用文件句柄

from bs4 import BeautifulSoup
import requests

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    print(soup.title)
soup = BeautifulSoup(data,'lxml')
print(soup.title)

##########输出##########
<title>
   The Dormouse's story
  </title>

这里用了一个bs4的属性，获文档的head

方式2

第二种方式是先读取文档，再实例化

from bs4 import BeautifulSoup
import requests

with open('./test.html','r',encoding='utf-8') as f:
    data = f.read()
soup = BeautifulSoup(data,'lxml')
print(soup.title)

##########输出##########
<title>
   The Dormouse's story
  </title>

两种方法效果一样，具体使用哪种看个人喜好。

加载爬取内容

爬取内容的加载和前面的第二个方式一样，通过requests模块get到html数据以后直接实例化就行了。

BeautifulSoup对象的处理

BeautifulSoup对象的处理是这一节要讲到重点，还是对上面那个test.html文件来演示，如何通过对数据的解析来了解BeautifulSoup的常规使用方法

在实例化过程中，BeautifulSoup将复杂的HTML文档转换成一个树形结构，树的每一个节点都是一个Python对象，所有的对象都可以归纳为四中

Tag
Navigablestring
BeautifulSoup
Comment

Tag

tag和HTML里的一样，前面的案例中的.title已经用过一次了，可以通过.tag的方式获取到soup对象中的第一个符合要求的tag。tag有很多属性和方法，在遍历文档和搜索文档中会详细讲到，这里主要讲一个，获取tag到属性attributes

from bs4 import BeautifulSoup
import requests

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tag = soup.a
    print(tag.attrs)

##########输出##########
{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}

因为一个标签是可以包含多个属性的，获取到属性是一个字典，如果我们想要获取指定的属性，比如class，就可以用字典的方式(['class'])拿到所需的对象。

多值属性

在HTML5中有些tag的属性是支持多个值的，最常见的就是class属性，那么这个时候返回的就是一个list，即便属性内只有一个值(就像前面的class只有一个sister)返回值也是一个list。

如果某个属性看起来像是有多个值，但在各个版本的HTML中都没有定义为多值属性，那么BeautifulSoup就会把这个值作为一个字符串返回

from bs4 import BeautifulSoup
import requests

st = '<div id="c1 c2">123</div>'
soup = BeautifulSoup(st,'lxml')

tag = soup.div
print(tag.attrs['id'])
##########输出##########
c1 c2

因为id是只有一个值得，所以即便看起来是用空格分割开，返回值也是一个整体的字符串

还有一种情况，是如果我们如果指定xml作为解析器，多值属性会被合并成一个字符串输出

from bs4 import BeautifulSoup
import requests

st = '<div class="c1 c2">123</div>'
soup = BeautifulSoup(st,'xml')

tag = soup.div
print(tag.attrs['class'])

##########输出##########
c1 c2

注意上面在实例化的时候，我指定xml作为解析器。

搜索文档树

因为我们使用bs4最常用的环境就是解析数据，所以对文档树进行搜索是最常用的功能。其中最常用的搜索方法有两种

find()
find_all()

至于其他的方法，参数和用法都类似，举一反三即可

find_all()的使用

find_all（）可以查到文档的内容，但是根据不同的参数有不同的效果

字符串

直接给个字符串，一般都是标签类型，就可以以列表的形式返回所有该类型的标签以及内部内容。还以以前面的html页面为例。

from bs4 import BeautifulSoup
import requests

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    
    print(soup.find_all('a'))

##########输出##########

[<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>]

在上面的例子中，我用find_all来查找所有<a>标签。

正则表达式

在参数中传入正则表达式，BeautifulSoup会按照正则表达式的match()来匹配响应的内容

from bs4 import BeautifulSoup
import re

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    
    tags = soup.find_all(re.compile('d'))
    for tag in tags:
        print(tag.name)
##########输出##########
head
body

上面的代码就是用来获取文档中包含d标签的标签名。

列表

在参数中传入列表，只要匹配列表中任意元素，就将其内容返回

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    print(soup.find_all(['a','b']))
##########输出##########
[<b>
    The Dormouse's story
   </b>, <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>, <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>, <a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>]

上面的案例就是获取所有a标签或者b标签的内容。

Ture

用True可以匹配任何值，可以查到所有的tags

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tags = soup.find_all(True)
    for tag in tags:
        print(tag.name)

##########输出##########
html
head
title
body
p
b
p
a
a
a
p

方法

我们还可以将一个方法传入，注意该方法只能接受一个tag作为参数。如果方法返回值为ture则表示匹配，否则返回false

比如我们要找到同时包含class和id两个属性的tag，所以要先定义一个方法，然后把这个方法作为参数传过去

from bs4 import BeautifulSoup

def has_class_and_id(tag):
    return tag.has_attr('class') and tag.has_attr('id')

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tags = soup.find_all(has_class_but_no_id)
    for tag in tags:
        print(tag）
##########输出##########
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
<a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
<a class="sister" href="http://example.com/tillie" id="link2">
    Tillie
   </a>

在方法中校验了tag是否具有class和id两个属性，如果有，就返回该tag，否则匹配失败。

find_all()还有一些用法的细节，我放在下面讲find()方法的时候讲，二者是一样的,区别就是find_all返回的是一个列表，而find返回的是第一个匹配出来的元素。在没有匹配到对应元素的时候，find_all返回一个空的列表，而find返回值None。

find()方法的使用

find()方法里可以放下面的参数

find(name,attrs,recursive,string.**kwargs)

下面一个个来讲

name参数

直接的name

name参数用于查询名字为name的tag，同样，name还可以使用上面所说的任意一种filter。上面的众多例子都是这种用法，不再举例说明。

keyword参数

如果指定的名字不是搜索的内置的参数名，搜索的时候会吧该参数当做指定名字属性来搜索，比方id，href。

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tag1 = soup.find(href='http://example.com/elsie')
    print('tag1\n',tag1)
    
    tag2 = soup.find(id='link2')
    print('tag2\n',tag2)

##########输出##########
tag1
 <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
tag2
 <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>

基于CSS的搜索(class)

搜索css的时候，由于要用到的class是Python中的关键字，所以要用class_来代替

from bs4 import BeautifulSoup

with open('./test.html','r',encoding='utf-8') as f:
    soup = BeautifulSoup(f,'lxml')
    tag = soup.find(class_='sister')
    print(tag)

##########输出##########
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>

这个方法使用会非常频繁，一定要注意！

name的参数可以配合使用，比如我们想要解析到a标签里id值为link1的标签

soup.find('a',id='link1')

这样的方法是可以的。

string参数

string参数用来搜索文档中匹配到字符串内容，我在这里遇到一个坑：直接使用string=的时候，要完全匹配才可以，包含了换行符的是不行的！

选择器的使用

BeautifulSoup还提供了一个.select方法作为选择器，这个选择器可以简单的作为id/class标签/等各种选择器使用，还可以用作层级选择器

tag = soup.select('.story>a')

上面的代码就是搜索class为story下的所有a标签，注意返回的是一个列表。

层级选择器的使用

便于掩饰，写一个简单的代码,下面的代码里a标签忘记闭合了，不影响效果

s = """<div class='test'>
     <ul>
      <li><span><a>1</span></li>
      <li><span><a>2</span></li>
      <li><span><a>3</span></li>
    </ul>
    </div>
    """
soup = BeautifulSoup(s)

注意层级关系

大于号>表示一个层级，

soup.select('.test>ul>li>span')
##########输出##########
[<span><a>1</a></span>, <span><a>2</a></span>, <span><a>3</a></span>]

空格表示间隔多个层级

此外还有一些别的用法

通过是否存在某个属性来找

soup.select(a['href'])

就是查带有href属性的a标签

通过属性的值来找

tag = soup.select('a[id="link2"]')

获取标签键的文本数据

在了解了上面的方法后我们就可以按要求定位到需要的标签，下面就要获取标签内的文本数据，这里有两个用法

soup.text
soup.string
soup.get_text()
contents

假设我们现在有一段html代码

from bs4 import BeautifulSoup

s = """<div class='test'>
    <span>span标签
        <a>a标签内</a>
    </span>
    </div>
    """
soup = BeautifulSoup(s)

来讲一下上面几种方法的区别

text可以获取标签下面所有的文本内容，返回值为字符串

tag = soup.select_one('span')
print('txt',tag.text)
print('string',tag.string)
##########输出##########
txt span标签
        a标签内

string None

string返回值为none，因为string的返回值是一个Navigablestring，当一个tag内有多个节点存在，string方法是不知道调哪个，所以就会返回一个None。当tag里的节点唯一时就会返回一个值

soup = BeautifulSoup(s)

tag = soup.select_one('a')
print('txt',tag.text)
print('string',tag.string)
print(type(tag.string))
##########输出##########
txt a标签内
string a标签内
<class 'bs4.element.NavigableString'>

其中.text属性和get_text()方法的效果是一样的。

但是有些情况我们指向获取到第一个层级里的内容，用text显然是不方便的，这时候就用到最后一个属性了

tag = soup.select_one('span')
print(tag.contents[0])
##########输出##########
span标签

contents主要是用于讲tag里的子节点以列表的方式输出，这里使用的方法不是其主要功能。

bs4为我们提供了有一个NavigableString类，可以对字符串进行一些操作，这里不在过多说明，可以看官网上的讲解。

使用案例

在大致了解了 bs4的使用方法后，通过两个案例来试一下。

爬取三国演义内容

需求：爬取三国演义小说所有的章节标题和章节内容

url:https://www.shicimingju.com/book/sanguoyanyi.html

下面我们就一步步来试一下，先看一下原页面的html

其实我们就要定位到这个a标签里的链接和后面的章节名称就行了。试一下怎么拿到这些数据

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    print(title_tags)

这时候发现一个问题，爬取出来的中文都是乱码，下面是列表的前两个元素

<a href="/book/sanguoyanyi/1.html">ç¬¬ä¸åÂ·å®´æ¡åè±ªæ°ä¸ç»ä¹  æ©é»å·¾è±éé¦ç«å</a>, 
<a href="/book/sanguoyanyi/2.html">ç¬¬äºåÂ·å¼ ç¿¼å¾·æéç£é®    ä½å½èè°è¯å®¦ç«</a>

看一下页面的html源码,编码是utf8,那是为什么呢？我们可以打印一下resp的响应类型

print(resp.encoding)
##########输出##########
ISO-8859-1

我们在用resp.text属性后，返回的是一个经过unicoding后的数据。那么怎么转换成utf-8呢？

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    resp.encoding = 'utf-8' #修改编码类型
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    print(title_tags[0])
##########输出##########
<a href="/book/sanguoyanyi/1.html">第一回·宴桃园豪杰三结义  斩黄巾英雄首立功</a>

这样就好了，注意修改编码类型的方法，是一个赋值语句而不是调用的方法，经过指定的编码转变后，拿到的数据就正常了。看下定位a标签的方法，是不是比较简单。

为了爬取每章节的具体内容，这里定义一个字典，每个键值对就存章节名称和对应的链接就可以。注意点是href里的链接是一个相对路径，要加上'https://www.shicimingju.com'。

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    
    article_dic = {}
    for title in title_tags:
        article_dic[title.text]='https://www.shicimingju.com'+title['href']
        
    print(article_dic)

上面的代码就是生成字典的过程。下面就要遍历字典，爬取相应的数据即可

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    
    resp = requests.get(url=url)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text,'lxml')
    
    title_tags = soup.select('.book-mulu>ul a')
    
    article_dic = {}
    for title in title_tags:
        article_dic[title.text]='https://www.shicimingju.com'+title['href']
        
        
    with open('三国演义.txt','w',encoding='utf-8') as f:
        for key in article_dic:
            f.write(key)
            url = article_dic[key]
            article_page = requests.get(url = url)
            article_page.encoding = 'utf-8'
            soup = BeautifulSoup(article_page.text,'lxml')
            article = soup.select_one('.chapter_content').text
            f.write(article)
            print(key,'finish')

整个流程就完成了！

posted @ 2021-02-21 01:30 银色的音色阅读(2263) 评论(0) 编辑收藏举报

刷新页面返回顶部

银色的音色

Python爬虫〇六———数据解析之beautifulsoup的使用

公告