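A First Look at Python Web Crawlers

Preparation

0x01 What Crawlers Are and Why They Matter

a. Introduction

A crawler is a program that automatically fetches data from the internet. It is one of the basic techniques for collecting web data.

b. Value

Crawlers quickly extract valuable information from the web.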

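0x02 Development Environment

a. Environment checklist

Python 3.7
Development platform: macOS, Windows, or Linux
Editor: PyCharm
Page download: requests (2.21.0)
Page parsing: BeautifulSoup/bs4 (4.11.2)
Dynamic page download: Selenium (3.141.0)

b. Environment test

Create a new Python package named test
In that package, create a new Python file named test_env

The test code is as follows: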

import requests
from bs4 import BeautifulSoup
import selenium

print("OK!")

If OK! is printed, the modules are installed correctly.

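Basics

0x03 A Simple Crawler Architecture and Execution Flow

Crawler scheduler (start, stop)
Crawler architecture (three main modules)

graph LR
  A(URL manager)--URL-->B(Page downloader)
  B--HTML-->C(Page parser)
  C-.URL.->A
1. URL manager
  
  Manages URLs and prevents repeated crawling
2. Page downloader
  
  Downloads page content
3. Page parser
  
  Extracts valuable data and new URLs to crawl
Valuable data (the crawler's output)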
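
The loop formed by the three modules can be sketched without any network access. This is only a minimal illustration under stated assumptions: FAKE_SITE, download, parse, and crawl are hypothetical names, and the "pages" are plain tuples standing in for real HTML.

```python
# Hypothetical in-memory "site": URL -> (page text, outgoing links)
FAKE_SITE = {
    "page1": ("first page", ["page2", "page3"]),
    "page2": ("second page", ["page1", "page3"]),
    "page3": ("third page", []),
}


def download(url):
    """Downloader stand-in: return the raw page for a URL."""
    return FAKE_SITE[url]


def parse(page):
    """Parser stand-in: split a page into valuable data and new URLs."""
    text, links = page
    return text, links


def crawl(root_url):
    """Scheduler: run the manager -> downloader -> parser loop."""
    new_urls, old_urls = [root_url], set()   # URL manager state
    results = []
    while new_urls:
        url = new_urls.pop(0)
        old_urls.add(url)
        data, links = parse(download(url))
        results.append((url, data))
        for link in links:
            # Only queue URLs that were never seen before
            if link not in old_urls and link not in new_urls:
                new_urls.append(link)
    return results


print(crawl("page1"))
```

Even though page2 links back to page1, each page is visited once, which is the job of the URL manager.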

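0x04 URL Manager

a. Introduction

Purpose: manage the URLs being crawled and prevent repeated or circular crawling
Public interface
- Take out one URL to crawl
- Add a new URL to crawl
Implementation logic
- On take-out, the URL's state changes to crawled
- On add, check whether the URL already exists
Data storage
- Python memory
  - URLs to crawl: set
  - Crawled URLs: set
- redis
  - URLs to crawl: set
  - Crawled URLs: set
- MySQL
  - urls(url, is_crawled)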
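
The table-based storage option could look roughly like the sketch below. It swaps in the standard-library sqlite3 module in place of MySQL so that it runs without a server; the class name SqliteUrlStore, the column types, and the method names are assumptions made for illustration.

```python
import sqlite3


class SqliteUrlStore:
    """URL store backed by a urls(url, is_crawled) table (sqlite3 stand-in)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        if not url:
            return
        # The primary key turns a duplicate insert into a no-op
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def get_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Taking a URL out marks it as crawled
        self.conn.execute("UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT COUNT(*) FROM urls WHERE is_crawled = 0"
        ).fetchone()
        return row[0] > 0
```

A single is_crawled flag replaces the two sets: a row with 0 is waiting to be crawled, a row with 1 has already been taken out.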

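b. Implementation

Create a new Python package named utils
In that package, create a new Python file named url_manager

Because the manager has to expose an interface to other code, it is wrapped in a class: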

class UrlManager:
    """
    URL manager
    """

    # Initializer
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    # Add one new URL
    def add_new_url(self, url):
        # Skip empty values
        if url is None or len(url) == 0:
            return
        # Skip duplicates (membership test, so use "in", not "is")
        if url in self.new_urls or url in self.old_urls:
            return
        # Add
        self.new_urls.add(url)

    # Add URLs in bulk
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # Take out one URL to crawl and mark it as crawled
    def get_url(self):
        if self.has_new_url():
            url = self.new_urls.pop()
            self.old_urls.add(url)
            return url
        else:
            return None

    # Check whether any URL is still waiting to be crawled
    def has_new_url(self):
        return len(self.new_urls) > 0


# Test code
if __name__ == "__main__":
    url_manager = UrlManager()

    # Test adding URLs
    url_manager.add_new_url("url1")
    url_manager.add_new_urls(["url1", "url2"])
    print(url_manager.new_urls, url_manager.old_urls)

    # Test taking URLs out
    print("=" * 20)   # separator line
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    new_url = url_manager.get_url()
    print(url_manager.new_urls, url_manager.old_urls)

    print("=" * 20)
    print(url_manager.has_new_url())

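0x05 Page Downloader (requests)

a. Introduction

Website: python-requests
Install: pip install requests
Description:

Requests is an elegant and simple HTTP library for Python, built for human beings.

Crawlers commonly use it to download page content
Execution flow

graph LR
  A(Python program<br/>requests library)--request-->B(Web server)
  B--response-->A

b. Sending a request

requests.get/post(url, params, data, headers, timeout, verify, allow_redirects, cookies)

url: URL of the target page to download
params: dict that sets the query parameters after the URL, e.g. ?id=123&name=xxx
data: dict or string, usually the data submitted with the POST method
headers: sets request headers such as User-Agent and Referer
timeout: timeout, in seconds
verify: Boolean, whether to verify the HTTPS certificate; defaults to True. It can also be set to the path of a CA bundle
allow_redirects: Boolean, whether requests follows redirects; defaults to True
cookies: local cookie data to send with the request

url, data, headers, and timeout are the most commonly used parameters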
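
To see how params and headers end up in the outgoing request, the request can be built without being sent. This is a small sketch, not part of the original walkthrough: the example.com URL, the header values, and the fetch helper are assumptions chosen for illustration.

```python
import requests

# Hypothetical target and values, used only for illustration
url = "https://example.com/search"
params = {"id": 123, "name": "xxx"}
headers = {"User-Agent": "Mozilla/5.0 (demo)", "Referer": "https://example.com/"}

# Build the request without sending it, to inspect what would go on the wire
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()
print(prepared.url)                     # params are encoded into the query string
print(prepared.headers["User-Agent"])   # headers are attached as given


def fetch(url, params=None, headers=None):
    """Send the real request; timeout is in seconds."""
    return requests.get(url, params=params, headers=headers, timeout=3)
```

Calling fetch(url, params, headers) would perform the actual download with the same settings.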

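c. Receiving the response

res = requests.get/post(url)

res.status_code: the status code
res.encoding: the current encoding; assign to it to change the encoding

(requests infers the encoding from the response headers; for a text/* Content-Type that declares no charset it falls back to ISO-8859-1)
res.text: the returned page content as text
res.headers: the HTTP headers of the response
res.url: the URL that was actually accessed
res.content: the content as bytes, e.g. when downloading an image
res.cookies: the cookie data the server wants to store locally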
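
The effect of the ISO-8859-1 fallback can be reproduced with plain bytes, with no request involved. The sample text below is an assumption; with a real response, the usual remedy is to assign res.encoding (for example "utf-8", or the value of res.apparent_encoding) before reading res.text.

```python
# Bytes a server might send for a UTF-8 page (the sample text is an assumption)
raw = "爬虫".encode("utf-8")

# Decoding with the ISO-8859-1 fallback gives garbled text, not an error
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 recovers the original text
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Because the wrong decoding does not raise an exception, garbled output is often the only sign that the encoding was guessed incorrectly.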

d. Usage demo

Install ipython from cmd with: python -m pip install ipython

Start ipython from cmd with: ipython

In [1]: import requests
In [2]: url = "https://www.cnblogs.com/SRIGT"
In [3]: res = requests.get(url)
In [4]: res.status_code
Out[4]: 200
In [5]: res.encoding
Out[5]: 'utf-8'
In [6]: res.url
Out[6]: 'https://www.cnblogs.com/SRIGT'

0x06 Page Parser (BeautifulSoup)

a. Introduction

Website: Beautiful Soup: We called him Tortoise because he taught us.
Install: pip install beautifulsoup4
Description: a third-party Python library for extracting data from HTML
Usage: import bs4 or from bs4 import BeautifulSoup

b. Syntax

graph LR
  HTML(HTML page)-->A(Create a BeautifulSoup object)
  A-->B(Search nodes<br/>find_all, find)
  B-.->B1(By node name)
  B-.->B2(By attribute value)
  B-.->B3(By node text)
  B-->C(Access nodes<br/>name, attributes, text)

Creating a BeautifulSoup object

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from an HTML document
soup = BeautifulSoup(
    html_doc,                # HTML document string
    'html.parser',           # HTML parser
    from_encoding='utf-8'    # document encoding (only used when html_doc is bytes)
)

Searching for nodes

# find_all(name, attrs, string)
# Find all nodes whose tag is a
soup.find_all('a')

# Find all a nodes whose href is /xxx/index.html
soup.find_all('a', href='/xxx/index.html')

# Find all div nodes with class abc and text python
soup.find_all('div', class_='abc', string='python')

Accessing node information

# Given the node: <a href='1.html'>Python</a>
# Get the tag name of the node
node.name
# Get the href attribute of the a node
node['href']
# Get the link text of the a node
node.get_text()

c. Usage demo

Target page (saved as test.html)

<html>
    <head>
        <meta charset="utf-8">
        <title>Page title</title>
    </head>
    <body>
        <h1>Heading 1</h1>
        <h2>Heading 2</h2>
        <h3>Heading 1</h3>
        <h4>Heading 1</h4>
        <div id="content" class="default">
            <p>Paragraph</p>
            <a href="http://www.baidu.com">Baidu</a>
            <a href="http://www.cnblogs.com/SRIGT">My blog</a>
        </div>
    </body>
</html>

Test code

from bs4 import BeautifulSoup

with open("./test.html", 'r', encoding='utf-8') as fin:
    html_doc = fin.read()

soup = BeautifulSoup(html_doc, "html.parser")
div_node = soup.find("div", id="content")
print(div_node)
print()

links = div_node.find_all("a")
for link in links:
    print(link.name, link["href"], link.get_text())

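img = div_node.find("img")
# find() returns None when nothing matches (the sample page has no <img>),
# so check before indexing to avoid a TypeError
if img is not None:
    print(img["src"])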

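Hands-on

0x07 A Simple Example

import requests
from bs4 import BeautifulSoup

url = "http://www.crazyant.net/"

# Download the page and stop if the server does not return 200
r = requests.get(url)
if r.status_code != 200:
    raise Exception(f"unexpected status code: {r.status_code}")

html_doc = r.text

soup = BeautifulSoup(html_doc, "html.parser")

# Find every <h2 class="entry-title"> node
h2_nodes = soup.find_all("h2", class_="entry-title")

# Print the link and link text inside each heading
for h2_node in h2_nodes:
    link = h2_node.find("a")
    print(link["href"], link.get_text())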

0x08 Crawling All Blog Pages

Root domain: 蚂蚁学Python

Article page URL format: PyCharm开发PySpark程序的配置和实例 – 蚂蚁学Python

Attaching a cookie dict to a requests request

import requests
cookies = {
    "captchaKey": "14a54079a1",
    "captchaExpire": "1548852352"
}
r = requests.get(
    "http://url",
    cookies = cookies
)

Fuzzy matching with regular expressions

url1 = "http://www.crazyant.net/123.html"
url2 = "http://www.crazyant.net/123.html#comments"
url3 = "http://www.baidu.com"

import re
# Escape the dots so that "." only matches a literal dot
pattern = r'^http://www\.crazyant\.net/\d+\.html$'

print(re.match(pattern, url1))
print(re.match(pattern, url2))
print(re.match(pattern, url3))
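
In a regular expression an unescaped . matches any character, so it can be worth comparing a pattern with and without escaped dots. The sketch below does that; the fourth URL is a made-up look-alike added only to show the difference, and the names loose and strict are illustrative.

```python
import re

urls = [
    "http://www.crazyant.net/123.html",
    "http://www.crazyant.net/123.html#comments",
    "http://www.baidu.com",
    "http://wwwXcrazyant.net/123.html",   # hypothetical look-alike
]

loose = r'^http://www.crazyant.net/\d+.html$'        # dots match any character
strict = r'^http://www\.crazyant\.net/\d+\.html$'    # dots match only "."

loose_hits = [u for u in urls if re.match(loose, u)]
strict_hits = [u for u in urls if re.match(strict, u)]

print(loose_hits)
print(strict_hits)
```

Both patterns reject the #comments variant because of the trailing $, but only the strict pattern rejects the look-alike host.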

Crawling every page

from utils import url_manager
from bs4 import BeautifulSoup
import requests
import re

root_url = "http://www.crazyant.net"

urls = url_manager.UrlManager()
urls.add_new_url(root_url)

# Escape the dots so that "." only matches a literal dot
pattern = r'^http://www\.crazyant\.net/\d+\.html$'

# "with" closes the file even if a request raises an exception
with open("craw_all_pages.txt", "w", encoding="utf-8") as file:
    while urls.has_new_url():
        curr_url = urls.get_url()
        r = requests.get(curr_url, timeout=3)
        if r.status_code != 200:
            print("error, return status_code is not 200", curr_url)
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        # Some pages may have no <title>; fall back to an empty string
        title = soup.title.string if soup.title else ""

        file.write("%s\t%s\n" % (curr_url, title))
        file.flush()
        print("success: %s, %s, %d" % (curr_url, title, len(urls.new_urls)))

        # Queue every link that looks like an article page
        links = soup.find_all("a")
        for link in links:
            href = link.get("href")
            if href is None:
                continue
            if re.match(pattern, href):
                urls.add_new_url(href)
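
The crawl above only queues hrefs that already are absolute article URLs. If a site used relative links or #fragment suffixes, one possible extension is to normalize each href before matching. The following is a sketch under that assumption, using only the standard library; normalize, article_urls, and ARTICLE_PATTERN are hypothetical names.

```python
import re
from urllib.parse import urldefrag, urljoin

ARTICLE_PATTERN = re.compile(r'^http://www\.crazyant\.net/\d+\.html$')


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def article_urls(base_url, hrefs):
    """Return the normalized hrefs that look like article pages, in order."""
    found = []
    for href in hrefs:
        url = normalize(base_url, href)
        if ARTICLE_PATTERN.match(url) and url not in found:
            found.append(url)
    return found


print(article_urls(
    "http://www.crazyant.net/",
    ["/123.html", "http://www.crazyant.net/123.html#comments", "http://www.baidu.com"],
))
```

With this step, /123.html and the #comments variant collapse into the same article URL, so the page would be queued only once.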

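0x09 Crawling the Douban Movie Top 250

❗This list currently has anti-scraping measures in place❗

Steps:

Fetch the pages with requests

Parse the data with BeautifulSoup

Write the data to Excel with pandas

Imports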

import requests
from bs4 import BeautifulSoup
import pandas as pd

Downloading the HTML of all 10 pages

# Build the list of paging offsets
page_indexs = range(0, 250, 25)
list(page_indexs)

def download_all_htmls():
    """
    Download the HTML of every list page for later analysis
    """
    htmls = []
    for idx in page_indexs:
        url = f"https://movie.douban.com/top250?start={idx}&filter="
        print("craw html: ", url)
        r = requests.get(url)
        if r.status_code != 200:
            raise Exception("error")
        htmls.append(r.text)
    return htmls

# Run the crawl
htmls = download_all_htmls()

Parsing the HTML to get the data

def parse_single_html(html):
    """
    Parse a single HTML page and return its data
    @return list({"rank", "title", "rating_star", "rating_num", "comments"})
    """
    soup = BeautifulSoup(html, 'html.parser')
    article_items = (
        soup.find("div", class_="article")
            .find("ol", class_="grid_view")
            .find_all("div", class_="item")
    )
    datas = []
    for article_item in article_items:
        rank = article_item.find("div", class_="pic").find("em").get_text()
        info = article_item.find("div", class_="info")
        title = info.find("div", class_="hd").find("span", class_="title").get_text()
        stars = (
            info.find("div", class_="bd")
                .find("div", class_="star")
                .find_all("span")
        )
        rating_star = stars[0]["class"][0]
        rating_num = stars[1].get_text()
        comments = stars[3].get_text()

        datas.append({
            "rank": rank,
            "title": title,
            "rating_star": rating_star.replace("rating", "").replace("-t", ""),
            "rating_num": rating_num,
            "comments": comments.replace("人评价", "")
        })
    return datas


import pprint

pprint.pprint(parse_single_html(htmls[0]))

all_datas = []
for html in htmls:
    all_datas.extend(parse_single_html(html))

print(all_datas)
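
Every field produced by the parser is a string. If numeric columns are wanted in the spreadsheet, the fields could be converted first. The sketch below is one way to do that under stated assumptions: the helper names and the sample record are made up for illustration, and the input is assumed to hold the raw strings as they come from the page, before any replace() cleanup.

```python
import re


def clean_rating_star(css_class):
    """Strip the 'rating' prefix and '-t' suffix from a star class name."""
    return css_class.replace("rating", "").replace("-t", "")


def clean_comments(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def clean_record(record):
    """Return a copy of one raw record with numeric fields converted."""
    return {
        "rank": int(record["rank"]),
        "title": record["title"],
        "rating_star": clean_rating_star(record["rating_star"]),
        "rating_num": float(record["rating_num"]),
        "comments": clean_comments(record["comments"]),
    }


# Made-up sample values, not real data from the site
sample = {
    "rank": "1",
    "title": "Example",
    "rating_star": "rating5-t",
    "rating_num": "9.7",
    "comments": "12345人评价",
}
print(clean_record(sample))
```

Using a regular expression for the comment count avoids depending on the exact suffix text that follows the number.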

Saving the results to Excel

df = pd.DataFrame(all_datas)
df.to_excel("TOP250.xlsx")

End

posted @ 2023-03-09 16:01 SRIGT
