[notes] Python网络爬虫1（基于第三方库requests）

随笔

参考Bilibili Up主 @python学习者的教学视频，学习了如何进行网络爬虫，并且尝试爬取了人民数据网上2020年1月1日至今的全部新闻。

URL
1. 网址；统一资源定位符
2. 每个网络资源都有对应的一个URL
网络爬虫 WEB CRAWLER
1. 模拟浏览器自动地浏览网页
2. 自动、批量采集所需资源（文本、视频、图片、文件、音频）
3. 一段脚本
爬虫流程
1. 向浏览器输入URL，游览器解析出IP，浏览器向服务器发出众多http请求。服务器返回http响应（返回的是html源码），浏览器进行解析渲染呈现给用户。
2. 爬虫工程师要对前端熟悉，看得懂JavaScript，CSS，HTML
3. 程序模拟浏览器，拿到URL后发出http请求，收获http响应
4. 403时加一个伪装头（header）
5. Request Method: get / post
开发流程
1. 明确目标数据：网站
2. 分析数据加载流程：数据对应的URL
3. 下载数据
  1. 数据清洗
  2. 数据持久化

2. 初试

爬取流程

进入循环
1. 发送http请求（某日的新闻聚合页面，url_cur），获得服务器响应；
2. 将服务器响应转化成文本，并进行匹配，得到当日页面上全部新闻的url；
  1. 进入循环，逐条请求当日新闻的url；
    1. 获得服务器响应，将html转化成文本；
    2. 匹配标题、作者、正文；
    3. 对正文进行数据清理；
    4. 将内容写入文本文件中；
3. 将下一天新闻聚合页面的url设置为url_cur；
4. 直到下一天新闻聚合页面的url不可用，跳出循环。

代码

import requests
import re

# 爬取人民数据网2020-01-01至2020-02-29的全部新闻
# 初始url指向2020-01-01的网页
url_cur = 'http://data.people.com.cn/rmrb/20200101/1'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36"}

while 1:
    # 打开文件
    f = open('crawled news.txt', 'a')

    # 不加header伪装成客户端，会报403 forbidden
    response = requests.get(url_cur, headers=headers, timeout=30)
    response.encoding = 'utf-8'
    html = response.text

    # 时间
    date = url_cur.split('/')[-2]
    weekday = re.findall(r'<span id="UseWeek">(.*?)</span><br/>', html)[0]
    datetime = date[:4] + '年' + date[4:6] + '月' + date[6:] + '日' + '，' + weekday
    f.write(datetime)
    f.write('\n')

    # 获取当天文章信息集合
    article_info = re.findall(r'<a title="(.*?)" href="(.*?)">.*?</a>', html)
    num_article = len(article_info)
    try:
        url_next = re.findall(r'<li><a href="(.*?)" class="tool_btn_li xq ">下一期</a></li>', html)[0]
        url_next = ''.join(['http://data.people.com.cn', url_next])
    except IndexError:
        break

    # 循环进入每篇新闻爬取内容
    for id, (_, href) in enumerate(article_info):
        article_url = ''.join([url_cur, '/', href.split('/')[-1]])

        # 发出http请求
        article_response = requests.get(article_url, headers=headers, timeout=30)
        article_response.encoding = 'utf-8'
        article_content = article_response.text

        # 匹配信息：标题，作者，正文
        title = re.findall(r'<div class="title">(.*?)</div>', article_content)[0]
        try:
            author = re.findall(r'<div class="author">(.*?)</div>', article_content)[0]
        except IndexError:
            author = 'UNKNOWN'
        body_list = re.findall(r'<p>(.*?)</p>', article_content, re.S)  # 不加re.S，匹配不到内容

        # 数据清洗
        body = '<p>' + '</p><p>'.join(body_list[1:]) + '</p>'
        body = body.replace(' ', '')
        body = body.replace('\u3000', '')
        body = body.replace('\r', '')
        body = body.replace('\n', '')
        body = body.replace('\t', '')
        body = body.replace('&nbsp;', '')
        body = body.replace('&rdquo;', '’')
        body = body.replace('&ldquo;', '‘')
        body = body.replace('&hellip;', '…')

        f.write('[{}/{}]'.format(id+1, num_article))
        f.write(title)
        f.write(author)
        f.write(body)
        f.write('\n')

    # 更新文件，跳转到下一天的页面
    f.close()
    url_cur = url_next

print('done!')

注意事项
1. 网站会检测到发出的http申请是否来自浏览器，如果被识别到，就会返回403。因此需要加上Header来伪装。
2. 部分文章不包含作者信息，进行匹配的时候需要用Try…except语句来避免报错。
3. 清洗数据的时候，注意处理不间断空白符&nbsp，用于缩进的全角空白符\u3000，以及其他全角标点。
4. 注意将服务器响应的编码格式转化为utf-8，否则会出现乱码。
5. 匹配页面html内容（超文本标记语言）的时候，注意正则表达式的匹配规则。
6. re.findall()返回一个列表，就算只有一个元素也是返回一个列表。
爬取结果

posted on 2020-02-29 16:30 LAUSpectrum 阅读(275) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部