GB Standard Document Crawler and Downloader

Background

Recently I wanted to download a few national standard (GB) documents on software testing. After searching for quite a while I found the site https://0ppt.com,
but its search function for GB documents does not seem to work, so I decided to simply crawl the site instead.

Approach

Analyze the site structure

Inspect the requests with the F12 developer tools

I expected the page to load its data through some API, but after digging through the requests for a long time I found no endpoint that returned the content. By locating the page elements directly, I finally discovered that the information I wanted is embedded right in the HTML.

Settle on a plan

Since the content is already in the HTML, it is enough to request each page URL, take the returned HTML, and then parse out the pieces I want.
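As a quick sanity check of that plan, fetching one listing page and eyeballing the raw HTML confirms the entries are there. A minimal sketch; the page URL follows the index_<n>.html pattern used later, and headers are omitted since the site does not require them:

import requests

# Fetch one listing page and look at the raw HTML: the standard titles and links
# are plain <a> tags in the page source, no separate data API involved.
res = requests.get('https://0ppt.com/bz/index_927.html')
res.encoding = 'GBK'      # the site serves GBK-encoded pages (explained below)
print(res.text[:2000])    # eyeball the markup to find the content area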

Implementation

First, import the required modules

import requests                 # makes the HTTP requests for the key data
import re                       # matches the key information out of the HTML
from time import sleep          # throttle the crawl so it is not too fast
from bs4 import BeautifulSoup   # originally planned to parse with BS4, but regex turned out to be more convenient
import shelve                   # local cache, so retries can quickly skip items that were already processed

Build the request headers

These headers were added in case the site rejected bare requests. It turned out the site requires no login and imposes no restrictions, so strictly speaking they are not needed. The headers dict itself was generated with the help of a large language model, which was quite convenient.

headers = {
  "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
  # "Accept-Encoding": "gzip, deflate, br, zstd",
  "Accept-Language": "zh-CN,zh;q=0.9",
  "Connection": "keep-alive",
  "Cookie": "HMACCOUNT_BFESS=D1D8258E03E0A558; BDUSS_BFESS=dKNzJKaFhQQWFvcEpSZG9oRE5YR0Zod1l-VHE3ZVFLfnJTZWNJT3JKbGdiT3BsRVFBQUFBJCQAAAAAAAAAAAEAAAB~qcIBZmxvd2ludGhlcml2ZXIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGDfwmVg38JlMT; H_WISE_SIDS_BFESS=40008_40206_40212_40215_40080_40364_40352_40367_40374_40401_40311_40301_40467_40461_40471_40456_40317; BAIDUID_BFESS=A6E2AF276F85EFFB50804B65078FB44D:FG=1; ZFY=hyR2bKIUFoz76hVFPIVRUUHYScV4SOFL0yQP0ASJu4k:C",
  # "Host": "hm.baidu.com",
  "Referer": "https://0ppt.com/",
  # "Sec-Ch-Ua": "\"Chromium\";v=\"124\", \"Microsoft Edge\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
  # "Sec-Ch-Ua-Mobile": "?0",
  # "Sec-Ch-Ua-Platform": "\"Windows\"",
  # "Sec-Fetch-Dest": "image",
  # "Sec-Fetch-Mode": "no-cors",
  # "Sec-Fetch-Site": "cross-site",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
}
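Since the site imposes no real restrictions, a much smaller header set would most likely work just as well; a minimal sketch (an assumption, not verified against every page; the name minimal_headers is only illustrative):

minimal_headers = {
    # A plain browser User-Agent and the site as Referer are usually enough for such sites.
    "Referer": "https://0ppt.com/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
}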

Implement the method that fetches the file list

  • The regex matching here is fairly simple: locate the content area first, then capture exactly the fields needed from it.

  • In the raw response the Chinese text came back garbled, so one key step is res.encoding = 'GBK': manually assign 'GBK' to the response's encoding attribute before reading res.text. 'utf-8' was tried first and did not work; switching to 'GBK' fixed it.

  • Two patterns are matched because the site's markup is not consistent: the entries appear in two different HTML structures, and both need to be covered.

def getcontent_list(html):
    res = requests.get(html,headers=headers)
    res.encoding = 'GBK'
    html_content= res.text

    # print(html_content)

    repat = re.compile(r'<a href="(https.+?)".*?title="(.*?)">.*?</a>')
    repat1 = re.compile(r'<a href="(https.+?)".*?target="_blank">(.*?)</a>')
    result = repat.findall(html_content)
    result1 = repat1.findall(html_content)
    return result + result1
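A quick usage sketch of this function (the page number is just one value from the range crawled later; each entry comes back as a (url, title) tuple):

# Fetch one listing page and look at the first few entries it yields.
entries = getcontent_list('https://0ppt.com/bz/index_45.html')
for link, title in entries[:5]:
    print(title, '->', link)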

Implement the method that extracts the download link

  • Based on the (url, title) entries returned above, this step requests each detail page and extracts the actual download link for the file.
  • Some titles contain special characters that prevent the file from being saved (Windows restricts which characters may appear in a file name), so they are replaced.
  • name = re.sub(r'[\\/∕:*]', '-', name) handles all the offending characters in one regex instead of chained replace() calls; note that the backslash still has to be escaped inside the character class. (A fuller character set is sketched after the function below.)
def downloadfiles(fileinfo):
    url = fileinfo[0]
    name = fileinfo[1]
    download_url_info = requests.get(url,headers=headers).text
    # print(download_url_info)
    repat = re.compile(r'<a href="(https.*?)" target="_blank" rel="nofollow" '
                       r'class="bz-down-button">在线预览</a>')
    download_url = repat.findall(download_url_info)
    # name = name.replace('/','-').replace('∕','-').replace(':', '-').replace('*','-')  # '/' cannot appear in a Windows file name, so replace it
    name = re.sub(r'[\\/∕:*]', '-', name)  # the regex version handles all the offending characters in one pass

    return download_url[0], name
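Windows actually forbids a few more characters than the ones replaced above, so if other titles trip the same problem the character class can simply be widened. A sketch that is a superset of the replacements used here (not tested against every title on the site):

# Replace every character Windows disallows in file names (\ / : * ? " < > |),
# plus the full-width slash that appears in some titles on this site.
name = re.sub(r'[\\/∕:*?"<>|]', '-', name)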

Implement the download method

  • Downloading is simply requesting the download link with requests and writing the response to a file in binary mode.
  • In practice the write sometimes stalled for a long time, possibly because of the network or because the files are large, so the download is done in chunks. The key is setting stream=True in res = requests.get(url,headers=headers,stream=True), and then writing the file chunk by chunk with for chunk in res.iter_content(chunk_size=16384). Here chunk_size=16384 is the chunk size in bytes (16384 bytes = 16 KB); tune it to the situation: smaller if memory is tight or the network is slow, larger otherwise. (A slightly hardened variant is sketched after the function below.)
def download_file(url,name):
    # Skip files that the local cache says were already downloaded.
    with shelve.open('download_list') as cache:
        if url in cache:
            # print(f'{name} already downloaded')
            return
    print(f'{name} -- downloading from {url}')
    res = requests.get(url,headers=headers,stream=True)   # stream=True so the body is read in chunks
    if res.status_code == 200:
        with open(f'{name}.pdf','wb') as f:
            for chunk in res.iter_content(chunk_size=16384):  # 16384 bytes = 16 KB per chunk
                f.write(chunk)
        print(f'{name} downloaded')
        # Record the URL so a rerun can skip this file.
        with shelve.open('download_list') as cache:
            cache[url] = name
    else:
        print(f'{name} download failed')
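If the stalls turn out to be network-related, adding a timeout makes a dead connection fail fast instead of hanging; a hedged variant of the same streaming download (download_file_safe and its parameter values are only illustrative, not part of the original script):

def download_file_safe(url, name, chunk_size=16384, timeout=30):
    # Same chunked download as above, but a timeout turns a stalled connection into an error.
    res = requests.get(url, headers=headers, stream=True, timeout=timeout)
    if res.status_code != 200:
        print(f'{name} download failed: HTTP {res.status_code}')
        return
    with open(f'{name}.pdf', 'wb') as out:
        for chunk in res.iter_content(chunk_size=chunk_size):
            if chunk:                # skip keep-alive chunks
                out.write(chunk)
    print(f'{name} downloaded')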

Full code

"""
author:babyfengfjx
"""
import requests
import re
from time import sleep
from bs4 import BeautifulSoup
import shelve
headers = {
  "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
  # "Accept-Encoding": "gzip, deflate, br, zstd",
  "Accept-Language": "zh-CN,zh;q=0.9",
  "Connection": "keep-alive",
  "Cookie": "HMACCOUNT_BFESS=D1D8258E03E0A558; BDUSS_BFESS=dKNzJKaFhQQWFvcEpSZG9oRE5YR0Zod1l-VHE3ZVFLfnJTZWNJT3JKbGdiT3BsRVFBQUFBJCQAAAAAAAAAAAEAAAB~qcIBZmxvd2ludGhlcml2ZXIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGDfwmVg38JlMT; H_WISE_SIDS_BFESS=40008_40206_40212_40215_40080_40364_40352_40367_40374_40401_40311_40301_40467_40461_40471_40456_40317; BAIDUID_BFESS=A6E2AF276F85EFFB50804B65078FB44D:FG=1; ZFY=hyR2bKIUFoz76hVFPIVRUUHYScV4SOFL0yQP0ASJu4k:C",
  # "Host": "hm.baidu.com",
  "Referer": "https://0ppt.com/",
  # "Sec-Ch-Ua": "\"Chromium\";v=\"124\", \"Microsoft Edge\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
  # "Sec-Ch-Ua-Mobile": "?0",
  # "Sec-Ch-Ua-Platform": "\"Windows\"",
  # "Sec-Fetch-Dest": "image",
  # "Sec-Fetch-Mode": "no-cors",
  # "Sec-Fetch-Site": "cross-site",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0"
}
def getcontent_list(html):
    # https://0ppt.com/bz/index_927.html

    res = requests.get(html,headers=headers)
    res.encoding = 'GBK'
    html_content= res.text

    # print(html_content)

    repat = re.compile(r'<a href="(https.+?)".*?title="(.*?)">.*?</a>')
    repat1 = re.compile(r'<a href="(https.+?)".*?target="_blank">(.*?)</a>')
    result = repat.findall(html_content)
    result1 = repat1.findall(html_content)
    return result + result1

def downloadfiles(fileinfo):
    url = fileinfo[0]
    name = fileinfo[1]
    download_url_info = requests.get(url,headers=headers).text
    # print(download_url_info)
    repat = re.compile(r'<a href="(https.*?)" target="_blank" rel="nofollow" '
                       r'class="bz-down-button">在线预览</a>')
    download_url = repat.findall(download_url_info)
    name = re.sub(r'[\\/∕:*]', '-', name)  # characters like / are not allowed in Windows file names, so replace them

    return download_url[0], name

def download_file(url,name):
    # Skip files that the local cache says were already downloaded.
    with shelve.open('download_list') as cache:
        if url in cache:
            # print(f'{name} already downloaded')
            return
    print(f'{name} -- downloading from {url}')
    res = requests.get(url,headers=headers,stream=True)   # stream=True so the body is read in chunks
    if res.status_code == 200:
        with open(f'{name}.pdf','wb') as f:
            for chunk in res.iter_content(chunk_size=16384):  # 16384 bytes = 16 KB per chunk
                f.write(chunk)
        print(f'{name} downloaded')
        # Record the URL so a rerun can skip this file.
        with shelve.open('download_list') as cache:
            cache[url] = name
    else:
        print(f'{name} download failed')

if __name__ == '__main__':
    for base_page in range(45,928):
        htmlbase = f'https://0ppt.com/bz/index_{base_page}.html'
        print(f"当前访问页面:{htmlbase}")
        res = getcontent_list(htmlbase)
        # print(res)
        for i in res:
            # print(i)
            # if "软件" in i[1]:
            download_url,name= downloadfiles(i)
            download_file(download_url,name)
            sleep(1)

Concurrent processing - TODO
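One possible way to fill in this TODO is to hand the per-file work to a small thread pool, since the job is almost entirely I/O-bound. A rough sketch (handle_entry, crawl_page and the worker count are assumptions; the shelve cache in download_file is not thread-safe as written and would need a lock or a different store first):

from concurrent.futures import ThreadPoolExecutor

def handle_entry(entry):
    # One unit of work: resolve the download link for an entry, then fetch the file.
    download_url, name = downloadfiles(entry)
    download_file(download_url, name)

def crawl_page(page_no, workers=4):
    # Download every file listed on one index page, a few at a time.
    entries = getcontent_list(f'https://0ppt.com/bz/index_{page_no}.html')
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(handle_entry, entries)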

posted @ 2024-04-11 11:51  babyfengfjx