[BOOK] Scraping Ajax Data

Requests fetches only the raw HTML document, so data loaded through Ajax or generated by JavaScript is not in the response.

I. Ajax

Ajax: Asynchronous JavaScript and XML

Updating a page through an Ajax request takes three steps:

　　1. Send the request

　　2. Parse the content

　　3. Render the page

JavaScript sends the Ajax request to the server.
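
The three steps can be mimicked outside the browser. The sketch below is only a rough Python analogy, not how a browser implements Ajax: `fake_send_request` is a hypothetical stand-in that returns a canned JSON string instead of contacting a server, and the `/api/feed` path and the `items` key are made up for illustration.

```python
import json

def fake_send_request(url):
    # Hypothetical stand-in for step 1: a real page would issue an XMLHttpRequest here.
    return '{"items": ["first post", "second post"]}'

def parse_content(body):
    # Step 2: turn the response text into Python objects.
    return json.loads(body)["items"]

def render(items):
    # Step 3: build the HTML fragment that would be inserted into the page.
    return "<ul>" + "".join(f"<li>{item}</li>" for item in items) + "</ul>"

html = render(parse_content(fake_send_request("/api/feed")))
print(html)
```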

II. Analyzing Ajax Requests

Viewing Ajax requests

Weibo (not logged in): https://weibo.com/login.php

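The Ajax loading process can be observed as follows:

　　※ Chrome: right-click > Inspect > Network > refresh the page > XHR (filters for Ajax requests) > Request Headers > X-Requested-With: XMLHttpRequest (marks the request as Ajax)

※ Preview tab: shows the response content, in JSON format

※ Response tab: the data returned by the server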
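
The header check can also be done in code. This is a minimal sketch, assuming the headers were copied by hand from the Request Headers panel into a dict; `is_ajax_request` is a hypothetical helper and the header values are illustrative. The header is set by convention, so a request without it may still be an Ajax request.

```python
def is_ajax_request(headers):
    """Return True if the headers carry X-Requested-With: XMLHttpRequest."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-requested-with", "").strip().lower() == "xmlhttprequest"

# Headers copied by hand from the Request Headers panel (values are illustrative).
ajax_headers = {"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"}
page_headers = {"Accept": "text/html"}

print(is_ajax_request(ajax_headers))  # True
print(is_ajax_request(page_headers))  # False
```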

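III. Extracting Ajax Results

Simulate the Ajax request with Python

1. Analyze the request

　　It is a GET request

2. Analyze the response

　　Preview: the response content of the request (JSON format)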

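IV. Example: Scraping Toutiao Street-Snap (街拍) Photos by Analyzing Ajax

1. Confirm that the page is rendered through Ajax

2. Analyze where the images are: in the response, data > image_list holds the list of images

Simulate the Ajax request with Python, extract the image links, and download them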
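
To make the data > image_list structure concrete, here is a small sketch that walks a hand-written sample response. The sample values and the `iter_image_urls` helper are made up for illustration; only the key names (`data`, `title`, `image_list`, `url`) follow the response analyzed in this post.

```python
def iter_image_urls(payload):
    """Yield (title, url) pairs from a response shaped like data > image_list."""
    for item in payload.get("data") or []:
        for image in item.get("image_list") or []:
            url = image.get("url")
            if url:
                yield item.get("title"), url

# Hand-written sample; real responses carry many more fields.
sample = {
    "data": [
        {"title": "street style A", "image_list": [{"url": "https://example.com/a1.jpg"},
                                                    {"url": "https://example.com/a2.jpg"}]},
        {"title": "no images here"},  # entries without image_list are skipped
        {"title": "street style B", "image_list": [{"url": "https://example.com/b1.jpg"}]},
    ]
}

pairs = list(iter_image_urls(sample))
print(pairs)
```

Using `or []` keeps the loops safe when `data` or `image_list` is missing or null.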

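3. Analyze the URL

Scroll down to load more results, compare the URLs of several Ajax requests, and look for a pattern

Apart from timestamp and _signature, only the offset parameter changes: it increases by 20 with each request and controls pagination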

https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1585660596769&_signature=wUHhuAAgEBAHFlg1-6xHd8FAoKAAJ.MtGLLCexBTF5ih1sbSlwZ6e6O7mqf0TYciKVg6zRGWz5yYRSmYSappEBMXSgxH6j9KQ1s8n4Qe3bCaaCn8.WylEGW8dlfapG8Y460

https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1585661780202&_signature=1cySjAAgEBATmysBz6cvIdXN05AAItYlKpL9iVK055shpH3Z6a5hUrr7JckbLG0H5sRanhNGjZzRNJ4z.rras5WORz02Nj7BsOTZQgvPbYiNfqEB21jNz0Qto1-vLMmMcac
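
The offset pattern can be reproduced with `urlencode`. This is a sketch under assumptions: `build_page_url` is a hypothetical helper, only a few of the query parameters are included, and `timestamp` and `_signature` are left out, so whether the server accepts such a URL is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.toutiao.com/api/search/content/?"

def build_page_url(offset, keyword="街拍", count=20):
    # Only a subset of the query parameters is shown; timestamp and _signature are left out.
    params = {"offset": offset, "format": "json", "keyword": keyword, "count": count}
    return BASE_URL + urlencode(params)

# Offsets 0, 20, 40 correspond to the first three pages.
urls = [build_page_url(page * 20) for page in range(3)]
for url in urls:
    print(url)
```

`urlencode` percent-encodes the keyword as UTF-8, which is why 街拍 shows up as `%E8%A1%97%E6%8B%8D` in the captured URLs.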

4. Implementation

## Ajax
## Scrape Toutiao street-snap photos by analyzing Ajax requests
## https://www.toutiao.com/search/?keyword=街拍

import requests
from urllib.parse import urlencode
import os
from hashlib import md5
from multiprocessing.pool import Pool
import re

## get_page(offset) loads the result of a single Ajax request
def get_page(offset):
    ## https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1585660596769&_signature=wUHhuAAgEBAHFlg1-6xHd8FAoKAAJ.MtGLLCexBTF5ih1sbSlwZ6e6O7mqf0TYciKVg6zRGWz5yYRSmYSappEBMXSgxH6j9KQ1s8n4Qe3bCaaCn8.WylEGW8dlfapG8Y460
    ## Build the GET parameters from the Ajax request URL
    params = {
        'aid':'24',
        'app_name':'web_search',
        'offset':offset,
        'format':'json',
        'keyword':'动漫',  # search keyword (the sample URL above searches for 街拍)
        'autoload':'true',
        'count':'20',
        'en_qc':'1',
        'cur_tab':'1',
        'from':'search_tab',
        'pd':'synthesis'
    }
    # urlencode() builds the query string of the GET request
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    headers = {
        'cookie': 'tt_webid=6810359770231817736; WEATHER_CITY=%E5%8C%97%E4%BA%AC; tt_webid=6810359770231817736; ttcid=0f11652ca96f4a1a8249ea9bdbeb09f966; SLARDAR_WEB_ID=09143091-be35-48e5-92cf-4b7becffd8e2; csrftoken=b991c8d1b8be5a8d3cf5d728b03eb114; s_v_web_id=verify_k8julwio_iBgoIo8r_pGRP_4qox_AXgR_SvNnatEkPaJK; tt_scid=ePFuQLeJD7KoPt53REkDvgczszTB-q1riJi0iZFd3tjgLyXBkW5z2FVlX8RWhzvadc7b',
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response.json()
    except (requests.RequestException, ValueError) as e:
        ## ValueError covers a response body that is not valid JSON
        print('Request failed:', e)
    ## Return None on failure so the caller can skip this page
    return None

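## Extract the titles and image links
def get_images(json):
    ## json is None when get_page() failed, and 'data' may be missing or null
    if json and json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_list')
            if images:
                for image in images:
                    ## Yield one dict per image (this makes get_images a generator)
                    yield {
                        'image':image.get('url'),
                        'title':title
                    }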

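## Save an image
## item is one dict yielded by get_images()
def save_image(item):
    ## Use the item's title as the folder name, minus characters not allowed in file names
    ## ('untitled' is a fallback for items without a title)
    title = re.sub(r'[/:*?"<>|\\]', '', item.get('title') or 'untitled')
    try:
        ## exist_ok avoids a race when several worker processes create the same folder
        os.makedirs(title, exist_ok=True)
        ## Request the image link
        response = requests.get(item.get('image'), timeout=30)
        if response.status_code != 200:
            print('Failed to save image!')
            return
        file_path = '{0}/{1}.{2}'.format(title, md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            ## Write the image to the file in binary mode
            with open(file_path, 'wb') as f:
                f.write(response.content)
        else:
            print('Already downloaded', file_path)
    except (requests.RequestException, OSError) as e:
        print('Failed to save image:', e)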

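def main(offset):
    json = get_page(offset)
    ## get_page() returns None on failure; skip the page in that case
    if json is None:
        return
    for item in get_images(json):
        print(item)
        save_image(item)

## First and last page numbers; the offset of page x is x * 20
## (GROUP_START = 1 skips the first page at offset 0; use 0 to include it)
GROUP_START = 1
GROUP_END = 20

## Pool is a pool of worker processes, not threads
if __name__ == '__main__':
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END+1)]
    ## map() runs main() for each offset in parallel across the worker processes
    pool.map(main, groups)
    pool.close()
    pool.join()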
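
save_image() names each file after the MD5 digest of its content, so the same image downloaded twice maps to the same file name and is skipped. A minimal sketch of that idea, with a hypothetical helper `content_file_name` and made-up byte strings standing in for image data:

```python
from hashlib import md5

def content_file_name(content, ext="jpg"):
    # Identical bytes always hash to the same digest, so duplicates map to one name.
    return "{0}.{1}".format(md5(content).hexdigest(), ext)

first = content_file_name(b"fake image bytes")
again = content_file_name(b"fake image bytes")
other = content_file_name(b"different bytes")

print(first == again)  # True
print(first == other)  # False
```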

Results:

Two errors came up!

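1. TypeError: 'NoneType' object is not iterable

The value being iterated may be None (for example, the data field of the response can be null, and get_page() returns None when the request fails), so it must be checked before iterating. Note that get_images() is a generator function: calling it always returns a generator object, which is always truthy, so `if get_images(json)` does not guard anything. Check the value itself instead:

    ## Skip the page when get_page() returned None
    json = get_page(offset)
    if json is None:
        return
    for item in get_images(json):
        print(item)
        save_image(item)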
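
One caveat worth checking here: a generator function returns a generator object without running its body, so a truthiness test on the result says nothing about whether there is data. The sketch below demonstrates this with `get_items`, a hypothetical stand-in for get_images(), and `safe_items`, a hypothetical wrapper that checks the payload first.

```python
def get_items(payload):
    # Generator function: the body does not run until the result is iterated.
    for item in payload.get("data") or []:
        yield item

gen = get_items(None)        # no error yet, even though payload is None
print(bool(gen))             # True: a generator object is always truthy

try:
    list(gen)                # the body runs now and fails on None.get
    failed = False
except AttributeError:
    failed = True
print(failed)                # True

def safe_items(payload):
    # Check the payload itself before handing it to the generator.
    return list(get_items(payload)) if payload else []

print(safe_items(None))                  # []
print(safe_items({"data": [1, 2]}))      # [1, 2]
```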

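2. OSError: [Errno 22] 文件名、目录名或卷标语法不正确。: '最近刚完结的一部高分动漫，你有看过吗?' (The filename, directory name, or volume label syntax is incorrect.)

The scraped title is used directly as the folder name, and it may contain characters that are not allowed in file names (here, the trailing ?). These characters must be removed first.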

import re

## Raw strings keep the backslashes literal and avoid invalid-escape warnings
s = r'最近:刚/完\结***的一部高|||分动???漫，你有看过吗?'
title = re.sub(r'[/:*?"<>|\\]', '', s)
print(title)  ## 最近刚完结的一部高分动漫，你有看过吗

posted @ 2020-04-03 18:07 kuluma
