[2024] Baidu Index Crawler: Process and Code

To crawl Baidu Index, we first need to find its API by inspecting the requests the web page makes.

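The three key requests are:

/api/AddWordApi/checkWordsExists?word={testwordset} checks whether the keywords exist.
/api/SearchApi/index?area=0&word={words}&area={regionCode}&startDate={startDate}&endDate={endDate} queries the keyword index. The parameter names are self-explanatory, but the returned values are encrypted strings.
/Interface/ptbk?uniqid={uniqid} returns the key that decrypts the previous response. uniqid is one of the fields returned by the previous request.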

Searching the page source for the decrypt function turns up the decryption code. Its logic is simple and can be reimplemented in any other language.
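To make the substitution concrete, here is a minimal sketch that runs the same logic as the decrypt function in the code further down on a made-up key. The key and ciphertext are invented for illustration and are not real Baidu responses. Treating the decrypted string as comma-separated numbers is also an assumption of this sketch.

```python
def decrypt(ptbk: str, index_data: str) -> str:
    """Map each cipher character to the plaintext character at the same position."""
    n = len(ptbk) // 2
    # First half of the key: cipher characters. Second half: plaintext characters.
    mapping = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join(mapping[ch] for ch in index_data)


# Toy key (assumption): cipher alphabet "abcdefghijk", plaintext alphabet "0123456789,"
toy_key = "abcdefghijk" + "0123456789,"
toy_cipher = "bcdkefkg"

plain = decrypt(toy_key, toy_cipher)
# Assumption: the plaintext is a comma-separated list of integers
values = [int(v) for v in plain.split(",")]
print(values)
```

Real responses may contain empty entries for days without data, so a production parser would likely need to handle empty strings before calling int.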

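Next, the parameters. Testing shows that multiple keywords are submitted as a list of lists, one inner list per query term, where each inner list has the form [{"name": keyword, "wordType": 1} for keyword in sublist]. Because Baidu Index supports combined terms and side-by-side comparison, one request can carry at most 5 terms, and each term can join up to three keywords with a plus sign to query their sum.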
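The format and limits described above can be sketched as follows. The helper names build_word_param and chunk_terms are hypothetical and exist only in this example. The limits of 5 terms and 3 keywords come from the paragraph above, and the sketch does not cover any further encoding the real request may need.

```python
import json

MAX_TERMS_PER_REQUEST = 5  # limit stated in the text above
MAX_KEYWORDS_PER_TERM = 3  # limit stated in the text above


def build_word_param(terms):
    """Serialize a list of terms (each a list of keywords) for the word parameter."""
    if len(terms) > MAX_TERMS_PER_REQUEST:
        raise ValueError("at most 5 terms per request")
    for term in terms:
        if not 1 <= len(term) <= MAX_KEYWORDS_PER_TERM:
            raise ValueError("each term needs 1 to 3 keywords")
    payload = [[{"name": kw, "wordType": 1} for kw in term] for term in terms]
    # ensure_ascii=False keeps Chinese keywords readable instead of \u escapes
    return json.dumps(payload, ensure_ascii=False)


def chunk_terms(terms, size=MAX_TERMS_PER_REQUEST):
    """Split a long list of terms into request-sized batches."""
    return [terms[i:i + size] for i in range(0, len(terms), size)]


terms = [["python"], ["java", "golang"]]
print(build_word_param(terms))
```

Validating the limits before sending avoids spending a request, and a retry cycle, on input the server would reject anyway.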

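After decryption, each result becomes an array holding the index for every day in the date range. Note that when the range spans at most 365 days the data is daily, and when it exceeds 365 days the data is weekly. To get daily data over a longer period, issue several queries and merge the results.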
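One way to plan those repeated queries is to cut the full range into consecutive windows that each stay within the daily-data limit. The sketch below assumes the limit counts both endpoints, so a window holds at most 365 calendar days. The function name split_date_range is hypothetical, and if the API turns out to return weekly data at exactly 365 days, a smaller max_days should be passed.

```python
from datetime import date, timedelta


def split_date_range(start: date, end: date, max_days: int = 365):
    """Split the inclusive range [start, end] into windows of at most max_days days."""
    if start > end:
        raise ValueError("start must not be after end")
    windows = []
    cur = start
    while cur <= end:
        # A window of max_days days ends max_days - 1 days after it starts
        win_end = min(cur + timedelta(days=max_days - 1), end)
        windows.append((cur, win_end))
        cur = win_end + timedelta(days=1)
    return windows


for win_start, win_end in split_date_range(date(2022, 1, 1), date(2023, 12, 31)):
    # Assumption: the API takes dates formatted as YYYY-MM-DD
    print(win_start.isoformat(), win_end.isoformat())
```

Because the windows are contiguous and do not overlap, the daily arrays returned for each window can be concatenated in order to rebuild the full series.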

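Finally, authentication. Requests need two values: the BDUSS field of the cookie and the Cipher-Text header. Both are visible in the requests the web page sends and can be copied straight into the code.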

"""
Baidu Index crawler, March 2024
"""

import json
import random
import time

import requests

def generate_http_headers(credential):
    http_headers = {
        'Cookie': 'BDUSS=' + credential["cookie_BDUSS"],
        'Cipher-Text': credential["cipherText"],
        'Referer': 'https://index.baidu.com/v2/main/index.html',
        'Host': 'index.baidu.com',
        # ...
    }
    return http_headers


# Decrypt: the first half of ptbk maps, position by position, to the second half
def decrypt(ptbk, index_data):
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])

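def crawl_request(keywords, startDate, endDate, regionCode, credential, expectedInterval, autoSave):
    # keywords is a list of lists; each inner list is one combined term.
    # keywords2json is a helper that is not shown in this excerpt.
    words = keywords2json(json.dumps([
        [{"name": keyword, "wordType": 1} for keyword in sublist]
        for sublist in keywords
    ], ensure_ascii=False))

    # Terms are separated by commas; keywords within a term by plus signs
    testwordset = ','.join(['+'.join(keyword) for keyword in keywords])
    max_retries = 3  # maximum number of retries
    retries = 0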

    while retries < max_retries:
        try:
            url = f'https://index.baidu.com/api/AddWordApi/checkWordsExists?word={testwordset}'
            rsp = requests.get(url, headers=generate_http_headers(credential), timeout=10).json()
            if rsp['data']['result']:
                # A keyword, or a keyword inside a combined term, does not exist
                return -1

            url = f'https://index.baidu.com/api/SearchApi/index?area=0&word={words}&area={regionCode}&startDate={startDate}&endDate={endDate}'
            rsp = requests.get(url, headers=generate_http_headers(credential), timeout=10).json()

            # Fetch the decryption key
            data = rsp['data']['userIndexes']
            uniqid = rsp['data']['uniqid']
            url = f'https://index.baidu.com/Interface/ptbk?uniqid={uniqid}'
            ptbk = requests.get(url, headers=generate_http_headers(credential), timeout=10).json()['data']

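            # Placeholder with one slot per entry in data; the step that applies
            # decrypt(ptbk, ...) to each encrypted entry appears to be omitted in this excerpt
            res = [0 for _ in range(len(data))]
            return res
        except Exception:
            retries += 1
            time.sleep(random.randint(1, 3))
    # Repeated failures: an account problem or a network problem
    return -1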


regions = {}
with open('./webui/public/city.json', encoding='utf-8') as f:
    regions = json.load(f)

def crawl(keywords, startDate, endDate, regionCode, credential, expectedInterval, autoSave):
    res = {regionCode: []}
    for i in range(0, len(keywords), 5):
        # Query the keywords in groups of 5
        selected_keywords = keywords[i:i + 5]

        t = crawl_request(selected_keywords, startDate, endDate, regionCode, credential, expectedInterval, autoSave)
        if t == -1:
            continue
        res[regionCode].extend(t)
        time.sleep(expectedInterval / 1000 + random.randint(1, 3) / 2)

    return res

The full version, with a GUI and extra features, is on GitHub: https://github.com/Ofnoname/baidu-index-spider

posted @ 2024-06-03 13:06 Ofnoname
