Rscan target collection: Bing (Bing URL collection)

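This post extracts the Bing URL collection module from Rscan as a standalone script.
Runtime environment: python3
Required modules: requests, json
Bing account required: yes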

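To keep the collected URLs accurate, this script calls the official API instead of scraping result pages, so you need to apply for a Bing API account. No personal information is required, and the account is free for 7 days after you apply.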

Sign-up page: https://azure.microsoft.com/en-us/free/
API documentation: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-custom-search/call-endpoint-python

The account you receive comes with a key that is free for 7 days. Put it in the subscriptionKey argument of the script.
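
If you would rather not hard-code the key in the script, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name BING_SUBSCRIPTION_KEY and both helper functions are assumptions of this example and are not part of Rscan. Only the Ocp-Apim-Subscription-Key header name comes from the script itself.

```python
import os


def load_subscription_key(env_var="BING_SUBSCRIPTION_KEY"):
    """Read the API key from the environment (the variable name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError("Set %s to your Bing API key first" % env_var)
    return key


def auth_headers(key):
    """Build the request header that the script below sends with every call."""
    return {"Ocp-Apim-Subscription-Key": key}
```

The returned key can then be passed as subscriptionKey when creating BingSearch_API.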

Parameters:
searchTerm: the search keyword
result_count: the number of URLs to collect
subscriptionKey: the API key
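
To show how searchTerm and result_count turn into actual requests, here is a minimal sketch that computes the page offsets and builds URL-encoded query strings without sending anything. The helper names page_offsets and build_search_url are made up for this example. The endpoint, the page size of 20 and the mkt value are taken from the script below.

```python
from urllib.parse import urlencode

ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
PAGE_SIZE = 20  # the script requests 20 results per call


def page_offsets(result_count, page_size=PAGE_SIZE):
    """Return the offset of every page needed to cover result_count results."""
    pages = (result_count + page_size - 1) // page_size  # round up
    return [page * page_size for page in range(pages)]


def build_search_url(search_term, offset, page_size=PAGE_SIZE, mkt='en-us'):
    """Build the request URL with the query string properly URL-encoded."""
    query = urlencode({'q': search_term, 'count': page_size,
                       'offset': offset, 'mkt': mkt})
    return ENDPOINT + '?' + query


for offset in page_offsets(50):
    print(build_search_url('site:163.com', offset))
```

Rounding up matters here: a result_count of 50 needs three pages (offsets 0, 20 and 40), and encoding the keyword keeps characters such as ":" or spaces from breaking the query string.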

# coding:utf-8
import json
import requests


class BingSearch_API(object):

    def __init__(self, searchTerm, result_count, subscriptionKey):
        self.subscriptionKey = subscriptionKey
        self.searchTerm = searchTerm
        self.result_count = int(result_count)


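    def get_total(self):
        # Request the first page only to read Bing's estimated number of matches.
        url = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
        # Passing params lets requests URL-encode the keyword (':' , spaces, etc.).
        params = {'q': self.searchTerm, 'count': 20, 'offset': 0, 'mkt': 'en-us'}
        r = requests.get(url, params=params, headers={'Ocp-Apim-Subscription-Key': self.subscriptionKey}, timeout=10)
        r.raise_for_status()  # fail with the HTTP error (e.g. a rejected key) instead of a KeyError
        r_json = json.loads(r.text)
        # 'webPages' may be missing when the query returns no web results.
        total = r_json.get('webPages', {}).get('totalEstimatedMatches', 0)

        return total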

    def requster(self, offset):
        # Fetch one page of up to 20 results, starting at the given offset.
        result = []
        url = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
        params = {'q': self.searchTerm, 'count': 20, 'offset': offset, 'mkt': 'en-us'}
        r = requests.get(url, params=params, headers={'Ocp-Apim-Subscription-Key': self.subscriptionKey}, timeout=10)
        r.raise_for_status()
        r_json = json.loads(r.text)
        # 'webPages' may be missing when there are no web results at this offset.
        for item in r_json.get('webPages', {}).get('value', []):
            # Each entry is stored as a [title, url] pair.
            result.append([item['name'], item['url']])
        return result

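    def run(self):
        all_result = []
        total = self.get_total()
        # The first entry is a header row: [estimated total matches, requested count].
        all_result.append([str(total), str(self.result_count)])
        # Round the page count up; round() could drop the last page (round(2.5) == 2).
        pages = (self.result_count + 19) // 20
        for page in range(pages):
            all_result.extend(self.requster(page * 20))
        # Keep the header row plus at most result_count entries.
        return all_result[:self.result_count + 1]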


if __name__ == '__main__':
    obj_bing = BingSearch_API(searchTerm='site:163.com', result_count=100, subscriptionKey='')
    result = obj_bing.run()
    print(result)
    print(len(result))
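
Since run() returns the header row first and the [title, url] pairs after it, a small post-processing step is often useful. The following is a rough sketch rather than part of Rscan: split_run_output is a hypothetical helper, and the sample list stands in for real API output.

```python
def split_run_output(all_result):
    """Separate the header row from the result rows and drop duplicate URLs."""
    if not all_result:
        return None, []
    header, rows = all_result[0], all_result[1:]
    seen = set()
    urls = []
    for _title, url in rows:
        if url not in seen:  # keep only the first occurrence, preserving order
            seen.add(url)
            urls.append(url)
    return header, urls


# Made-up sample in the same shape as the list returned by run().
sample = [
    ['1000', '100'],
    ['Example A', 'https://example.com/a'],
    ['Example B', 'https://example.com/b'],
    ['Example A again', 'https://example.com/a'],
]
header, urls = split_run_output(sample)
print(header)
print(urls)
```

Keeping the header apart avoids mistaking the two counts for a title and a URL when the list is passed on to other modules.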

posted @ 2019-09-04 13:55  reuodut