Rscan Target Collection with Bing (Bing URL Collection)

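This post extracts the Bing URL collection module from Rscan as a standalone script.
Runtime: python3
Required modules: requests, json (json ships with the standard library)
Bing account required: yes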

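To keep the collected URLs accurate, this script calls the official API rather than scraping result pages, so you need to apply for a Bing API account. No personal information is required, and the account is free for 7 days after you apply.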

Sign-up page: https://azure.microsoft.com/en-us/free/
API documentation: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-custom-search/call-endpoint-python

The account you receive includes a key, free for 7 days. Put it in the script's subscriptionKey.
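Hardcoding the key in the script is the simplest option, but it is easy to leak when the script is shared. As a minimal sketch of one alternative, the helper below reads the key from an environment variable. The function name load_subscription_key and the variable name BING_SUBSCRIPTION_KEY are assumptions made for this example and are not part of Rscan.

```python
import os


def load_subscription_key(env=os.environ, var_name='BING_SUBSCRIPTION_KEY'):
    """Return the API key stored in env[var_name], or raise if it is missing.

    Both the helper and the variable name are hypothetical; pick any name
    that fits your own setup.
    """
    key = env.get(var_name, '').strip()
    if not key:
        raise RuntimeError('Set ' + var_name + ' to your Bing API key first')
    return key


# A plain dict stands in for os.environ here so the example runs anywhere.
print(load_subscription_key({'BING_SUBSCRIPTION_KEY': 'example-key'}))
```

The returned value can then be passed as subscriptionKey when constructing BingSearch_API.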

Parameters:
searchTerm: the search keyword
result_count: the number of URLs to collect
subscriptionKey: the API key
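To make the parameters concrete, the sketch below shows how searchTerm and result_count could translate into paged requests, assuming 20 results per request as in the script. The helpers page_offsets and build_search_url are hypothetical names used only for illustration. The subscriptionKey is not part of the URL; the script sends it in the Ocp-Apim-Subscription-Key header.

```python
from urllib.parse import urlencode

ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
PAGE_SIZE = 20  # the script asks for 20 results per request


def page_offsets(result_count, page_size=PAGE_SIZE):
    """Return the offset of every request needed to cover result_count results."""
    return list(range(0, int(result_count), page_size))


def build_search_url(search_term, offset, page_size=PAGE_SIZE):
    """Return the request URL for one page; urlencode escapes characters such as ':'."""
    query = urlencode({
        'q': search_term,
        'count': page_size,
        'offset': offset,
        'mkt': 'en-us',
    })
    return ENDPOINT + '?' + query


print(page_offsets(100))  # [0, 20, 40, 60, 80]
print(build_search_url('site:163.com', 20))
```

Building the query string with urlencode rather than string concatenation keeps search terms that contain spaces or reserved characters valid.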

# coding:utf-8
import json
import requests


class BingSearch_API(object):

    def __init__(self, searchTerm, result_count, subscriptionKey):
        self.subscriptionKey = subscriptionKey
        self.searchTerm = searchTerm
        self.result_count = int(result_count)


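    def get_total(self):
        # Request the first page only to read Bing's estimated match count.
        url = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
        # Passing params lets requests URL-encode the search term.
        params = {'q': self.searchTerm, 'count': 20, 'offset': 0, 'mkt': 'en-us'}
        r = requests.get(url, params=params, headers={'Ocp-Apim-Subscription-Key': self.subscriptionKey}, timeout=10)
        r.raise_for_status()  # fail early on a rejected key or other HTTP error
        r_json = json.loads(r.text)
        # Fall back to 0 if the response carries no 'webPages' section.
        total = r_json.get('webPages', {}).get('totalEstimatedMatches', 0)

        return total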

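    def requster(self, offset):
        # Fetch one page of up to 20 results starting at the given offset.
        result = []
        url = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
        # Passing params lets requests URL-encode the search term.
        params = {'q': self.searchTerm, 'count': 20, 'offset': offset, 'mkt': 'en-us'}
        r = requests.get(url, params=params, headers={'Ocp-Apim-Subscription-Key': self.subscriptionKey}, timeout=10)
        r.raise_for_status()  # fail early on a rejected key or other HTTP error
        r_json = json.loads(r.text)
        # Treat a response without a 'webPages' section as an empty page.
        for item in r_json.get('webPages', {}).get('value', []):
            # Each entry is a [title, url] pair.
            result.append([item['name'], item['url']])
        return result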

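    def run(self):
        all_result = []
        total = self.get_total()
        # The first entry is a header: [estimated total, requested count].
        all_result.append([str(total), str(self.result_count)])
        # Ceiling division: round() rounds halves to even, so round(50/20)
        # is 2 and the last page would be skipped.
        for page in range(0, (self.result_count + 19) // 20):
            result = self.requster(page * 20)
            for item in result:
                all_result.append(item)
        return all_result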


if __name__ == '__main__':
    obj_bing = BingSearch_API(searchTerm='site:163.com', result_count=100, subscriptionKey='')
    result = obj_bing.run()
    print(result)
    print(len(result))
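The list returned by run() starts with a [total, result_count] header entry, followed by [name, url] pairs. For target collection it is often the hostnames that matter, so here is a minimal sketch of extracting unique hosts from that layout. The helper extract_hosts is hypothetical and not part of Rscan, and the sample data is made up for illustration rather than taken from a real API response.

```python
from urllib.parse import urlparse


def extract_hosts(all_result):
    """Return the unique hostnames in run()'s output, in first-seen order."""
    hosts = []
    seen = set()
    # Skip the first entry, which holds [total, result_count] rather than a result.
    for _name, url in all_result[1:]:
        host = urlparse(url).hostname
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


# Made-up sample in the same shape that run() returns.
sample = [
    ['1200', '100'],
    ['NetEase', 'https://www.163.com/'],
    ['NetEase Mail', 'https://mail.163.com/index.html'],
    ['NetEase News', 'https://www.163.com/news/'],
]
print(extract_hosts(sample))  # ['www.163.com', 'mail.163.com']
```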

posted @ 2019-09-04 13:55 reuodut
