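Rscan target collection: Bing (Bing URL collection)
This article extracts the Bing URL collection module from Rscan as a standalone script.
Runtime environment: Python 3
Required modules: requests (third-party), json (standard library)
Bing account required: yes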
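To get accurate URLs, this script calls the official API, so you need to apply for a Bing API account. No personal information is required, and the account is free for 7 days after you apply.
Sign-up page: https://azure.microsoft.com/en-us/free/
API documentation: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-custom-search/call-endpoint-python
The account comes with a key that is free for 7 days; put it in the script's subscriptionKey parameter.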
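
As an alternative to pasting the key into the script, here is a minimal sketch that reads it from an environment variable. The variable name BING_SUBSCRIPTION_KEY and the helper load_subscription_key are assumptions made for illustration; neither is part of Rscan.

```python
import os


def load_subscription_key(env_var='BING_SUBSCRIPTION_KEY'):
    """Read the API key from an environment variable (hypothetical helper).

    The variable name is an assumption for this sketch, not something
    the original script defines.
    """
    key = os.environ.get(env_var, '').strip()
    if not key:
        raise ValueError('Set the %s environment variable to your API key' % env_var)
    return key
```

The returned value can then be passed as subscriptionKey, which keeps the key out of the source file.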
Parameters:
searchTerm: the search keyword
result_count: the number of URLs to collect
subscriptionKey: the API key
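
A rough sketch of how these parameters could translate into paged requests, assuming 20 results per request (the count=20 value used in the script). The helpers page_offsets and build_query are hypothetical names introduced here for illustration; no network call is made.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # assumed page size, matching count=20 in the script


def page_offsets(result_count, page_size=PAGE_SIZE):
    """Offsets needed to cover result_count results (hypothetical helper)."""
    return list(range(0, int(result_count), page_size))


def build_query(searchTerm, offset, page_size=PAGE_SIZE):
    """URL-encoded query string for one page (hypothetical helper)."""
    return urlencode({'q': searchTerm, 'count': page_size,
                      'offset': offset, 'mkt': 'en-us'})


if __name__ == '__main__':
    # One query string per page for result_count=100
    for offset in page_offsets(100):
        print(build_query('site:163.com', offset))
```

Encoding the search term matters once it contains spaces or characters such as a colon, which is why the sketch goes through urlencode rather than string concatenation.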
# coding:utf-8
import json

import requests


class BingSearch_API(object):
    API_URL = 'https://api.cognitive.microsoft.com/bing/v7.0/search'
    PAGE_SIZE = 20  # results requested per API call

    def __init__(self, searchTerm, result_count, subscriptionKey):
        self.subscriptionKey = subscriptionKey
        self.searchTerm = searchTerm
        self.result_count = int(result_count)

    def fetch(self, offset):
        # Request one page; passing params lets requests URL-encode the search term
        params = {'q': self.searchTerm, 'count': self.PAGE_SIZE,
                  'offset': offset, 'mkt': 'en-us'}
        headers = {'Ocp-Apim-Subscription-Key': self.subscriptionKey}
        r = requests.get(self.API_URL, params=params, headers=headers)
        r.raise_for_status()  # fail early on an HTTP error such as a rejected key
        return json.loads(r.text)

    def get_total(self):
        # Estimated total number of matches reported by the API
        r_json = self.fetch(0)
        return r_json.get('webPages', {}).get('totalEstimatedMatches', 0)

    def requster(self, offset):
        # Return [name, url] pairs for the page starting at the given offset
        result = []
        r_json = self.fetch(offset)
        # 'webPages' may be missing from the response, so fall back to an empty list
        for item in r_json.get('webPages', {}).get('value', []):
            result.append([item['name'], item['url']])
        return result

    def run(self):
        all_result = []
        total = self.get_total()
        # First row is a header: [estimated total matches, requested count]
        all_result.append([str(total), str(self.result_count)])
        # Ceiling division, so a partial last page is still fetched
        pages = (self.result_count + self.PAGE_SIZE - 1) // self.PAGE_SIZE
        for page in range(pages):
            all_result.extend(self.requster(page * self.PAGE_SIZE))
        # Keep the header row plus at most result_count entries
        return all_result[:self.result_count + 1]


if __name__ == '__main__':
    obj_bing = BingSearch_API(searchTerm='site:163.com', result_count=100, subscriptionKey='')
    result = obj_bing.run()
    print(result)
    print(len(result))
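
Since run() returns a header row followed by [name, url] pairs, a small post-processing step is usually needed before feeding the URLs to another tool. The following is a minimal sketch of one way to do that; extract_urls is a hypothetical helper, and the sample data is made up to match the shape of the result, not real API output.

```python
def extract_urls(all_result):
    """Drop the header row and return unique URLs in first-seen order.

    Hypothetical helper: assumes the list shape returned by run(),
    i.e. [total, result_count] first, then [name, url] pairs.
    """
    seen = set()
    urls = []
    for name, url in all_result[1:]:
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == '__main__':
    # Made-up sample in the shape run() returns; not real API output
    sample = [
        ['1234', '3'],
        ['Example A', 'https://a.example.com/'],
        ['Example B', 'https://b.example.com/'],
        ['Example A again', 'https://a.example.com/'],
    ]
    for url in extract_urls(sample):
        print(url)
```

Deduplicating with a set while appending to a list keeps the order in which Bing ranked the results.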