百度贴吧检测User-Agent的奇葩之处
今天在爬去贴吧信息时,只出现部分网页代码,之后各种换工具,依旧是只能显示部分原代码.
1 from urllib import parse, request 2 from bs4 import BeautifulSoup 3 4 class Spider: 5 6 def __init__(self, keyword, begin, end): 7 self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221"} 8 #self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"} 9 self.keyword = keyword 10 self.begin = begin 11 self.end = end+1 12 13 self.visitTieba() 14 15 def visitTieba(self): 16 for page in range(self.begin, self.end): 17 print("正在读取第" + str(page) + "页") 18 pn = 50 * (page - 1) 19 url = "http://tieba.baidu.com/f?" + self.keyword + "&pn=" + str(pn) 20 req = request.Request(url, headers=self.headers) 21 html = request.urlopen(req).read() 22 23 bs = BeautifulSoup(html) 24 25 a_list = bs.select('a') 26 print("匹配a标签的数量为:", len(a_list)) 27 for content in a_list: 28 print(content) 29 30 31 def main(): 32 # keyword = input("请输入要查询的关键字") 33 # begin = int(input("请输入起始页")) 34 # end = int(input("请输入结束页")) 35 keyword = "python" 36 keyword = parse.urlencode({"kw":keyword}) 37 begin = 1 38 end = 1 39 40 spider = Spider(keyword, begin, end) 41 42 if __name__ == "__main__": 43 main()
结果如下:
但是,如果换成另一个user-agent,就是被注释掉的那个,就会有不一样的结果.
能够匹配到的结果超过400条,就不一一列举了,总之,百度贴吧可以屏蔽报头部分信息,所返回的网页源代码,大部分给注释掉了,因此匹配不到所有的东西.