百度贴吧检测User-Agent的奇葩之处

  今天在爬去贴吧信息时,只出现部分网页代码,之后各种换工具,依旧是只能显示部分原代码.

 1 from urllib import parse, request
 2 from bs4 import BeautifulSoup
 3 
 4 class Spider:
 5 
 6     def __init__(self, keyword, begin, end):
 7         self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221"}
 8         #self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"}
 9         self.keyword = keyword
10         self.begin = begin
11         self.end = end+1
12 
13         self.visitTieba()
14 
15     def visitTieba(self):
16         for page in range(self.begin, self.end):
17             print("正在读取第" + str(page) + "")
18             pn = 50 * (page - 1)
19             url = "http://tieba.baidu.com/f?" + self.keyword + "&pn=" + str(pn)
20             req = request.Request(url, headers=self.headers)
21             html = request.urlopen(req).read()
22 
23             bs = BeautifulSoup(html)
24 
25             a_list = bs.select('a')
26             print("匹配a标签的数量为:", len(a_list))
27             for content in a_list:
28                 print(content)
29 
30 
31 def main():
32     # keyword = input("请输入要查询的关键字")
33     # begin = int(input("请输入起始页"))
34     # end = int(input("请输入结束页"))
35     keyword = "python"
36     keyword = parse.urlencode({"kw":keyword})
37     begin = 1
38     end = 1
39 
40     spider = Spider(keyword, begin, end)
41 
42 if __name__ == "__main__":
43     main()

结果如下:

 

  但是,如果换成另一个user-agent,就是被注释掉的那个,就会有不一样的结果.

 

  能够匹配到的结果超过400条,就不一一列举了,总之,百度贴吧可以屏蔽报头部分信息,所返回的网页源代码,大部分给注释掉了,因此匹配不到所有的东西.

posted @ 2018-05-06 22:51  六师兄丶  阅读(534)  评论(0编辑  收藏  举报