Python Crawler Study Notes (2) ----- User Agents and Proxies

1. UserAgent

  User-Agent (UA for short) is a special header string that lets the server identify the client making the request, such as the browser type and operating system.
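The default UA that urllib sends is easy to inspect without touching the network. A quick sketch (the URL and the `MyCrawler/1.0` string are just illustrative; the exact version suffix depends on your Python):

```python
from urllib import request

# urllib announces itself as "Python-urllib/<version>" unless overridden;
# many sites block this default UA, which is why we set our own.
opener = request.build_opener()
default_ua = dict(opener.addheaders).get('User-agent')
print(default_ua)  # e.g. "Python-urllib/3.11"

# A Request created with custom headers reports them via get_header()
req = request.Request("http://www.baidu.com",
                      headers={'User-Agent': 'MyCrawler/1.0'})
print(req.get_header('User-agent'))  # "MyCrawler/1.0"
```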

  Two ways to set the UA:

  1. Via the headers dict

from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        # Build a headers dict carrying a browser-like User-Agent
        headers = {}
        headers['User-Agent'] = "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)"
        req = request.Request(url, headers=headers)
        rsp = request.urlopen(req)
        html = rsp.read().decode()
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)

  2. Via add_header

from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        req = request.Request(url)
        # Attach the User-Agent to the Request object directly
        req.add_header('User-Agent', "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0")
        rsp = request.urlopen(req)
        html = rsp.read().decode()
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
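A common refinement of `add_header` is to pick the UA at random from a pool on every request, so the traffic looks less uniform. A minimal sketch (the UA strings and the `make_request` helper are illustrative, not from the original post):

```python
import random
from urllib import request

# An illustrative pool of User-Agent strings; in practice you would
# collect real, current browser UAs.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]

def make_request(url):
    """Build a Request carrying a randomly chosen User-Agent."""
    req = request.Request(url)
    req.add_header('User-Agent', random.choice(UA_POOL))
    return req

req = make_request("http://www.baidu.com")
print(req.get_header('User-agent'))  # one of the UA_POOL entries
```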

 2. ProxyHandler (proxy servers)

  Many sites monitor how often a given IP visits within a time window, and if it requests too much too fast, they ban that IP. By routing requests through proxy servers we can keep crawling even after an IP is banned, simply by switching to another proxy. A proxy hides the real client, but the proxy itself must not hammer one URL either, so you need a pool of many proxies.

  Basic usage steps:

    1. Set the proxy address

    2. Create a ProxyHandler

    3. Create an Opener

    4. Install the Opener

  Sites that list proxy server addresses:

    - www.xicidaili.com

    - www.goubanjia.com

from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    # 1. Set the proxy address
    proxy = {'http': '27.203.245.212:8060'}
    # 2. Create a ProxyHandler
    proxy_handler = request.ProxyHandler(proxy)
    # 3. Create an Opener
    opener = request.build_opener(proxy_handler)
    # 4. Install the Opener so urlopen() routes through the proxy
    request.install_opener(opener)

    try:
        req = request.Request(url)
        rsp = request.urlopen(req)
        html = rsp.read().decode('utf-8')
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)

  If the output shows <urlopen error [WinError 10060] the connection attempt failed because the connected party did not properly respond after a period of time, or the connected host has failed to respond.>, just swap in a different proxy IP; free proxies die quickly.
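One way to act on that advice automatically is to try a list of proxies in turn, calling `opener.open()` directly instead of `install_opener()` so each attempt gets its own opener. A sketch, assuming a hypothetical proxy list (the second address is a placeholder, and `fetch_via_proxies` is an illustrative helper):

```python
from urllib import request, error

# Hypothetical proxy pool; in practice, harvest fresh addresses from
# the proxy-list sites above, since free proxies go stale quickly.
PROXIES = ['27.203.245.212:8060', '1.2.3.4:8080']

def fetch_via_proxies(url, proxies, timeout=5):
    """Try each proxy in turn; return the page HTML, or None if all fail."""
    for addr in proxies:
        handler = request.ProxyHandler({'http': addr})
        opener = request.build_opener(handler)
        try:
            rsp = opener.open(url, timeout=timeout)
            return rsp.read().decode('utf-8')
        except (error.URLError, OSError) as e:
            print("proxy %s failed: %s" % (addr, e))
    return None
```

Using `opener.open()` keeps the proxy choice local to each call, whereas `install_opener()` changes the process-wide default for every later `urlopen()`.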

 

posted @ 2018-09-13 08:03 月光男神