Python crawler study notes (2) ----- proxies
I. UserAgent
UserAgent (UA for short) is a special header string that lets the server identify the client making the request.
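If you set no UA at all, urllib announces itself with a default of the form `Python-urllib/3.x`, which is trivial for servers to detect and block; that is why we override it. A minimal sketch to inspect the default:

```python
from urllib import request

# build_opener() with no arguments creates the default OpenerDirector;
# its addheaders list holds the User-Agent that urllib sends by default.
opener = request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.9')]
```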
Two ways to set the UA:
1. headers
```python
from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        # Pass a custom User-Agent via the headers dict of Request
        headers = {}
        headers['User-Agent'] = "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)"
        req = request.Request(url, headers=headers)
        rsp = request.urlopen(req)
        html = rsp.read().decode()
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
```
2. Using add_header
```python
from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    try:
        req = request.Request(url)
        # Attach the User-Agent after constructing the Request
        req.add_header('User-Agent', "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.33 Safari/534.3 SE 2.X MetaSr 1.0")
        rsp = request.urlopen(req)
        html = rsp.read().decode()
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
```
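With either method, you can confirm the header was attached before sending anything over the network. One detail to know: urllib normalizes header names with `str.capitalize()`, so `'User-Agent'` is stored under the key `'User-agent'`:

```python
from urllib import request

req = request.Request("http://www.baidu.com")
req.add_header('User-Agent', "Mozilla/5.0 (test)")

# add_header() capitalizes the key, so look it up as 'User-agent'
print(req.get_header('User-agent'))  # Mozilla/5.0 (test)
print(req.has_header('User-agent'))  # True
```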
II. ProxyHandler (proxy servers)
Many websites monitor how many times a given IP accesses them within a period of time, and ban an IP whose requests are too many or too frequent. By routing requests through proxy servers, we can switch to another IP and keep crawling even after one IP is banned. A proxy hides the real client, but a single proxy must not hammer one URL too frequently either, so you need plenty of proxies.
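A common pattern for "plenty of proxies" is to keep a pool of addresses and pick one at random per request. A minimal sketch, where the pool entries and the helper name `random_opener` are illustrative placeholders rather than live proxies:

```python
import random
from urllib import request

# Hypothetical proxy pool; replace with live addresses from a proxy list site
proxy_pool = [
    {'http': '27.203.245.212:8060'},
    {'http': '61.135.217.7:80'},
    {'http': '118.190.95.35:9001'},
]

def random_opener(pool):
    """Build an opener routed through a randomly chosen proxy from the pool."""
    proxy = random.choice(pool)
    handler = request.ProxyHandler(proxy)
    return request.build_opener(handler), proxy

opener, chosen = random_opener(proxy_pool)
print(chosen)  # whichever proxy was picked this time
```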
Basic usage steps:
1. Set the proxy address
2. Create a ProxyHandler
3. Create an Opener
4. Install the Opener
Where to get proxy server addresses:
- www.xicidaili.com
- www.goubanjia.com
```python
from urllib import request, error

if __name__ == '__main__':
    url = "http://www.baidu.com"
    # 1. Set the proxy address
    proxy = {'http': '27.203.245.212:8060'}
    # 2. Create a ProxyHandler
    proxy_handler = request.ProxyHandler(proxy)
    # 3. Create an Opener
    opener = request.build_opener(proxy_handler)
    # 4. Install the Opener: all later urlopen calls go through the proxy
    request.install_opener(opener)

    try:
        req = request.Request(url)
        rsp = request.urlopen(req)
        html = rsp.read().decode('utf-8')
        print(html)
    except error.HTTPError as e:
        print(e)
    except error.URLError as e:
        print(e)
    except Exception as e:
        print(e)
```
If the output shows `<urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time...>`, the proxy is dead; replace the IP address in the `proxy` dictionary with another one.
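That manual swap can be automated: try the proxies in the pool one by one and move on whenever a connection fails. A minimal sketch; the function name and the injectable `fetch` parameter (which lets the retry loop be exercised without a network) are my own additions, not from the original notes:

```python
from urllib import request, error

def fetch_with_failover(url, proxy_pool, fetch=None):
    """Try each proxy in turn; return the first successful response body."""
    if fetch is None:
        # Default: a real request routed through the given proxy
        def fetch(url, proxy):
            opener = request.build_opener(request.ProxyHandler(proxy))
            return opener.open(url, timeout=10).read().decode('utf-8')

    last_err = None
    for proxy in proxy_pool:
        try:
            return fetch(url, proxy)
        except (error.URLError, OSError) as e:
            last_err = e  # dead proxy: fall through to the next one
    raise last_err or RuntimeError("empty proxy pool")
```

With a live pool you would call `fetch_with_failover('http://www.baidu.com', pool)`; the WinError 10060 case above then just advances the loop to the next proxy instead of killing the run.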