The Difference Between Python's urllib and urllib2
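As a Python newbie, I was long confused about urllib and urllib2 and assumed that the "2" was simply an upgraded version of the first. Only after reading an English article titled "Python: difference between urllib and urllib2" did I understand how they differ.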
You might be intrigued by the existence of two separate URL modules in Python: urllib and urllib2. Even more intriguing: they are not alternatives for each other. So what is the difference between urllib and urllib2, and do we need them both?
urllib and urllib2 are both Python modules that handle URL requests, but they offer different functionality. Their two most significant differences are listed below:
urllib2 can accept a Request object to set the headers for a URL request, while urllib accepts only a URL. That means that with urllib alone you cannot masquerade your User-Agent string and so on.
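The header-setting behaviour described above can be sketched without touching the network. This is a minimal illustration, not the article's own code: the User-Agent value and the helper name build_request are hypothetical, and the import fallback assumes that on Python 3 the Request class lives in urllib.request.

```python
# Works on Python 2 (urllib2) and Python 3 (urllib.request).
try:
    from urllib2 import Request          # Python 2
except ImportError:
    from urllib.request import Request   # Python 3

# Hypothetical User-Agent string, chosen only for illustration.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot/1.0)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a custom User-Agent header (no network I/O)."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://pythoneye.com")
# Request stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting object to urlopen would send the request with that header; nothing is sent until then.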
urllib provides the urlencode function, which is used to generate GET query strings; urllib2 has no such function. This is one of the reasons why urllib is often used along with urllib2.
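As a small sketch of what urlencode produces, the snippet below builds a GET URL from a list of parameters. The helper name build_get_url is hypothetical, and the import fallback assumes that on Python 3 the function lives in urllib.parse.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def build_get_url(base_url, params):
    """Append params to base_url as a URL-encoded query string."""
    # A list of pairs keeps the parameter order deterministic.
    return base_url + "?" + urlencode(params)


full_url = build_get_url(
    "http://pythoneye.com/example.cgi",
    [("name", "Somebody Here"), ("language", "Python")],
)
print(full_url)
# http://pythoneye.com/example.cgi?name=Somebody+Here&language=Python
```

Note that urlencode replaces spaces with "+" and escapes reserved characters, which is why the query string should not be assembled by hand.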
For other differences between urllib and urllib2, refer to their documentation; the links are given in the References section.
Tip: if you are planning to do HTTP stuff only, check out httplib2; it is much better than httplib, urllib, or urllib2.
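httplib implements the client side of the HTTP and HTTPS protocols. It is usually not used directly; the higher-level Python modules (urllib, urllib2) build on its HTTP implementation.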

import httplib

conn = httplib.HTTPConnection("google.com")
conn.request('GET', '/')  # the method name is sent as given, so use uppercase
print conn.getresponse().read()
conn.close()

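httplib.HTTPConnection(host[, port[, strict[, timeout]]])

This is the constructor of the HTTPConnection class, which represents one exchange with a server, that is, a request and its response. host is the server host, for example www.csdn.net (a bare host name, without the http:// scheme); port is the port number, 80 by default; strict defaults to False and controls whether a BadStatusLine exception is raised when the status line returned by the server cannot be parsed (a typical status line is HTTP/1.0 200 OK); the optional timeout sets the timeout.

HTTPConnection.request(method, url[, body[, headers]])

Calling request sends one request to the server. method is the request method, most commonly GET or POST; url is the URL of the requested resource; body is the data submitted to the server and must be a string (if method is POST, body can be thought of as the data of an HTML form); headers holds the HTTP headers of the request.

import httplib

conn = httplib.HTTPConnection("www.g.com", 80, False)
conn.request('GET', '/', headers={
    "Host": "www.google.com",
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1) Gecko/20090624 Firefox/3.5",
    "Accept": "text/plain"})
res = conn.getresponse()
print 'version:', res.version
print 'reason:', res.reason
print 'status:', res.status
print 'msg:', res.msg
print 'headers:', res.getheaders()
# html
# print '\n' + '-' * 50 + '\n'
# print res.read()
conn.close()

The httplib module also defines many constants, such as httplib.HTTP_PORT (80), httplib.HTTPS_PORT (443), httplib.OK (200), and httplib.NOT_FOUND (404).

A basic request with urllib2:

import urllib2

req = urllib2.Request('http://pythoneye.com')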
response = urllib2.urlopen(req)
the_page = response.read()

The same works for FTP:

req = urllib2.Request('ftp://pythoneye.com')

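The response object returned by urlopen has two very useful methods: info(), which returns the headers of the response, and geturl(), which returns the URL that was actually retrieved (it may differ from the requested one after a redirect).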
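To try info() and geturl() without a network connection, the sketch below opens a local temporary file through a file:// URL. That choice is an assumption made only so the example runs offline; with an http:// URL the same two methods return the server's headers and the final URL. The import fallback assumes Python 3 exposes urlopen in urllib.request.

```python
import os
import tempfile

try:
    from urllib2 import urlopen           # Python 2
except ImportError:
    from urllib.request import urlopen    # Python 3

# Write a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")

response = urlopen("file://" + path)
try:
    final_url = response.geturl()    # URL the data was actually retrieved from
    headers = response.info()        # header metadata (a mapping-like object)
    content_length = headers["Content-length"]
    body = response.read()
finally:
    response.close()
    os.remove(path)

print(final_url)
print(content_length)
```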
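
import urllib
import urllib2

url = 'http://pythoneye.com/example.cgi'  # any target URL; this one is reused from the GET example
values = {'body': 'test short talk', 'via': 'xxxx'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)  # supplying data makes this a POST request
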
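Whether a Request is sent as POST or GET depends only on whether data is supplied, which can be checked offline with get_method(). This is a hedged sketch: the URL is borrowed from the GET example in this article, and the .encode("ascii") step is an assumption added because Python 3 expects the body as bytes.

```python
try:
    from urllib import urlencode                 # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode           # Python 3
    from urllib.request import Request

url = "http://pythoneye.com/example.cgi"
values = {"body": "test short talk", "via": "xxxx"}

# Python 3 requires the request body to be bytes; encoding is harmless on Python 2.
data = urlencode(values).encode("ascii")

post_req = Request(url, data)   # data supplied -> POST
get_req = Request(url)          # no data       -> GET

print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
```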
GET request:

data = {}
data['name'] = 'Somebody Here'
data['location'] = 'Northampton'
data['language'] = 'Python'
url_values = urllib.urlencode(data)
url = 'http://pythoneye.com/example.cgi'
full_url = url + '?' + url_values
data = urllib2.urlopen(full_url)  # urllib2 has no open(); the function is urlopen()

Using Basic HTTP Authentication:

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://pythoneye.com/vecrty.py',
                          user='user',
                          passwd='pass')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.pythoneye.com/app.html')

Using a proxy with ProxyHandler:

proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
# ProxyBasicAuthHandler answers the proxy's own authentication challenge
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')

URLError and HTTPError:

from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print 'Error code: ', e.code
except URLError as e:
    print 'Reason: ', e.reason
else:
    .............

Or:

from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'Error code: ', e.code
else:
    .............

Typically, URLError is raised when there is no network connection (no route to the specified server) or when the server does not exist.

import urllib2

req = urllib2.Request('http://pythoneye.com')
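try:
    urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Only HTTPError carries a status code and a readable response body.
    print e.code
    print e.read()
except urllib2.URLError as e:
    print e.reason

Finally, note that when handling both URLError and HTTPError, HTTPError must be handled first and URLError second, because HTTPError is a subclass of URLError and would otherwise be caught by the URLError clause.

class HTTPHandler(AbstractHTTPHandler):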
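    def http_open(self, req):
        return self.do_open(httplib.HTTPConnection, req)

    http_request = AbstractHTTPHandler.do_request_

HTTPHandler is one of the default handlers used by openers. This code confirms that urllib2 is implemented on top of httplib, and it also illustrates the relationship between openers and handlers.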
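Two of the claims above can be checked offline: that HTTPError derives from URLError (which is why it has to be caught first), and that an opener built with default settings includes an HTTPHandler. The snippet is a sketch under the assumption that Python 3 exposes the same names in urllib.request and urllib.error; the alias request_mod is used here only for illustration.

```python
try:
    import urllib2 as request_mod                      # Python 2
    from urllib2 import HTTPError, URLError
except ImportError:
    import urllib.request as request_mod               # Python 3
    from urllib.error import HTTPError, URLError

# HTTPError derives from URLError, so an "except URLError" clause placed first
# would also catch HTTPError and hide the more specific handler.
is_subclass = issubclass(HTTPError, URLError)
print(is_subclass)

# build_opener() with no arguments installs the default handlers.
opener = request_mod.build_opener()
handler_names = sorted(h.__class__.__name__ for h in opener.handlers)
print(handler_names)

has_http_handler = any(
    isinstance(h, request_mod.HTTPHandler) for h in opener.handlers
)
print(has_http_handler)
```

The printed handler list varies between Python versions, so it is best read as a way to inspect the opener rather than as a fixed result.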