Fixing garbled text when fetching web pages with urllib2.urlopen in Python 2.7.3

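The garbled text appears because the web server has a bug: it always sends the page in one fixed encoding, instead of honoring the encoding requested in the client's request headers.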
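To see how such a mismatch produces garbled text, here is a minimal sketch that needs no network access. It assumes the server sent GBK bytes while the client expected UTF-8; the sample string and the choice of these two encodings are for illustration only.

```python
# -*- coding: utf-8 -*-
# Illustration only: GBK and UTF-8 are assumed here as the mismatched pair.
text = u"你好"

# What a server that always answers in GBK would put on the wire.
raw = text.encode("gbk")

# Decoding with the encoding the server actually used recovers the text.
recovered = raw.decode("gbk")

# Decoding with the encoding the client expected fails outright...
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# ...or, if bad bytes are replaced instead of raising, yields garbled characters.
garbled = raw.decode("utf-8", "replace")

print(recovered == text)
print(decoded_ok)
print(garbled == text)
```

The bytes themselves are intact in both cases; only the choice of decoder differs, which is why guessing the right encoding is enough to fix the output.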

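Solution: use chardet to guess the encoding of the page.

1. Download the chardet source package from the official chardet site.

2. Extract the chardet directory from the source package into your project folder.

3. Reference it with import chardet, then: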

# -*- coding: utf-8 -*-
import sys
import urllib2

import chardet


def fetch_html(url):
    # Wrapper added so the early "return" statements are valid;
    # the function name is illustrative.
    # Try to download the page
    try:
        response = urllib2.urlopen(url)
    except Exception as e:
        print "Error: failed to download the page: " + str(e)
        return None

    if response.code != 200:
        print "Error: expected status code [200] but got [" + str(response.code) + "]"
        return None

    if response.msg != "OK":
        print "Error: expected status message [OK] but got [" + response.msg + "]"
        return None

    # Read the HTML
    try:
        htmlCode = response.read()
    except Exception as e:
        print "Error: failed to read the page from the response stream: " + str(e)
        return None

    # Handle the page encoding
    try:
        # Guess the encoding
        htmlCharsetGuess = chardet.detect(htmlCode)
        htmlCharsetEncoding = htmlCharsetGuess["encoding"]
        if htmlCharsetEncoding is None:
            print "Error: chardet could not guess the page encoding"
            return None
        # Decode to unicode
        htmlCode_decode = htmlCode.decode(htmlCharsetEncoding)
        # Get the system encoding
        currentSystemEncoding = sys.getfilesystemencoding()
        # Re-encode with the system encoding so the result can be mixed with
        # ordinary byte strings in Python 2, for example:
        #     key = "你好"
        #     text = "xxxx你好yyyy"
        #     keyPos = text.find(key)
        # Without the re-encoding, such a step may raise an error.
        htmlCode_encode = htmlCode_decode.encode(currentSystemEncoding)
    except Exception as e:
        print "Error: failed while handling the page encoding: " + str(e)
        return None

    # htmlCode_encode is the result
    return htmlCode_encode


htmlCode = fetch_html("http://www.baidu.com")
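chardet.detect can report None as the encoding when it cannot make a guess, and a guess can also be wrong. Below is a minimal sketch of one way to guard against that: try the charset declared in the Content-Type header first, then the guessed encoding, then a short list of fallbacks. The helper names charset_from_content_type and decode_html, and the fallback list, are hypothetical; they are not part of the original post or of chardet.

```python
# -*- coding: utf-8 -*-
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_html(raw, declared=None, guessed=None, fallbacks=("utf-8", "gbk")):
    """Decode raw bytes, trying the candidate encodings in order.

    Returns a (text, encoding) pair. The fallback list is an assumption
    and should be adapted to the sites being fetched.
    """
    candidates = [declared, guessed] + list(fallbacks)
    for encoding in candidates:
        if not encoding:
            continue  # e.g. no header charset, or the detector reported None
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate
    # Last resort: do not raise, but mark undecodable bytes.
    return raw.decode("utf-8", "replace"), "utf-8"


# Usage with sample data instead of a live response.
sample = u"你好".encode("gbk")
declared = charset_from_content_type("text/html; charset=GBK")
text, used = decode_html(sample, declared=declared)
print(used)
```

In real use, the header value would come from the response headers and the guessed argument from chardet.detect(...)["encoding"]. Trying the declared charset first is a design choice, not a rule: a server that mislabels its pages would call for putting the guess first.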

Posted on 2014-01-14 14:41 by xxxteam
