Detecting a web page's encoding when downloading with urllib2

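import urllib2

fp = urllib2.urlopen(request)  # request: a URL string or a urllib2.Request object
charset = fp.headers.getparam('charset')  # charset from the Content-Type header, or None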

For Chinese web pages, the charset values you are likely to see include UTF-8 and GB2312.
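Once a charset value is known, it can be used to decode the downloaded bytes. The following is a minimal sketch, not part of the original snippet: `decode_page` and its `default` parameter are hypothetical names, and mapping GB2312/GBK labels to GB18030 (a superset of both) is an assumption that tends to be safer for pages that are labelled GB2312 but contain characters outside it.

```python
import codecs


def decode_page(raw, charset, default='utf-8'):
    """Decode raw page bytes using the reported charset (hypothetical helper)."""
    name = (charset or default).lower()
    # Assumption: treat GB2312/GBK labels as GB18030, which is a superset of both.
    if name in ('gb2312', 'gbk'):
        name = 'gb18030'
    try:
        codecs.lookup(name)  # raises LookupError for unknown codec names
    except LookupError:
        name = default
    # 'replace' substitutes undecodable bytes instead of raising an exception.
    return raw.decode(name, 'replace')
```

With the snippet above, this would be called as `decode_page(fp.read(), charset)`.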

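urllib2 has a limitation here, though. UTF-8 pages are detected correctly, but for some GB2312 pages, such as http://news.sina.com.cn, getparam returns None instead of the charset. getparam only reads the HTTP Content-Type header, so the likely cause is that the server omits the charset parameter and the page declares its encoding only in an HTML <meta> tag. Keep this in mind.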
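One way to handle the None case is to fall back to the encoding declared inside the HTML itself. The sketch below is an illustration, not the original author's method: `sniff_meta_charset`, `guess_charset` and the 2048-byte `limit` are hypothetical names and values, and the regular expression only covers the common `<meta ... charset=...>` forms.

```python
import re

# Matches both <meta charset="gb2312"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([a-zA-Z0-9_\-]+)', re.IGNORECASE)


def sniff_meta_charset(body, limit=2048):
    """Return the charset declared in an HTML <meta> tag, or None.

    Only the first `limit` bytes are searched (an assumption: the
    declaration normally sits near the top of <head>).
    """
    match = META_CHARSET_RE.search(body[:limit])
    if match:
        return match.group(1).decode('ascii')
    return None


def guess_charset(header_charset, body):
    """Prefer the HTTP header value; fall back to the <meta> declaration."""
    return header_charset or sniff_meta_charset(body)
```

With urllib2 this would be used as `guess_charset(fp.headers.getparam('charset'), fp.read())`. If neither source declares a charset, the result is still None and some other strategy is needed.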

--------------------------------------------------------------------------------

Another approach is to use chardet (http://chardet.feedparser.org/), which guesses the encoding from the page content itself.

However, chardet has performance problems: detection can be slow.
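A common way to reduce that cost is to pass chardet only the beginning of the page rather than the whole body. The sketch below assumes the third-party chardet package is installed. `detect_encoding`, `sample_size` and the injectable `detect` parameter are hypothetical names introduced here for illustration. Sampling is a trade-off: a smaller sample is faster but may be less accurate, and the cut may fall in the middle of a multi-byte character.

```python
def detect_encoding(data, sample_size=4096, detect=None):
    """Guess the encoding of raw bytes from a leading sample (hypothetical helper).

    `detect` defaults to chardet.detect; it can be replaced with any callable
    that takes bytes and returns a dict containing an 'encoding' key.
    """
    if detect is None:
        import chardet  # third-party package, imported only when needed
        detect = chardet.detect
    # Only the first sample_size bytes are analysed, to limit detection time.
    result = detect(data[:sample_size])
    return result.get('encoding')
```

In practice this works best as a last resort, after the HTTP header has been checked, so the slower detection only runs on pages that do not declare their encoding.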

posted on 2011-07-06 20:28 by 夏日微风
