Recently I found that a Python script I use to check Yahoo rankings had stopped working. Debugging it turned up this message:
WARNING:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
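A search on Stack Overflow showed the cause: the server had enabled gzip compression, so the page content we fetch has to be decompressed with gzip.GzipFile before it can be used.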
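To make the point concrete, here is a minimal sketch of decompressing gzip data held in memory. It is written for Python 3 (io.BytesIO takes the place of StringIO), the helper name gunzip_bytes and the sample markup are made up for illustration, and gzip.compress stands in for the server so that no network access is needed.

```python
import gzip
import io

def gunzip_bytes(raw):
    """Decompress a gzip-compressed byte string held in memory."""
    # GzipFile reads from a file-like object, so wrap the bytes in BytesIO.
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
        return f.read()

# Hypothetical sample standing in for a fetched page.
html = "<html><body>ranking</body></html>".encode("utf-8")
body = gzip.compress(html)          # roughly what a gzip-enabled server sends

print(body[:2] == b"\x1f\x8b")      # gzip streams start with these two bytes
print(gunzip_bytes(body) == html)   # the original markup comes back intact
```

Handing the compressed bytes straight to an HTML parser is what leads to the decoding warning: the parser sees binary data rather than markup.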
To check whether a response is gzip-compressed, all you need is:
    page = urllib2.urlopen(req)
    print page.info().get('Content-Encoding')
If this prints 'gzip' (the value is case-insensitive, so 'Gzip' means the same thing), the page has been gzip-compressed.
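Because the header value can arrive in any letter case, or be missing altogether, it may be safer to normalize it before comparing. The sketch below assumes Python 3, where the object returned by page.info() behaves like an email.message.Message. The helper names content_encoding and is_compressed are made up for this example, and the headers are built by hand instead of being fetched.

```python
from email.message import Message

def content_encoding(headers):
    """Return the Content-Encoding value in lowercase, or '' if absent."""
    return (headers.get("Content-Encoding") or "").strip().lower()

def is_compressed(headers):
    """True if the body needs gzip or deflate decompression."""
    return content_encoding(headers) in ("gzip", "x-gzip", "deflate")

# Hand-built headers standing in for page.info().
headers = Message()
headers["Content-Encoding"] = "Gzip"

print(content_encoding(headers))
print(is_compressed(headers))
```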
Here is the fix, along with a general-purpose decompression routine:
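    import gzip
    import zlib
    import urllib2
    import StringIO
    from bs4 import BeautifulSoup  # assumed import; the original does not show it

    def decode(page):
        """Return the body of a urllib2 response, decompressed if necessary."""
        encoding = (page.info().get('Content-Encoding') or '').lower()
        content = page.read()
        if encoding == 'deflate':
            return zlib.decompress(content)
        if encoding in ('gzip', 'x-gzip'):
            # GzipFile needs a file-like object, so wrap the bytes in StringIO
            return gzip.GzipFile(fileobj=StringIO.StringIO(content)).read()
        return content  # not compressed: return the body unchanged

    # call run --
    if __name__ == "__main__":
        req = urllib2.Request('http://example.com/')  # placeholder; use your own request
        response = urllib2.urlopen(req)
        content = decode(response)  # decompress gzip/deflate if needed
        response.close()            # close the connection to avoid leaking resources
        content = BeautifulSoup(content)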
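For readers on Python 3, where urllib2 and StringIO no longer exist, the same idea can be sketched as follows. This is an illustration, not a drop-in replacement: the function name decode_body is made up, it takes the raw bytes and the header value as plain arguments so it can be tried without a network connection, and the fallback to raw deflate is an addition based on the fact that some servers send deflate data without the zlib wrapper.

```python
import gzip
import io
import zlib

def decode_body(raw, encoding):
    """Decompress raw bytes according to a Content-Encoding value."""
    encoding = (encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)                   # zlib-wrapped stream
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw  # unknown or absent encoding: leave the body alone

# Simulated response bodies (the sample text is arbitrary).
text = b"hello yahoo ranking"
raw_deflater = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
raw_deflate = raw_deflater.compress(text) + raw_deflater.flush()

print(decode_body(gzip.compress(text), "gzip") == text)
print(decode_body(zlib.compress(text), "deflate") == text)
print(decode_body(raw_deflate, "deflate") == text)
print(decode_body(text, None) == text)
```

With a real request, the call would look something like decode_body(response.read(), response.headers.get("Content-Encoding")), where response is the object returned by urllib.request.urlopen.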