requests: a 'Content-Type': 'text/html' header makes encoding default to 'ISO-8859-1', so Chinese text is decoded incorrectly

When requests fetches baidu without a User-Agent set, r.headers['Content-Type'] is text/html. With a Chrome User-Agent, the response carries Content-Type: text/html; charset=utf-8.

1. References

What is ISO-8859-1? It is also called Latin-1, or the "Western European" encoding.

Patch:

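import requests
def monkey_patch():
    # Keep a handle on the original Response.content property.
    prop = requests.models.Response.content
    def content(self):
        _content = prop.fget(self)
        # Only step in when requests fell back to ISO-8859-1.
        if self.encoding == 'ISO-8859-1':
            # First look for a charset declared inside the body (<meta ... charset=...>).
            encodings = requests.utils.get_encodings_from_content(_content)
            if encodings:
                self.encoding = encodings[0]
            else:
                # Otherwise let chardet guess from the bytes.
                self.encoding = self.apparent_encoding
            # Re-encode the body as UTF-8 and cache it on the response.
            _content = _content.decode(self.encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
        return _content
    # Replace the property on the class so every Response is affected.
    requests.models.Response.content = property(content)
monkey_patch()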

2. Cause

In [291]: r = requests.get('http://cn.python-requests.org/en/latest/')

In [292]: r.headers.get('content-type')
Out[292]: 'text/html; charset=utf-8'

In [293]: r.encoding
Out[293]: 'utf-8'


In [294]: rc = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')

In [296]: rc.headers.get('content-type')
Out[296]: 'text/html'

In [298]: rc.encoding
Out[298]: 'ISO-8859-1'

The response text is garbled:

In [312]: rc.text
Out[312]: u'\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="ut
f-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</tit
le>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <l
ink rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright"
title="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title

In [313]: rc.content
Out[313]: '\n\n<!DOCTYPE html>\n<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->\n<head>\n  <meta charset="utf
-8">\n  \n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  \n  <title>Python Cookbook 3rd Edition Documentation &mdash; python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3</titl
e>\n  \n\n  \n  \n  \n  \n\n  \n\n  \n  \n    \n\n  \n\n  \n  \n\n  \n    <link rel="stylesheet" href="https://media.readthedocs.org/css/sphinx_rtd_theme.css" type="text/css" />\n  \n\n  \n        <li
nk rel="index" title="\xe7\xb4\xa2\xe5\xbc\x95"\n              href="genindex.html"/>\n        <link rel="search" title="\xe6\x90\x9c\xe7\xb4\xa2" href="search.html"/>\n        <link rel="copyright" t
itle="\xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89" href="copyright.html"/>\n    <link rel="top" title="python3-cookbook 2.0.0 \xe6\x96\x87\xe6\xa1\xa3" href="#"/>\n        <link rel="next" title=
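
The two dumps above look almost identical, which hides the problem: rc.content holds the raw UTF-8 bytes, while rc.text holds the same byte values reinterpreted one at a time as Latin-1 characters. A minimal sketch of that effect, using only the bytes \xe6\x96\x87\xe6\xa1\xa3 from the title above (the variable names are my own, and no request is made):

```python
# Bytes copied from the <title> in the output above.
raw = b'\xe6\x96\x87\xe6\xa1\xa3'

# What happens when ISO-8859-1 is assumed: every byte becomes one character.
wrong = raw.decode('ISO-8859-1')

# What the page actually meant.
right = raw.decode('utf-8')

# ISO-8859-1 maps every byte value to a code point, so the mistake is reversible.
recovered = wrong.encode('ISO-8859-1').decode('utf-8')

print(len(wrong), len(right))  # 6 2
print(recovered == right)      # True
```

Because the round trip is lossless, a wrongly decoded rc.text can still be repaired after the fact, although setting the right encoding before reading rc.text is simpler.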

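Three conditions together cause the 'ISO-8859-1' result: the response headers contain 'content-type', that header carries no charset, and its value contains 'text'.

The reference article says Python 3 is not affected, but my own test shows that it is.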

C:\Program Files\Anaconda2\Lib\site-packages\requests\utils.py

Update 2018-01-02: with "Content-Type": "application/json", r.encoding is None.

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
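
The rule can be tried out without touching the network. The following is a rough standard-library sketch of the same three checks, not the requests implementation: guess_encoding_from_headers is a name I made up, and email.message.Message stands in for cgi.parse_header because the cgi module was removed in Python 3.13.

```python
from email.message import Message


def guess_encoding_from_headers(headers):
    """Mimic the header rule shown above for a plain dict of headers.

    Note: a plain dict is case-sensitive, unlike the headers object in requests.
    """
    content_type = headers.get('content-type')
    if not content_type:
        return None

    # Let the email parser split "type/subtype; key=value" for us.
    msg = Message()
    msg['content-type'] = content_type

    charset = msg.get_param('charset')
    if isinstance(charset, str):
        return charset.strip("'\"")

    # No charset, but a text/* type: fall back to ISO-8859-1.
    if 'text' in msg.get_content_type():
        return 'ISO-8859-1'
    return None


for value in ('text/html; charset=utf-8', 'text/html', 'application/json'):
    print(value, '->', guess_encoding_from_headers({'content-type': value}))
```

The three sample values correspond to the three cases seen in this post: a declared charset, the bare text/html that triggers the fallback, and application/json, which yields None.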

C:\Program Files\Anaconda2\Lib\site-packages\requests\adapters.py

class HTTPAdapter(BaseAdapter):
    def build_response(self, req, resp):
        # ... (excerpt; `response` is the Response object being built)
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)

3. Solution

Apply the patch from the reference article, or:

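Update 2018-01-02: change the condition if resp.encoding == 'ISO-8859-1': to if resp.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in resp.headers.get('content-type', ''): so that only the 'ISO-8859-1' that requests falls back to under the protocol default is handled, and not one the server actually declared.

import requests

resp = requests.get('http://python3-cookbook.readthedocs.io/zh_CN/latest/index.html')
# Skip responses whose Content-Type really declares ISO-8859-1.
if resp.encoding == 'ISO-8859-1' and 'ISO-8859-1' not in resp.headers.get('content-type', ''):
    # Regex search for <meta ... charset=...>; requests itself does not call this helper.
    # resp.text is passed because the helper needs a str on Python 3.
    encodings = requests.utils.get_encodings_from_content(resp.text)
    if encodings:
        resp.encoding = encodings[0]
    else:
        # models.py: chardet.detect(self.content)['encoding'], which is expensive.
        # resp.text only falls back to apparent_encoding by itself when self.encoding is None.
        resp.encoding = resp.apparent_encoding
    print('ISO-8859-1 changed to %s' % resp.encoding)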
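
The decision logic above can also be sketched without requests, which makes it easy to test offline. This is only an illustration under my own assumptions: pick_encoding and META_CHARSET are hypothetical names, the regex is a simplified stand-in for the one used by get_encodings_from_content, and the header comparison is made case-insensitive, which the snippet above does not do.

```python
import re

# Simplified pattern for <meta charset="..."> and <meta ... content="...; charset=...">.
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def pick_encoding(header_encoding, content_type, body):
    """Return the encoding to decode `body` with, or None if detection is needed."""
    # Anything other than the ISO-8859-1 fallback is trusted as-is.
    if header_encoding != 'ISO-8859-1':
        return header_encoding
    # The server explicitly declared ISO-8859-1, so respect it.
    if 'ISO-8859-1' in content_type.upper():
        return header_encoding
    # Otherwise prefer the charset declared inside the document.
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode('ascii')
    # Nothing declared: the caller should fall back to detection (e.g. apparent_encoding).
    return None


body = b'<html><head><meta charset="utf-8"><title>\xe6\x96\x87\xe6\xa1\xa3</title></head></html>'
encoding = pick_encoding('ISO-8859-1', 'text/html', body)
print(encoding)  # utf-8
```

Returning None instead of guessing keeps the expensive chardet step optional and leaves that choice to the caller.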

posted @ 2017-10-26 16:22 my8100
