Encoding and decoding in web scrapers

1. The symptom:

Traceback (most recent call last):
  File "E:\spiders\caipiao.py", line 37, in <module>
    print(response.content.decode('gbk', errors='strict'))
UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2708: illegal multibyte sequence

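2. The cause: the codec cannot handle a few special byte sequences. Even when the encoding you chose is the right one and decodes most of the page, a handful of bytes may not form valid characters in it. Errors on those bytes should therefore be ignored, or the bytes replaced with a placeholder character. The following example shows the situation: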

import requests
import chardet


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en-AS;q=0.7,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

response = requests.get('https://www.cjcp.cn/kaijiang/', headers=headers)
print(response.text)

1. The response.text attribute decodes the page directly, without any error.

2. Decoding explicitly with the encoding that requests guesses from the content (response.apparent_encoding) still fails:

print(response.content.decode(response.apparent_encoding))

3. Running chardet ourselves and decoding with the detected encoding fails as well:

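# Guess the encoding from the raw bytes
detected_encoding = chardet.detect(response.content)['encoding']

# Print the guessed encoding
print(detected_encoding)

# Decode with the guessed encoding (errors defaults to 'strict')
decoded_content = response.content.decode(detected_encoding)

# Print the decoded page content
print(decoded_content)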
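
The failure does not depend on this particular site. As a minimal offline sketch, the snippet below builds a short GBK byte string by hand, slips one stray byte (0xff, chosen here only for illustration) into the middle, and shows that strict decoding rejects it even though GBK is the right encoding for everything else. The sample text and the helper name decodes_strictly are assumptions made for this example, not part of the original script.

```python
# Valid GBK bytes for a short Chinese string (sample text is an assumption).
good = '开奖结果'.encode('gbk')

# Insert one stray byte between the second and third characters.
# 0xff is not a valid GBK lead byte, so it cannot start a character.
broken = good[:4] + b'\xff' + good[4:]


def decodes_strictly(data, encoding):
    """Return True if data decodes under the default 'strict' policy."""
    try:
        data.decode(encoding)  # errors defaults to 'strict'
        return True
    except UnicodeDecodeError:
        return False


print(decodes_strictly(good, 'gbk'))    # clean bytes
print(decodes_strictly(broken, 'gbk'))  # same encoding, one bad byte
```

A detector can be right about the encoding and strict decoding can still fail, because detection looks at the page as a whole while strict decoding rejects every single invalid byte.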

2. Analysis of the cause:

1. First question: all three approaches actually decode with the same encoding, so why do the last two fail?

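2. Look at the source of the first approach, the text property in requests. The key line is content = str(self.content, encoding, errors='replace'). When it converts the bytes to a string it uses the 'replace' strategy, which substitutes the Unicode replacement character U+FFFD (not a blank) for every byte it cannot decode, so decoding never raises.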

    @property
    def text(self):
        """Content of the response, in unicode.

        If Response.encoding is None, encoding will be guessed using
        ``charset_normalizer`` or ``chardet``.

        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return str('')

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content
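
The decoding logic in that property can be condensed into a few lines. The sketch below is a simplified stand-in, not the requests implementation itself: decode_like_text is a hypothetical helper name and the bytes are hand-built. It shows that str(data, encoding, errors='replace') does not raise on bad bytes, and that the except branch falls back to the default encoding of str() (UTF-8) when the encoding name is missing or unknown.

```python
def decode_like_text(content, encoding):
    """Mimic the core of Response.text: decode with errors='replace'."""
    if not content:
        return ''
    try:
        return str(content, encoding, errors='replace')
    except (LookupError, TypeError):
        # Unknown codec name (LookupError) or encoding=None (TypeError):
        # fall back to str()'s default encoding, which is UTF-8.
        return str(content, errors='replace')


data = 'ab'.encode('gbk') + b'\xff' + 'cd'.encode('gbk')

# ascii() keeps the output printable on any console.
print(ascii(decode_like_text(data, 'gbk')))
print(ascii(decode_like_text(data, None)))
print(ascii(decode_like_text(data, 'no-such-codec')))
```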

3. The fix: follow the same pattern. Either replace the undecodable bytes with U+FFFD or ignore them; both decode without raising:

print(response.content.decode('gbk', errors='replace'))
print(response.content.decode('gbk', errors='ignore'))
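
To see how the two policies differ, and how much of the page was affected, a small sketch on hand-built bytes may help. The helper decode_lenient and the sample text are assumptions for illustration. It tries strict decoding first and only falls back to 'replace' when that fails, returning the number of U+FFFD characters in the result as a rough damage count.

```python
def decode_lenient(data, encoding):
    """Decode strictly if possible; otherwise replace bad bytes.

    Returns (text, replaced), where replaced counts U+FFFD characters
    in the fallback result (0 when strict decoding succeeded).
    """
    try:
        return data.decode(encoding), 0
    except UnicodeDecodeError:
        text = data.decode(encoding, errors='replace')
        return text, text.count('\ufffd')


raw = '中'.encode('gbk') + b'\xff' + '文'.encode('gbk')

text, replaced = decode_lenient(raw, 'gbk')
print(ascii(text), replaced)

# 'ignore' drops the bad byte silently, leaving no trace in the text.
print(ascii(raw.decode('gbk', errors='ignore')))
```

With 'replace' the damage stays visible in the output, while 'ignore' removes it and can silently join text that was originally separated by the bad bytes.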

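4. As an aside, the last two approaches fail because bytes.decode() defaults to strict mode, which raises an error as soon as decoding fails. They are equivalent to: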

print(response.content.decode('gbk', errors='strict'))
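
When strict decoding fails, the exception itself says where the problem is. The sketch below, with a hypothetical helper first_bad_byte and hand-built bytes, reads the start, end and reason attributes of UnicodeDecodeError to report the offset and the offending byte, the same information that appears in the traceback at the top.

```python
def first_bad_byte(data, encoding):
    """Return (offset, bad_bytes, reason) for the first decode error, or None."""
    try:
        data.decode(encoding)  # same as errors='strict'
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start:exc.end], exc.reason
    return None


raw = '中'.encode('gbk') + b'\xff' + '文'.encode('gbk')

print(first_bad_byte(raw, 'gbk'))                  # stray byte at offset 2
print(first_bad_byte('中文'.encode('gbk'), 'gbk'))  # clean input
```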

5. The available strategies

In Python, the errors parameter of .decode() specifies how to handle errors encountered during decoding. The common error-handling strategies are:

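'strict': This is the default. Any decoding error raises UnicodeDecodeError.

'replace': As described above, replaces each undecodable byte with a placeholder, the Unicode replacement character U+FFFD (�).

'ignore': Drops the undecodable bytes, so they do not appear in the decoded string.

'xmlcharrefreplace': Applies to encoding only. It replaces a character that cannot be encoded with an XML character reference (for example, &#nnnn;). Passing it to .decode() raises TypeError when an undecodable byte is hit.

'backslashreplace': Replaces each undecodable byte with a backslash escape sequence (for example, \xhh).

'namereplace': Applies to encoding only. It replaces a character that cannot be encoded with a \N{...} escape containing its Unicode name. Passing it to .decode() raises TypeError when an undecodable byte is hit.

'surrogateescape': Replaces each undecodable byte with a single lone surrogate code point in the range U+DC80 to U+DCFF (not a surrogate pair). Encoding with the same handler turns these code points back into the original bytes.

'leftstrip' and 'strip': These are not error handlers that Python provides. Passing either name to .decode() raises LookupError (unknown error handler name) once an undecodable byte is encountered.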
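
The behaviour of each name can be checked directly. The sketch below runs a hand-built GBK byte string containing one stray 0xff byte through several handler names and records either the decoded text or the type of exception raised. The helper try_handler and the sample bytes are assumptions for illustration.

```python
sample = '中'.encode('gbk') + b'\xff' + '文'.encode('gbk')

HANDLERS = [
    'strict', 'replace', 'ignore', 'backslashreplace',
    'surrogateescape', 'xmlcharrefreplace', 'namereplace', 'strip',
]


def try_handler(data, handler):
    """Decode data as GBK with the given handler; return text or error name."""
    try:
        return data.decode('gbk', errors=handler)
    except (UnicodeDecodeError, TypeError, LookupError) as exc:
        return type(exc).__name__


results = {name: try_handler(sample, name) for name in HANDLERS}

for name, result in results.items():
    # ascii() escapes non-ASCII characters so the output prints anywhere.
    print(f'{name:18} {ascii(result)}')
```

Handlers that only make sense for encoding, and names that are not registered at all, show up here as exception names rather than decoded text.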

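Choose among these strategies according to the scenario, so that the data is handled and decoded the way you need. For example, if you want to make sure the program is not interrupted by decoding errors, 'replace' or 'ignore' is a reasonable choice. If you need to keep all of the data and are willing to handle the resulting exceptions, 'strict' may be the better option.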
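
If the goal is to lose nothing without stopping on the first error, one option worth considering is 'surrogateescape', which keeps each undecodable byte as a lone surrogate so the original bytes can be rebuilt later. A minimal sketch on hand-built bytes (the sample is an assumption):

```python
raw = '中'.encode('gbk') + b'\xff' + '文'.encode('gbk')

# Decode without raising; the stray byte 0xff becomes U+DCFF.
text = raw.decode('gbk', errors='surrogateescape')
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode('gbk', errors='surrogateescape')
print(restored == raw)
```

A string that holds lone surrogates cannot be encoded with the default strict handler, so it should be cleaned up or encoded with 'surrogateescape' again before it is written out.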

posted @ 2024-12-20 14:27 阿布_alone
