Fixing garbled Chinese text in Python

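Encoding problems come up often when Python processes text.

Chinese text is most commonly encoded as either utf8 or gbk, so parsing a txt file or a byte string of unknown origin frequently runs into encoding problems.

For a line of text, we try decoding it as utf8 and then as gbk. If either one succeeds, we use that result. If neither does, we pick whichever encoding decodes more of the input before it fails.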
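As a small illustration of why the encoding matters, the sketch below encodes the same two characters both ways and then tries to read the gbk bytes back with each codec. The helper name try_decode is made up for this example and is not part of the original code.

```python
text = "中文"

# The same text produces different byte sequences under each encoding.
utf8_bytes = text.encode("utf8")
gbk_bytes = text.encode("gbk")
print(len(utf8_bytes), len(gbk_bytes))  # 6 4


def try_decode(data: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the matching codec recovers the text;
# decoding gbk bytes as utf8 raises UnicodeDecodeError.
print(try_decode(gbk_bytes, "gbk"))
print(try_decode(gbk_bytes, "utf8"))
```

Decoding with the wrong codec does not always raise an error. Sometimes it silently produces the wrong characters, which is the familiar garbled text.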

def force_decode(string: bytes) -> str:
    """
    Decode bytes that may be encoded as utf8 or gbk.

    Sometimes neither utf8 nor gbk can decode the input successfully.
    In that case, select whichever encoding decodes the longer prefix
    and drop the bytes it cannot decode.
    """
    if not isinstance(string, bytes):
        raise TypeError('expected a bytes object')
    decode_chars_count = []
    for encoding in ['utf8', 'gbk']:
        try:
            # return immediately if the whole input decodes cleanly
            return string.decode(encoding)
        except UnicodeDecodeError as ex:
            # ex.start is the byte offset of the first undecodable byte
            decode_chars_count.append(ex.start)
    # neither utf8 nor gbk decoded successfully;
    # select the encoding that got further before failing
    utf8_len, gbk_len = decode_chars_count
    selected_encoding = 'utf8' if utf8_len > gbk_len else 'gbk'
    return string.decode(selected_encoding, errors='ignore')
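The fallback branch relies on UnicodeDecodeError.start, the byte offset at which decoding failed. The sketch below isolates that idea on a deliberately damaged input; the helper valid_prefix_length and the sample bytes are assumptions made for this illustration only.

```python
def valid_prefix_length(data: bytes, encoding: str) -> int:
    """Hypothetical helper: number of leading bytes that decode cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as ex:
        # ex.start is the offset of the first byte that could not be decoded
        return ex.start
    return len(data)


# gbk-encoded text followed by a byte (0xff) that is invalid in both
# encodings, then some ASCII. Neither codec can decode the whole input.
sample = "中文".encode("gbk") + b"\xff" + b"ok"

utf8_prefix = valid_prefix_length(sample, "utf8")
gbk_prefix = valid_prefix_length(sample, "gbk")
print(utf8_prefix, gbk_prefix)

# gbk gets further before failing, so it is the better guess here.
best = "utf8" if utf8_prefix > gbk_prefix else "gbk"
print(sample.decode(best, errors="ignore"))
```

Note that errors='ignore' silently drops the undecodable bytes, so the result can be shorter than the input. When a visible marker is preferable, errors='replace' is an alternative.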

Code link: https://gist.github.com/albertofwb/b53bf32adca5c245c6dee6642ca5463d

posted @ 2020-06-24 16:46 SurfUniverse
