Reading a Unicode-encoded file in Python

References:

https://blog.csdn.net/csdn_yi_e/article/details/71037288

https://blog.csdn.net/qq_42739440/article/details/89887451

1. Detecting the encoding with chardet

import chardet

# Read the raw bytes; chardet works on bytes, not on decoded text.
with open('a.txt', 'rb') as f:
    raw = f.read()
info = chardet.detect(raw)
print(info)

{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
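A confidence of 1.0 for UTF-16 most likely means the file starts with a byte order mark (BOM). As a rough standard-library sketch of that one check (this is not chardet's algorithm, and `sniff_bom` is a hypothetical helper name), the leading bytes can be compared against the BOM constants in `codecs`:

```python
import codecs


def sniff_bom(data):
    """Return an encoding name implied by a leading BOM, or None."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


# Python's "utf-16" codec prepends a BOM when encoding.
sample = "1.abc".encode("utf-16")
detected = sniff_bom(sample)
print(detected)
```

A file without a BOM returns None here, which is where a statistical detector such as chardet is still needed.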

2. Reading by encoding and then decoding

# Decode the file as UTF-16, then turn literal \uXXXX escapes into characters.
with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
print(text.encode("utf-8").decode("unicode_escape"))

1.新出吐鲁番文书及其研究

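Encoding the text to bytes and then decoding those bytes with unicode_escape recovers the Chinese characters. This presumably works because a.txt stores the characters as literal \uXXXX escape sequences. unicode_escape treats every non-ASCII byte as Latin-1, so text that already contains real Chinese characters would come out garbled.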
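The encode-then-decode step can be tried end to end with a temporary file. This is only a sketch under one assumption: that the file holds literal \uXXXX escapes rather than the characters themselves. The temporary path stands in for a.txt, and the variable names are illustrative.

```python
import os
import tempfile

original = "1.新出吐鲁番文书及其研究"
# Assumption: the file stores non-ASCII characters as literal \uXXXX escapes.
escaped = original.encode("unicode_escape").decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")  # hypothetical stand-in for a.txt
    with open(path, "w", encoding="utf-16") as f:
        f.write(escaped)
    with open(path, encoding="utf-16") as f:
        text = f.read()

# The escapes are pure ASCII, so the round trip recovers the characters.
recovered = text.encode("utf-8").decode("unicode_escape")

# Pitfall: real Chinese characters become multi-byte UTF-8, and
# unicode_escape reads each of those bytes as a Latin-1 character.
garbled = original.encode("utf-8").decode("unicode_escape")

print(recovered == original)
print(garbled == original)
```

If the file already contains the real characters, reading it with the right `encoding=` argument is enough and the extra encode and decode step should be skipped.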

3. Unicode handling in BERT

import six

def convert_to_unicode(text):
    """
    Converts `text` to Unicode (if it's not already).
    Bytes are decoded with unicode_escape here, not UTF-8.
    """
    # six_ensure_text is copied from https://github.com/benjaminp/six
    def six_ensure_text(s, encoding="unicode_escape", errors="strict"):
        if isinstance(s, six.binary_type):
            print('true')  # debug marker: the input was bytes
            # Bytes are decoded with the requested encoding.
            return s.decode(encoding, errors)
        elif isinstance(s, six.text_type):
            # Text is returned unchanged.
            return s
        else:
            raise TypeError("not expecting type '%s'" % type(s))

    return six_ensure_text(text, encoding="unicode_escape", errors="ignore")

with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
print(convert_to_unicode(text.encode("utf-8")))

true
1.新出吐鲁番文书及其研究

Note:

>>> type(text.encode("utf-8"))  # encode() returns bytes
<class 'bytes'>

>>> type(text)  # encoding= in open() names the file's encoding; the text read back is a str
<class 'str'>

https://six.readthedocs.io/

The binary type above (six.binary_type) is simply the bytes type in Python 3, and six.text_type is str.
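Because six.binary_type is bytes and six.text_type is str on Python 3, the same check can be written without six when Python 2 support is not needed. The following is a minimal sketch rather than the BERT code itself. `ensure_text` is a hypothetical name, and its UTF-8 default is an assumption chosen for illustration.

```python
def ensure_text(value, encoding="utf-8", errors="strict"):
    """Return `value` as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    if isinstance(value, str):
        return value
    raise TypeError("not expecting type '%s'" % type(value))


# ASCII bytes that hold literal \uXXXX escapes, as in the example above.
raw = b"1.\\u65b0\\u51fa"
decoded = ensure_text(raw, encoding="unicode_escape")

# A str passes through unchanged.
unchanged = ensure_text("abc")

print(decoded)
print(unchanged)
```

Passing the encoding explicitly keeps the helper usable both for ordinary UTF-8 bytes and for the escape-sequence case shown here.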

posted @ 2020-08-17 11:40 lypbendlf
