Encoding issues in Python 3

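Python 3 draws a much clearer line between text (str) and binary data (bytes).

Text is always Unicode and is represented by the str type (in Python 2 the default encoding was ASCII); binary data is represented by the bytes type.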

>>> s = '中文ENGLISH'

s is text, i.e. an object of type str

>>> s.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'
>>> s.encode('gb2312')
b'\xd6\xd0\xce\xc4ENGLISH'
>>> bytes(s, 'utf-8')
b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'

bytes(s, 'utf-8') does the same thing as s.encode('utf-8'): it encodes a str into bytes

>>> b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'.decode('utf-8')
'中文ENGLISH'

This is decoding: converting bytes into str

>>> b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'.encode('utf-8')
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'

bytes cannot be encoded again: a bytes object has no encode method

>>> '中文ENGLISH'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#44>", line 1, in <module>
    '中文ENGLISH'.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'

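Likewise, a str cannot be decoded again: a str object has no decode method

In other words, encoding goes from str to bytes, and decoding goes from bytes to str.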
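
The rule above can be checked with a short round trip. This is only a minimal sketch; the variable names are made up for illustration.

```python
# Round trip: str --encode--> bytes --decode--> str
text = '中文ENGLISH'

utf8_bytes = text.encode('utf-8')   # str -> bytes
gb_bytes = text.encode('gb2312')    # same text, different byte sequence

# Decoding with the same codec that produced the bytes restores the text.
roundtrip_utf8 = utf8_bytes.decode('utf-8')
roundtrip_gb = gb_bytes.decode('gb2312')

# Each type only offers one direction.
bytes_can_encode = hasattr(utf8_bytes, 'encode')   # False
str_can_decode = hasattr(text, 'decode')           # False

print(roundtrip_utf8 == text, roundtrip_gb == text)
```

The same text yields different byte sequences under different codecs, which is why the codec used for decoding has to match the one used for encoding.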

>>> b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'.decode('gb2312')
Traceback (most recent call last):
  File "<pyshell#45>", line 1, in <module>
    b'\xe4\xb8\xad\xe6\x96\x87ENGLISH'.decode('gb2312')
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xad in position 2: illegal multibyte sequence

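Here bytes encoded as utf-8 are decoded as gb2312, which fails: the first two bytes happen to form a valid gb2312 pair, but the byte 0xad at position 2 does not start any valid gb2312 sequence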
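
When a failure like this has to be handled rather than just observed, the exception can be caught, or the errors argument of decode can be used. The sketch below assumes that losing the undecodable bytes is acceptable, which is often not the case for real data.

```python
data = '中文ENGLISH'.encode('utf-8')

failed_at = None
try:
    data.decode('gb2312')
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte the codec rejected
    failed_at = exc.start
    print('decode failed at byte offset', failed_at)

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
# The result is readable but lossy: the original text cannot be recovered from it.
lossy = data.decode('gb2312', errors='replace')
print(lossy)
```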

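How can you tell which encoding a given sequence of bytes was produced with? In general this cannot be determined with 100% certainty; if it could, mojibake would never happen.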
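
One way to see why certainty is impossible is to try several codecs and note that more than one can succeed on the same bytes. The helper guess_encodings below is hypothetical (it is not part of any library) and the candidate list is an arbitrary assumption; it is a sketch of the idea, not a real detector.

```python
def guess_encodings(data, candidates=('utf-8', 'gbk', 'latin-1')):
    """Return the candidate encodings that decode `data` without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

utf8_data = '中文ENGLISH'.encode('utf-8')
gbk_data = '中文ENGLISH'.encode('gbk')

# latin-1 maps every byte value to a character, so it never fails.
# "Decodes without error" therefore does not mean "is the right encoding".
print(guess_encodings(utf8_data))
print(guess_encodings(gbk_data))
```

A successful decode only rules encodings out; it never proves which one was actually used.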

The third-party library chardet provides a detect function that can "guess" the encoding.

>>> from chardet import detect
>>> detect(b'\xe4\xb8\xad\xe6\x96\x87ENGLISH')
{'confidence': 0.7525, 'encoding': 'utf-8'}

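The confidence here is 0.7525, which can be loosely read as a probability of 0.7525. The input contains only two Chinese characters; with a longer byte string the confidence is usually higher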

>>> detect(b'\xe6\x88\x91\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87\xe6\x88\x91\xe7\x9c\x9f\xe7\x9a\x84\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87')
{'confidence': 0.99, 'encoding': 'utf-8'}

This input contains 10 Chinese characters, and the confidence rises to 0.99

__________________________________________________________________________________

Problems when reading from txt files

There are two files, ansi.txt and utf8.txt, saved with the ANSI and UTF-8 encodings respectively. Both contain '中文ENGLISH'

>>> f_ansi=open(r'd:\ansi.txt','r')
>>> f_ansi.read()
'中文ENGLISH'

>>> f_utf8=open(r'd:\utf8.txt','r')
>>> f_utf8.read()
'锘夸腑鏂嘐NGLISH'

>>> f_utf8=open(r'd:\utf8.txt','r',encoding='utf-8')
>>> f_utf8.read()
'\ufeff中文ENGLISH'
# UTF-8 with a BOM: the leading '\ufeff' is the byte order mark
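
If the leading '\ufeff' is unwanted, Python's standard 'utf-8-sig' codec strips a BOM when one is present. The sketch below recreates a file like utf8.txt in a temporary directory (the path is made up for the example) and reads it both ways.

```python
import os
import tempfile

# Build a UTF-8 file that starts with a BOM, like utf8.txt in the session above.
path = os.path.join(tempfile.mkdtemp(), 'utf8.txt')
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + '中文ENGLISH'.encode('utf-8'))

with open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()       # the BOM survives as '\ufeff'

with open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()    # the BOM is removed if present

print(repr(with_bom))
print(repr(without_bom))
```

'utf-8-sig' also reads files that have no BOM, so it is a reasonable choice when the origin of a UTF-8 file is unknown.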

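Notepad's "ANSI" encoding is the system's locale encoding. On my machine that is gbk, so ansi.txt is encoded as gbk

When the encoding argument of open() is omitted, it defaults to the locale encoding, which here is also gbk, so ansi.txt can be opened and read directly

Opening the utf-8 encoded txt file the same way decodes it as gbk, which produces mojibake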
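
Because the default depends on the machine, it is usually safer to pass encoding explicitly. The sketch below prints the locale encoding that open() normally falls back to, then writes and reads a gbk file with the encoding spelled out; the temporary path stands in for d:\ansi.txt and is an assumption of the example.

```python
import locale
import os
import tempfile

# The encoding open() normally uses in text mode when none is given.
# The value differs between machines (for example 'cp936' on a Chinese Windows system).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

ansi_path = os.path.join(tempfile.mkdtemp(), 'ansi.txt')

# Stating the encoding on both sides makes the result independent of the locale.
with open(ansi_path, 'w', encoding='gbk') as f:
    f.write('中文ENGLISH')

with open(ansi_path, 'r', encoding='gbk') as f:
    ansi_text = f.read()

print(ansi_text)
```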

To verify:

>>> '锘夸腑鏂嘐NGLISH'.encode('gbk').decode('utf-8')
'\ufeff中文ENGLISH'
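
The whole chain can also be reproduced in memory without any files. This is a small sketch with illustrative variable names; it starts from the text including the BOM, as read from utf8.txt above.

```python
original = '\ufeff中文ENGLISH'

raw = original.encode('utf-8')      # the bytes stored in utf8.txt
garbled = raw.decode('gbk')         # what a gbk reader makes of them

# Reversing the wrong decode and applying the right one recovers the text.
repaired = garbled.encode('gbk').decode('utf-8')

print(garbled)
print(repr(repaired))
```

The repair only works here because the wrong decode happened to succeed without dropping or replacing any bytes; once bytes have been lost, the original text cannot be rebuilt this way.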

posted @ 2016-03-12 20:50 fj0716