A hands-on look at str/bytes encoding problems, triggered by the return value of time.tzname

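Environment: Windows 10 Home (Chinese edition), Python 3.6.4.

This afternoon I reviewed the time module and practiced converting between its time formats: float timestamps, struct_time, and strings. That went smoothly.

However, when I tested the time.tzname attribute, I got garbled text (mojibake):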

>>> import time
>>> time.tzname
('ÖÐ¹ú±ê×¼Ê±¼ä', 'ÖÐ¹úÏÄÁîÊ±')

It returned a tuple, but who can read garbled text like that!

Note on time.tzname:

A tuple of two strings: the first is the name of the local non-DST timezone, the second is the name of the local DST timezone.

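Judging from the result, it returns a tuple of two Unicode strings.

So which encoding produced these two strings, and how can I turn them into something I can read?

I found an article online (https://www.oschina.net/question/2927993_2199064?sort=default) that gives this fix: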

a = time.tzname[0]
b = a.encode('latin-1').decode('gbk')
print(b)

Note: replacing gbk with gb2312 also works.

The value of b:

中国标准时间

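Explanation of the code above (reference link 1 explains it more clearly):

encode converts the string to bytes, and decode converts the bytes back to a string. The result is a str decoded from GBK bytes (the str itself holds Unicode code points, not GBK data), and Python IDLE displays it correctly.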
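The fix can be wrapped in a small helper. This is only a sketch: `repair_mojibake` is a name I made up (it is not part of any library), and it assumes the garbled text came from GBK bytes that were decoded as Latin-1. The garbled string is rebuilt from its byte values so the example runs on any machine, not only on Chinese Windows.

```python
def repair_mojibake(text, real_encoding="gbk"):
    """Undo a wrong Latin-1 decode (hypothetical helper, not a library API)."""
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes.
        raw = text.encode("latin-1")
        # Decode those bytes with the encoding they were really written in.
        return raw.decode(real_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; leave it unchanged.
        return text


# The garbled name from the session above, rebuilt from its byte values.
garbled = bytes.fromhex("d6d0b9fab1ead7bccab1bce4").decode("latin-1")
fixed = repair_mojibake(garbled)
print(fixed)
```

The fallback means that a name that is already readable passes through untouched. The helper cannot tell real Latin-1 text from mojibake in every case, so it should only be applied to strings known to be affected.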

Now it is readable. But why is this conversion needed? Why latin-1 and gbk? Let's keep digging.

Note:

Besides encode and decode, str() and bytes() can also convert between str and bytes. They are used below.
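A short check of that equivalence, using the character 汉 as sample input:

```python
text = "汉"

# bytes(str, encoding) does the same job as str.encode(encoding).
assert bytes(text, encoding="utf-8") == text.encode("utf-8")

# str(bytes, encoding) does the same job as bytes.decode(encoding).
raw = text.encode("utf-8")
assert str(raw, "utf-8") == raw.decode("utf-8") == text

# Without an encoding argument, str() on bytes does not decode at all:
# it returns the printable representation of the bytes object.
as_repr = str(raw)
print(as_repr)
```

One thing to watch: str(raw) with no encoding argument gives the repr of the bytes, not the decoded text, so the encoding should always be passed when the goal is to decode.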

Note:

str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'.

bytes.decode(encoding="utf-8", errors="strict")
bytearray.decode(encoding="utf-8", errors="strict")
Return a string decoded from the given bytes. Default encoding is 'utf-8'.
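The errors parameter in both signatures decides what happens when a character or byte cannot be converted. A minimal sketch of the three most common handlers ("strict", "replace", "ignore"), with sample inputs chosen by me:

```python
han = "汉"

# errors="strict" (the default) raises when a character has no mapping.
try:
    han.encode("latin-1")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# errors="replace" substitutes "?" when encoding ...
replaced = han.encode("latin-1", errors="replace")
# ... and U+FFFD (the replacement character) when decoding.
decoded = b"\xd6".decode("utf-8", errors="replace")

# errors="ignore" silently drops whatever cannot be converted.
ignored = han.encode("latin-1", errors="ignore")

print(strict_raised, replaced, repr(decoded), ignored)
```

"replace" and "ignore" lose information, so they hide an encoding mismatch rather than fix it. For tracking down a problem like the one in this post, the default "strict" is more useful.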

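Question: how can I tell which encoding a string uses?

A string can be thought of as an array of characters. So how is each character represented in memory, and which integer does it correspond to? Of course, Python has no standalone character type; a character is just a string of length 1.

In reference link 2, I found the function that converts a character to an integer: ord.

Here it is in use: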

>>> tzname = time.tzname
>>> for ch in tzname[0]:
    print("0x%x" % ord(ch))

    
0xd6
0xd0
0xb9
0xfa
0xb1
0xea
0xd7
0xbc
0xca
0xb1
0xbc
0xe4

Unfortunately, with my limited knowledge, I could not tell from the output above which encoding was used.
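Looking back, the printed values do carry one hint: every code point fits in a single byte. The sketch below rebuilds the garbled string from those values and checks that observation. It is a heuristic, not proof, since genuine Latin-1 text would pass the same check.

```python
# The garbled name, rebuilt from the code points printed above.
code_points = [0xd6, 0xd0, 0xb9, 0xfa, 0xb1, 0xea,
               0xd7, 0xbc, 0xca, 0xb1, 0xbc, 0xe4]
garbled = "".join(chr(cp) for cp in code_points)

# Every code point is below 256: a hint that these "characters" are really
# bytes that were decoded one-to-one, which is what Latin-1 does.
looks_like_bytes = all(ord(ch) < 256 for ch in garbled)

# Turning the code points back into bytes is exactly a Latin-1 encode.
raw = bytes(ord(ch) for ch in garbled)
same_as_latin1 = raw == garbled.encode("latin-1")

print(looks_like_bytes, same_as_latin1, raw)
```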

Some further tests follow.

item = time.tzname[0]

Test 1:

bsx0 = bytes(item, encoding="gbk")

This raises an exception:

UnicodeEncodeError: 'gbk' codec can't encode character '\xd6' in position 0: illegal multibyte sequence

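So that approach does not work: the garbled characters such as '\xd6' do not exist in GBK, so they cannot be encoded with it. The bytes() function here does the same job as encode: it converts a str to bytes.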

Test 2:

import chardet  # third-party package, used here to guess the encoding of bytes
bs0 = bytes(item, encoding="utf-8")
print(bs0)
print(chardet.detect(bs0))
print("OK 1? ", str(bs0, 'utf-8'))
print('0x%x' % ord(str(bs0, 'utf-8')[0]))

Test output:

b'\xc3\x96\xc3\x90\xc2\xb9\xc3\xba\xc2\xb1\xc3\xaa\xc3\x97\xc2\xbc\xc3\x8a\xc2\xb1\xc2\xbc\xc3\xa4'
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}  # the encoding reported by chardet.detect
OK 1? ÖÐ¹ú±ê×¼Ê±¼ä  # still garbled, same as in IDLE
0xd6  # bytes() then str(), both with utf-8: the first character's code point is still 0xd6
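Test 2 could never have repaired the text, because encoding and decoding with the same codec just returns the original string. A small sketch of that, with the garbled string rebuilt from its byte values:

```python
garbled = bytes.fromhex("d6d0b9fab1ead7bccab1bce4").decode("latin-1")

# Code points 0x80-0xFF take two bytes each in UTF-8, so the 12 garbled
# characters become 24 bytes.
utf8_bytes = garbled.encode("utf-8")
print(len(garbled), len(utf8_bytes))

# Decoding with the same codec restores exactly the same characters,
# so the text stays garbled.
round_tripped = utf8_bytes.decode("utf-8")
print(round_tripped == garbled)
```

This also explains the chardet result: the bytes it inspected were produced by a UTF-8 encode a moment earlier, so "utf-8" describes those bytes and says nothing about where the garbled string originally came from.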

Test 3:

bs = bytes(item, encoding="latin-1")
print(bs)
print(chardet.detect(bs))
str_bs = str(bs, 'gbk')
print(str_bs)
print('0x%x' % ord(str_bs[0]))

print(bytes(str_bs, encoding='gbk'))

Test output:

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'  # str to bytes via latin-1: the bytes match the per-character hex values printed earlier
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}  # yet chardet.detect reports GB2312
中国标准时间  # the bytes decoded to str with gbk (gb2312 also works): no longer garbled
0x4e2d  # code point of the first character of that string: different this time

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'  # bytes produced by encoding with gbk: identical to the latin-1 bytes above!
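The two identical byte strings suggest how the garbled text was produced in the first place. The sketch below replays that in the forward direction. Both steps are my assumption about what happens on this Windows setup, inferred from the test results and not taken from any documentation.

```python
original = "中国标准时间"

# Step 1 (assumed): the system hands over the timezone name as GBK bytes.
raw = original.encode("gbk")

# Step 2 (assumed): those bytes get decoded as Latin-1, one byte per character.
garbled = raw.decode("latin-1")
print(garbled)

# Encoding the garbled text with Latin-1 recovers the GBK bytes unchanged,
# which is why both bytes() calls in Test 3 print the same value.
latin1_bytes = garbled.encode("latin-1")
print(latin1_bytes == raw)
```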

Test 4:

Continuing from Test 3: str_bs is the string obtained above by decoding with gbk.

bs2 = bytes(str_bs, encoding='utf-8')
print(bs2)
print(chardet.detect(bs2))
print("OK 2? ", str(bs2, 'utf-8'))
print('0x%x' % ord(str(bs2, 'utf-8')[0]))

Test output:

b'\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'  # the gbk-decoded string encoded to bytes with utf-8: differs from the bytes in Test 3
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}  # detected as utf-8
OK 2? 中国标准时间  # also a readable string
0x4e2d  # code point of the first character

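Questions

Where is the problem? Why would I need to convert the string to UTF-8?

How exactly are Unicode, UTF-8, GBK, and GB2312 related?

b'\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'

How is that converted into:

b'\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'

What is the computation?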
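One way to see the computation: there is no direct byte-to-byte formula. The GBK bytes are first mapped to Unicode code points (a table lookup inside the codec), and the code points are then written out as UTF-8 using fixed bit patterns. The sketch below does the second step by hand; `utf8_three_bytes` is an illustrative helper of my own and only covers the three-byte range that these six characters fall into.

```python
gbk_bytes = b"\xd6\xd0\xb9\xfa\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4"

# Step 1: GBK bytes -> code points (a table lookup inside the gbk codec).
text = gbk_bytes.decode("gbk")


# Step 2: code points -> UTF-8 bytes (pure bit arithmetic).
def utf8_three_bytes(cp):
    """Encode one code point in 0x0800-0xFFFF by hand (illustrative only)."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("outside the three-byte UTF-8 range")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: low 6 bits
    ])


by_hand = b"".join(utf8_three_bytes(ord(ch)) for ch in text)
print(by_hand == text.encode("utf-8"))
```

For the first character, 0x4e2d is 0100 111000 101101 in binary, which gives 0xe4, 0xb8, 0xad: the first three bytes of the UTF-8 result.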

Test 5:

new_item = item.encode('latin-1').decode('gbk')
print('OK 3?', new_item)
print('0x%x' % ord(new_item[0]))

new_item2 = new_item.encode('gbk').decode('utf8')
print(new_item2)

Test output:

OK 3? 中国标准时间
0x4e2d
Traceback (most recent call last):
  File "D:\eclipse\workspace\zl0425\src\test\aug\time01.py", line 52, in <module>
    new_item2 = new_item.encode('gbk').decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 0: invalid continuation byte  # it failed!

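Puzzle:

In Test 3, the original string was encoded to bytes with latin-1 and then decoded to a string with gbk;

In Test 4, the string from Test 3 was encoded to bytes with utf-8 and then decoded back to a string with utf-8;

latin-1 --> gbk --> utf-8 raised no error, so why does the encode/decode chain in Test 5 fail?

After changing gbk to utf8 in the failing statement, new_item2 displays correctly: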

new_item2 = new_item.encode('utf8').decode('utf8')

Result:

中国标准时间
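The difference between the failing and the working statement is whether encode and decode use the same codec. A minimal sketch with the readable string typed in directly:

```python
text = "中国标准时间"

# Mismatched pair: GBK bytes are not valid UTF-8, so decoding fails.
try:
    text.encode("gbk").decode("utf-8")
    mismatch_failed = False
except UnicodeDecodeError:
    mismatch_failed = True

# Matched pairs always return the original string.
via_gbk = text.encode("gbk").decode("gbk")
via_utf8 = text.encode("utf-8").decode("utf-8")

print(mismatch_failed, via_gbk == text, via_utf8 == text)
```

Test 4 worked because it encoded with utf-8 and decoded with utf-8. Test 5 encoded with gbk and then tried to read those bytes as utf-8.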

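17:23 I still don't fully understand this; I'll come back to it later.

18:05 Computer back on, battery full, time to take on this problem again.

After reading reference links 4 and 5 and running some experiments on the character “汉”, I found that encode and decode both come down to operations on bytes: encode maps Unicode code points to bytes, and decode maps bytes back to code points.

Below are the results of testing with the bytes() and str() functions: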

# A str already consists of Unicode characters
>>> han = '汉'
# Print the code point of 汉 in the Unicode character set
>>> ord(han)
27721
>>> print('0x%x' % ord(han))
0x6c49


# Encoding a Unicode character above 255 to bytes with latin-1 raises an exception
# Characters with code points below 256 can be encoded!
>>> bytes(han, encoding='latin-1')
Traceback (most recent call last):
  File "<pyshell#63>", line 1, in <module>
    bytes(han, encoding='latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u6c49' in position 0: ordinal not in range(256)

# Encoding the Unicode character to bytes with utf-8 succeeds
# Whatever codec is used to encode must also be used to decode
>>> hanbs = bytes(han, encoding='utf-8')
# A three-byte UTF-8 encoding
>>> hanbs
b'\xe6\xb1\x89'

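# Decode the UTF-8 bytes to a string using latin-1
# Each byte is handled on its own, giving a string of length 3 that is unreadable mojibake
# \xe6, \xb1, and \x89 each become one character
# These are Unicode characters, but with code points below 256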
>>> han_latin = str(hanbs, 'latin-1')
>>> han_latin
'æ±\x89'

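# Encode the latin-1-decoded string to bytes with utf-8
# Each character is encoded separately: three characters, three encodings
# The result is UTF-8 bytes, now 6 bytes long
# Every two bytes represent one character (Unicode character) of han_latin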
>>> hanbs2 = bytes(han_latin, encoding='utf-8')
>>> hanbs2
b'\xc3\xa6\xc2\xb1\xc2\x89'
>>> str(hanbs2, encoding='utf-8')
'æ±\x89'

# Better: encode the latin-1-decoded string back to bytes with latin-1
# Decoding those bytes as utf-8 restores “汉”
>>> hanbs3 = bytes(han_latin, encoding='latin-1')
>>> hanbs3
b'\xe6\xb1\x89'
>>> str(hanbs3, encoding='utf-8')
'汉'
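The detour through latin-1 is safe because Latin-1 maps byte value n to code point n for every possible byte, so nothing is lost in either direction. A small check of that property:

```python
# Latin-1 maps byte value n to code point n for every n in 0-255.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
one_to_one = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))

# So decode followed by encode returns the original bytes, whatever they were.
lossless = as_text.encode("latin-1") == all_bytes

# That is why the UTF-8 bytes of "汉" survive a detour through Latin-1.
recovered = "汉".encode("utf-8").decode("latin-1").encode("latin-1").decode("utf-8")

print(one_to_one, lossless, recovered)
```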


# Test with the letter a
>>> zimu = 'a'
>>> ord(zimu)
97
>>> print('0x%x' % ord(zimu))
0x61

>>> zimubs = bytes(zimu, encoding='utf-8')
>>> zimubs
b'a'
>>> zimu_latin = str(zimubs, 'latin-1')
>>> zimu_latin
'a'

# However it is converted, the resulting bytes are always b'a',
# because latin-1 and utf-8 produce identical bytes for code points below 128 (ASCII)
>>> bytes(zimu_latin, 'utf-8')
b'a'
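Strictly speaking, latin-1 and utf-8 only agree for code points below 128. From 128 to 255, latin-1 still uses one byte while utf-8 needs two, which matches the æ result seen earlier. A quick check over the whole 0-255 range (`same_bytes` is a helper name of my own):

```python
def same_bytes(cp):
    """True if UTF-8 and Latin-1 encode this code point identically."""
    ch = chr(cp)
    return ch.encode("utf-8") == ch.encode("latin-1")


ascii_agree = all(same_bytes(cp) for cp in range(128))
upper_agree = any(same_bytes(cp) for cp in range(128, 256))
print(ascii_agree, upper_agree)

# Example from the upper half: one byte in Latin-1, two bytes in UTF-8.
print("æ".encode("latin-1"), "æ".encode("utf-8"))
```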