Encodings in Python

Python has two commonly used string types: str and bytes.

  • Converting between str and bytes:
>>> str1 = 'hello world!'
>>> type(str1)  # check the type of str1
<class 'str'>
>>> b = str1.encode('utf-8')  # str -> bytes
>>> b, type(b)
(b'hello world!', <class 'bytes'>)
>>> str2 = b.decode('utf-8')  # bytes -> str
>>> str2, type(str2)
('hello world!', <class 'str'>)
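The same round trip with non-ASCII text makes the difference between the two types easier to see. The sketch below is only an illustration (the variable names are not from the original post): the length of a str counts characters, while the length of the bytes depends on the encoding used.

```python
# Illustrative sketch: str <-> bytes round trip with non-ASCII text.
text = "你好"                       # a str holding 2 Unicode characters

utf8_bytes = text.encode("utf-8")   # UTF-8 uses 3 bytes for each of these characters
gbk_bytes = text.encode("gbk")      # GBK uses 2 bytes for each of these characters

print(len(text), len(utf8_bytes), len(gbk_bytes))  # prints: 2 6 4

# Decoding with the same codec that was used for encoding restores the str.
assert utf8_bytes.decode("utf-8") == text
assert gbk_bytes.decode("gbk") == text
```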
  • Converting between different encodings:

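  The usual approach is to decode the bytes to str using their original encoding, then encode that str to bytes in the target encoding.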

>>> strs = "常用的字符串格式"
>>> b = strs.encode('GBK')
>>> b
b'\xb3\xa3\xd3\xc3\xb5\xc4\xd7\xd6\xb7\xfb\xb4\xae\xb8\xf1\xca\xbd'
>>> str1 = b.decode('GBK')
>>> str1
'常用的字符串格式'
>>> str1.encode('UTF-8')
b'\xe5\xb8\xb8\xe7\x94\xa8\xe7\x9a\x84\xe5\xad\x97\xe7\xac\xa6\xe4\xb8\xb2\xe6\xa0\xbc\xe5\xbc\x8f'
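The decode-then-encode steps can be wrapped in a small helper. This is a minimal sketch; the function name `transcode` and its parameters are hypothetical and chosen only for illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from the source encoding into the target encoding."""
    text = data.decode(source)    # bytes -> str, using the original encoding
    return text.encode(target)    # str -> bytes, using the new encoding


gbk_bytes = "常用的字符串格式".encode("GBK")
utf8_bytes = transcode(gbk_bytes, "GBK", "UTF-8")
print(utf8_bytes.decode("UTF-8"))
```

If `source` does not match the real encoding of the data, `decode` may raise UnicodeDecodeError or silently produce garbled text, so the original encoding has to be known (or detected) first.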
  • Opening a file with a specific encoding
>>> output = open('test', 'r', encoding='utf-8')
>>> output.read()
'百度百科——全球最大中文百科全书'
>>> output.close()
>>> output = open('test', 'r', encoding='GBK')
>>> output.read()
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    output.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa7 in position 10: illegal multibyte sequence
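When the encoding of a file is not known in advance, one option is to try a short list of candidates and keep the first one that decodes without error. The sketch below assumes this fallback strategy; the helper name `read_text`, the candidate list, and the temporary demo file are all hypothetical and not part of the original post.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of %r could decode %r" % (encodings, path))


# Demo on a temporary GBK-encoded file, so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")
    with open(path, "w", encoding="gbk") as f:
        f.write("百度百科")
    text, used = read_text(path)
    print(used, text)
```

A successful decode is not proof that the guess was right: some byte sequences are valid in more than one encoding, in which case the text comes back garbled instead of raising an error.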
  • Non-ASCII characters in URLs
import re
from urllib.parse import quote

def url2ascii(self, url_addr):  # convert a URL to ASCII (a method taken from a class, hence `self`)
    index = 0
    url_ascii = ""
    try:  # try a plain ASCII conversion first; handle the failure below
        url_ascii = url_addr.encode('ascii')  # bytes
        url_ascii = str(url_ascii, encoding="ascii")  # back to str
    except UnicodeEncodeError:
        # [^\x00-\x7f] also matches characters above U+FFFF, which [\u0080-\uffff] would miss
        url_Nonascii = re.findall(r'[^\x00-\x7f]+', url_addr)  # runs of non-ASCII characters
        url_asciilist = re.split(r'[^\x00-\x7f]+', url_addr)  # ASCII pieces between those runs
        for s in url_Nonascii:
            # quote() percent-encodes non-ASCII text into ASCII such as %E4%BD%A0%E5%A5%BD
            url_ascii += url_asciilist[index] + quote(s)
            index += 1
        if index < len(url_asciilist):
            url_ascii += url_asciilist[index]
    print(url_ascii)
    return url_ascii
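A shorter alternative to splitting the URL by hand is to call `urllib.parse.quote` on the whole string and list the URL delimiters in its `safe` argument so they are left untouched. This is a different technique from the function above, shown only as a sketch; the name `url_to_ascii` and the exact set of safe characters are assumptions.

```python
from urllib.parse import quote

# Characters that quote() should leave as-is: URL delimiters plus "%",
# so that escapes already present in the URL are not encoded a second time.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def url_to_ascii(url):
    """Percent-encode everything in url that is not ASCII-safe (UTF-8 by default)."""
    return quote(url, safe=SAFE_CHARS)


print(url_to_ascii("http://www.baidu.com/s?wd=你好"))
# prints: http://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD
```

Unlike the regex version, this also percent-encodes ASCII characters that are not allowed in URLs, such as spaces.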
  • Automatically detecting the encoding of bytes (Python 3.3, win32)

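  The chardet module can automatically detect the encoding of a byte stream, such as a web page or a file opened in binary mode.

  Download and installation:

  Download URL: https://github.com/byroot/chardet

  To install, unpack the archive, change into the chardet-master folder, and run: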

python setup.py --help-commands
python setup.py build
python setup.py install

  You may also need to install the setuptools module (the version I installed was distribute-0.6.38).

  Example:

>>> import urllib.request
>>> urlp = urllib.request.urlopen('http://www.baidu.com')
>>> import chardet
>>> chardet.detect(urlp.read())
{'encoding': 'utf-8', 'confidence': 0.99}
>>> urlp.close()
>>> inputs = open('test', 'rb')  # binary mode: read() returns bytes
>>> chardet.detect(inputs.read())
{'encoding': 'GB2312', 'confidence': 0.99}
>>> inputs.close()
>>> inputs = open('test', 'rt')  # text mode: read() returns str, which detect() cannot handle
>>> chardet.detect(inputs.read())
Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    chardet.detect(inputs.read())
  File "D:\program file\Python33\lib\site-packages\chardet2-2.0.3-py3.3.egg\chardet\__init__.py", line 24, in detect
    u.feed(aBuf)
  File "D:\program file\Python33\lib\site-packages\chardet2-2.0.3-py3.3.egg\chardet\universaldetector.py", line 98, in feed
    if self._highBitDetector.search(aBuf):
TypeError: can't use a bytes pattern on a string-like object
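The detection result is usually fed straight into `decode`. The sketch below shows one way to combine the two steps, always passing bytes to `chardet.detect`. The helper name `detect_and_decode` and the `fallback` parameter are hypothetical, and the guarded import is only there so that the snippet still runs where chardet is not installed.

```python
def detect_and_decode(raw, fallback="utf-8"):
    """Decode raw bytes using the encoding guessed by chardet, if available."""
    try:
        import chardet  # third-party module, may not be installed
    except ImportError:
        chardet = None

    encoding = None
    if chardet is not None:
        # detect() expects bytes; its 'encoding' entry can be None
        # when no guess could be made (for example, on empty input).
        encoding = chardet.detect(raw)["encoding"]
    return raw.decode(encoding or fallback)


print(detect_and_decode(b"hello world!"))
```

The guess is statistical, so it tends to be less reliable on short inputs; checking the reported `confidence` value before trusting the result is a reasonable precaution.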

Posted on 2013-05-14 21:55 by 甘泉love若水
