Handling Chinese text in Python - 北京涛子

1 The encode() method encodes a string with the codec named by encoding. The errors parameter selects the error-handling scheme.

# Syntax
str.encode([encoding[, errors]])

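# Parameters
encoding -- the codec to use, e.g. "UTF-8".
errors -- the error-handling scheme. The default is 'strict', which raises a UnicodeError on an encoding error. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace', and any name registered through codecs.register_error().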

# Return value
Returns the encoded string.
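
To make the errors values concrete, here is a minimal sketch of each handler applied to text that ASCII cannot represent. The sample string and variable names are made up for illustration, and the code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'caf\u00e9 \u4e2d\u6587'  # u'café 中文'

# 'strict' (the default) raises UnicodeEncodeError, a subclass of UnicodeError.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeError:
    strict_failed = True

ignored = text.encode('ascii', 'ignore')              # drops unencodable characters
replaced = text.encode('ascii', 'replace')            # substitutes '?'
xml_refs = text.encode('ascii', 'xmlcharrefreplace')  # numeric XML character references
escaped = text.encode('ascii', 'backslashreplace')    # backslash escape sequences
```

'ignore' and 'replace' lose information, so they suit logging or display more than data that must round-trip.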

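2 The decode() method decodes a string with the codec named by encoding. If encoding is omitted, Python 2 uses the default string encoding (normally ASCII).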

# Syntax
str.decode([encoding[, errors]])

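# Parameters
encoding -- the codec to use, e.g. "UTF-8".
errors -- the error-handling scheme. The default is 'strict', which raises a UnicodeError on a decoding error. Other possible values are 'ignore', 'replace', and any name registered through codecs.register_error(); in Python 2, 'xmlcharrefreplace' and 'backslashreplace' work for encoding only.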

# Return value
Returns the decoded string.
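
A small sketch of how errors affects decoding, using UTF-8 bytes with the last byte cut off to simulate damaged input. The variable names are illustrative only, and the code is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
raw = u'\u4e2d\u6587'.encode('utf-8')  # UTF-8 bytes of u'中文'
truncated = raw[:-1]                   # drop the final byte to simulate corruption

# 'strict' (the default) raises UnicodeDecodeError, a subclass of UnicodeError.
try:
    truncated.decode('utf-8')
    strict_failed = False
except UnicodeError:
    strict_failed = True

ignored = truncated.decode('utf-8', 'ignore')    # silently drops the broken tail
replaced = truncated.decode('utf-8', 'replace')  # marks the damage with U+FFFD
```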

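#!/usr/bin/python
# Python 2 only: the 'base64' codec is not available through str.encode() in Python 3.

text = "this is string example....wow!!!"
encoded = text.encode('base64', 'strict')

print "Encoded String: " + encoded
print "Decoded String: " + encoded.decode('base64', 'strict')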

# Output
Encoded String: dGhpcyBpcyBzdHJpbmcgZXhhbXBsZS4uLi53b3chISE=
Decoded String: this is string example....wow!!!

Chinese text

http://www.cnblogs.com/long2015/p/4090824.html
http://www.cnblogs.com/skynet/archive/2011/05/03/2035105.html

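English text fits in ASCII, and Chinese text has traditionally used GB2312. Unicode is the unified character set that covers Chinese along with other languages, and UTF-8 is one way of encoding Unicode as bytes.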
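
As a rough illustration of the difference between these encodings, the sketch below encodes the same two characters as GB2312 and as UTF-8. Variable names are illustrative, and the code should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'\u4e2d\u6587'  # u'中文'

as_gb2312 = text.encode('gb2312')  # 2 bytes per Chinese character
as_utf8 = text.encode('utf-8')     # 3 bytes per character for these two

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_matches_utf8 = u'abc'.encode('ascii') == u'abc'.encode('utf-8')
```

The same characters yield different byte sequences, which is why bytes decoded with the wrong codec show up as garbled text.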

str  -> decode('the_coding_of_str') -> unicode
unicode -> encode('the_coding_you_want') -> str

str is a byte string: the bytes produced by encoding a unicode string.
# Ways to create one
s = '中文'
s = u'中文'.encode('utf-8')

>>> type('中文') 
<type 'str'> 

# Length (returns the number of bytes)
>>> u'中文'.encode('utf-8') 
'\xe4\xb8\xad\xe6\x96\x87' 

>>> len(u'中文'.encode('utf-8')) 
6

unicode is the real string type: a sequence of characters.

# Ways to create one
s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')  

>>> type(u'中文') 
<type 'unicode'> 

# Length (returns the number of characters)
>>> u'中文' 
u'\u4e2d\u6587' 

>>> len(u'中文') 
2
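
The two length results can be put side by side. This is a minimal sketch with illustrative variable names, written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'\u4e2d\u6587'       # u'中文' as a unicode string
data = text.encode('utf-8')  # the same text as UTF-8 bytes

char_count = len(text)  # counts characters
byte_count = len(data)  # counts bytes

# Slicing bytes can cut a character in half; slicing unicode cannot.
first_char = text[:1]
first_char_bytes = data[:3]  # the first character occupies exactly 3 bytes here
```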

First work out whether you are handling a str or a unicode, then use the matching method (str.decode / unicode.encode).

How to check whether a value is unicode or str:

>>> isinstance(u'中文', unicode) 
True 

>>> isinstance('中文', unicode) 
False  

>>> isinstance('中文', str) 
True 

>>> isinstance(u'中文', str) 
False
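
One way to apply these isinstance checks is a pair of small conversion helpers. The sketch below is only an illustration: to_text and to_bytes are hypothetical names, UTF-8 is an assumed default, and the version check exists only so the same code runs on Python 2.7 and Python 3 (where the types are called str and bytes).

```python
# -*- coding: utf-8 -*-
import sys

if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # Python 2 names

def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it if it is bytes."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it if it is unicode."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

Because each helper checks the type first, calling it twice on the same value is harmless, which avoids the double-encode and double-decode mistakes described next.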

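Simple rule: never call encode on a str, and never call decode on a unicode. In Python 2 either call first triggers an implicit conversion with the ASCII codec, which fails on Chinese text: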

>>> '中文'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u'中文'.decode('utf-8') 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

To convert between encodings, use unicode as the intermediate form.

# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
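
A concrete instance of this pattern, converting GB2312 bytes to UTF-8. The transcode helper is a hypothetical name introduced for illustration, and the sketch should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through unicode."""
    return data.decode(source_encoding).encode(target_encoding)

gb_bytes = b'\xd6\xd0\xce\xc4'  # u'中文' encoded as GB2312
utf8_bytes = transcode(gb_bytes, 'gb2312', 'utf-8')
```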

File handling, IDEs and the console

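Reading a file: the input arrives in an external encoding, so decode it to unicode.
Processing: use unicode as the single internal representation, then encode to the target encoding you need.
Writing: send the encoded bytes to the target output (a file or the console).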
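
The three steps above can be sketched with io.open, which decodes on read and encodes on write in both Python 2.7 and Python 3. The file names, the temporary directory and the sample content are all assumptions made for this illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'input_gb2312.txt')  # hypothetical input file
dst = os.path.join(workdir, 'output_utf8.txt')   # hypothetical output file

# Set-up: create a sample input file encoded as GB2312.
with io.open(src, 'wb') as f:
    f.write(u'\u4e2d\u6587\n'.encode('gb2312'))

# 1. Read: io.open decodes the external encoding to unicode.
with io.open(src, 'r', encoding='gb2312') as f:
    text = f.read()

# 2. Process: work on unicode only.
processed = text.strip() + u'!'

# 3. Write: io.open encodes unicode to the target encoding.
with io.open(dst, 'w', encoding='utf-8') as f:
    f.write(processed)

# Read the raw bytes back to see what was actually written, then clean up.
with io.open(dst, 'rb') as f:
    written = f.read()
shutil.rmtree(workdir)
```

Keeping decode and encode at the file boundary means the processing step never has to care which encoding the input used.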

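Use one encoding everywhere so that no single stage introduces garbled text (environment encoding, IDE or text editor, file encoding, database table encoding).
In Python 2 a .py file is read as ASCII by default; if the source contains non-ASCII characters, declare the encoding at the top of the file: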

# -*- coding: utf-8 -*-

Example

>>> import urllib2, json
>>> resp = urllib2.urlopen('http://10.0.80.80/apis4machine/screenLog.php').read()
>>> resp
'{"status":0,"data":[{"channel":"ACM\\u5927\\u534e\\u5e9c","IP":"10.0.80.80","start_time":"2016-03-19 13:33:04","end_time":"2016-03-19 15:43:12","err_msg":"\\u8fde\\u7eed2\\u6b21\\u51fa\\u73b0\\u57ab\\u7247"}]}'

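(The Chinese text here is "ACM大华府" and "连续2次出现垫片".)
Which encoding is this Chinese text in? Sequences such as \\u5927\\u534e\\u5e9c (single backslashes in the raw response; the repr doubles them) are JSON \uXXXX escapes that name Unicode code points, so after json.loads the values are unicode objects.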

unicode -> str,
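
A minimal sketch of that last step: parse the response with json.loads to get unicode values, then encode them when a byte string is needed. The resp literal is a trimmed copy of the body shown above rather than a live request, UTF-8 is an assumed target encoding, and the code should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import json

# Trimmed copy of the response body shown above (raw bytes, single backslashes).
resp = (b'{"status":0,"data":[{"channel":"ACM\\u5927\\u534e\\u5e9c",'
        b'"err_msg":"\\u8fde\\u7eed2\\u6b21\\u51fa\\u73b0\\u57ab\\u7247"}]}')

# The body is pure ASCII, so decoding it is safe; json.loads resolves the escapes.
payload = json.loads(resp.decode('ascii'))
record = payload['data'][0]

channel = record['channel']  # unicode: u'ACM大华府'
err_msg = record['err_msg']  # unicode: u'连续2次出现垫片'

# unicode -> str: encode to the byte encoding the output needs.
channel_utf8 = channel.encode('utf-8')
```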

posted on 2016-03-17 17:18 by 北京涛子
