Handling Chinese text with Python regular expressions

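For work I needed to pull Chinese words out of text for word segmentation. In Python 2, \W+ without the re.UNICODE flag matches every character outside [a-zA-Z0-9_], so it matches all Chinese characters and the punctuation as well. The punctuation therefore has to be filtered out some other way.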
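To illustrate the problem, here is a minimal sketch (not taken from the referenced post) that assumes a unicode input and the re.UNICODE flag. With the flag set, \w treats Chinese characters as word characters, so \W+ is left matching only the separators. The sample string and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import re

sample = u"#who#helloworld#a中文x#"

# With re.UNICODE, \w covers Unicode letters, including Chinese characters.
words = re.findall(u"\\w+", sample, re.UNICODE)

# \W+ then matches only the separators (here the '#' signs).
separators = re.findall(u"\\W+", sample, re.UNICODE)
```

This separates words from punctuation, but it does not separate Chinese from Latin letters: "a中文x" comes back as a single word.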

Reference blog post: http://log.medcl.net/item/2011/03/the-chinese-deal-is-the-python/

1. When matching Chinese, the regular expression and the target string must use the same encoding

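    # -*- coding: utf-8 -*-
    import sys

    print sys.getdefaultencoding()    # 'ascii' on a stock Python 2
    text = u"#who#helloworld#a中文x#"
    print isinstance(text, unicode)   # True: the literal has the u prefix
    print text                        # fails when the console encoding is ASCII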

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

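print text fails.
Explanation: the console encodes output as ASCII (ASCII is the default encoding on an English-language system), while the string in the code above is unicode and contains characters that ASCII cannot represent, so the output fails.
Changing the call to print(text.encode('utf8')) fixes it.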
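For the matching itself, one common approach is a unicode pattern over the CJK Unified Ideographs block (U+4E00 to U+9FFF), applied to a unicode string so that pattern and target agree. The following is a sketch under that assumption; the choice of range and the names han_run and chinese_runs are mine, not from the referenced post.

```python
# -*- coding: utf-8 -*-
import re

text = u"#who#helloworld#a中文x#"

# Both the pattern and the target are unicode strings.
han_run = re.compile(u"[\u4e00-\u9fff]+")

# Each match is a maximal run of consecutive Chinese characters.
chinese_runs = han_run.findall(text)
```

Because the character class lists only Chinese characters, Latin letters and punctuation never match, so no separate filtering step is needed.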

2. Check the system default encoding

    import sys
    print sys.getdefaultencoding()

3. Check whether a string is unicode

    print isinstance(text, unicode)

4. Converting between unicode and Python byte strings

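    # -*- coding: utf-8 -*-
    unistr = u'a'
    pystr = unistr.encode('utf8')       # unicode -> byte string (str)
    unistr2 = unicode(pystr, 'utf8')    # byte string -> unicode

    value = pystr   # sample input; the name input would shadow the builtin

    # Where unicode is required
    if not isinstance(value, unicode):
        temp = unicode(value, 'utf8')
    else:
        temp = value

    # Where a Python byte string (str) is required
    if isinstance(value, unicode):
        temp2 = value.encode('utf8')
    else:
        temp2 = value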
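The two branches above can be wrapped in small helpers. This is a sketch only: to_unicode, to_bytes, text_type and bytes_type are hypothetical names, and the version check is an assumption added so that the same code also runs on Python 3, where the text type is str and the byte type is bytes.

```python
# -*- coding: utf-8 -*-
import sys

# Python 2 has `unicode` and `str`; Python 3 has `str` and `bytes`.
if sys.version_info[0] >= 3:
    text_type, bytes_type = str, bytes
else:
    text_type, bytes_type = unicode, str  # only evaluated on Python 2


def to_unicode(value, encoding="utf8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf8"):
    """Return value as a byte string, encoding text with the given encoding."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value
```

Calling to_unicode once at the input boundary and to_bytes once at the output boundary keeps the regex code working on unicode throughout.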

In my tests, if the script's # -*- coding: utf-8 -*- declaration is changed to GBK, the unicode conversion raises an error.
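One plausible cause is a codec mismatch: in a file declared and saved as GBK, byte-string literals hold GBK bytes, and decoding those as 'utf8' fails. The sketch below reproduces that mismatch directly, without touching the file's declaration. It is an assumption about what went wrong in the experiment, not a confirmed diagnosis, and the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Simulate the bytes a GBK-encoded source file would contain.
gbk_bytes = u"中文".encode("gbk")

try:
    gbk_bytes.decode("utf8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    # GBK bytes are not valid UTF-8 here, so the decode is rejected.
    decoded_as_utf8 = False

# Decoding with the matching codec round-trips correctly.
round_trip = gbk_bytes.decode("gbk")
```

If this is the cause, passing the file's actual encoding to the conversion (for example 'gbk' instead of 'utf8') should avoid the error.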

posted @ 2013-10-24 14:54 Alex-Zeng
