python去除特殊字符
去除数字,特殊字符,只保留汉字
import re
s = '1123*#$ 中abc国'
str = re.sub('[a-zA-Z0-9'!"#$%&\'()*+,-./:;<=>?@,。?★、…【】《》?“”‘'![\\]^_`{|}~\s]+', "", s)
# 去除不可见字符
str = re.sub('[\001\002\003\004\005\006\007\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a]+', '', x)
print(str)
# 结果为:中国
去除特殊字符,只保留汉子,字母、数字
import re
string = "123我123456abcdefgABCVDFF?/ ,。,.:;:''';'''[]{}()()《》"
print(string)
123我123456abcdefgABCVDFF?/ ,。,.:;:''';'''[]{}()()《》
sub_str = re.sub(u"([^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a])","",string)
print(sub_str)
123我123456abcdefgABCVDFF
正则表达式说明
函数 | 说明 |
---|---|
sub(pattern,repl,string) | 把字符串中的所有匹配表达式pattern中的地方替换成repl |
[^**] | 表示不匹配此字符集中的任何一个字符 |
\u4e00-\u9fa5 | 汉字的unicode范围 |
\u0030-\u0039 | 数字的unicode范围 |
\u0041-\u005a | 大写字母unicode范围 |
\u0061-\u007a | 小写字母unicode范围 |
\uAC00-\uD7AF | 韩文的unicode范围 |
\u3040-\u31FF | 日文的unicode范围 |
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!