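Python's re Regular Expression Module, Part 2
13. Compilation flags
Flags can be passed as arguments such as re.I and re.M, or embedded directly in the expression as "(?iLmsux)".
*s: single-line (dot-all) mode; "." matches every character, including newlines
*i: ignore case
*L: make "\w" match locale-dependent characters; support for Chinese appears to be poor
*m: multi-line mode; "^" and "$" also match at the start and end of each line
*x: verbose mode; extra whitespace is ignored and "#" comments are allowed, which makes the expression easier to read
*u: Unicode
Example: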
>>> re.findall(r"[a-z]+","%123Abc%45xyz&")
['bc', 'xyz']
>>> re.findall(r"[a-z]+","%123Abc%45xyz&",re.I)
['Abc', 'xyz']
>>>
>>> re.findall(r"(?i)[a-z]+","%123Abc%45xyz&")
['Abc', 'xyz']
A more readable format:
>>> pattern=r"""
... (\d+)    #number
... ([a-z]+) #letter
... """
>>>
>>> re.findall(pattern,"%123Abc\n%45xyz&",re.i | re.S |re.x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'i'
# As the error shows, the flag names must be uppercase.
>>> re.findall(pattern,"%123Abc\n%45xyz&",re.I | re.S |re.X)
[('123', 'Abc'), ('45', 'xyz')]
>>>
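The transcripts above exercise re.I, re.S and re.X, but not re.M. The following is a minimal sketch of how re.M and re.S change the behavior of "^" and "."; the sample text and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line"

# Without re.M, "^" only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.M, "^" also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, "." stops at the newline.
dot_default = re.findall(r"first.*", text)
# With re.S, "." also matches the newline, so ".*" runs to the end.
dot_all = re.findall(r"first.*", text, re.S)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dot_default)       # ['first line']
print(dot_all)           # ['first line\nsecond line']
```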
Group operations
Named groups: (?P<name>...)
>>> for m in re.finditer(r"(?P<digit>(\d+))(?P<letter>([a-z]+))","%123Abc%45xyz&",re.I):
...     print m.groupdict()
...
{'digit': '123', 'letter': 'Abc'}
{'digit': '45', 'letter': 'xyz'}
Non-capturing groups: (?:...) take part in the match but are not returned as a group:
>>> for m in re.finditer(r"(?:(\d+))(?P<letter>([a-z]+))","%123Abc%45xyz&",re.I):
...     print m.groupdict()
...
{'letter': 'Abc'}
{'letter': 'xyz'}
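One detail is easy to miss in the transcript above: groupdict() only lists named groups, and the inner (\d+) is still a numbered capturing group even though it sits inside (?:...). The sketch below, using the same sample string, shows the difference with findall() and groups(); the variable names are illustrative only.

```python
import re

text = "%123Abc%45xyz&"

# Two capturing groups: findall returns one tuple per match.
both = re.findall(r"(\d+)([a-z]+)", text, re.I)
# The digits are still required, but only the letters are captured.
letters_only = re.findall(r"(?:\d+)([a-z]+)", text, re.I)

# The pattern from the transcript: the inner (\d+) is group 1,
# the named group is group 2, and the inner ([a-z]+) is group 3.
m = re.search(r"(?:(\d+))(?P<letter>([a-z]+))", text, re.I)
numbered = m.groups()

print(both)          # [('123', 'Abc'), ('45', 'xyz')]
print(letters_only)  # ['Abc', 'xyz']
print(numbered)      # ('123', 'Abc', 'Abc')
```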
Backreferences: \<number> or (?P=name) refer to an earlier group:
>>> for m in re.finditer(r"<(\w)>\w+</(\1)>","%<a>123Abc</a>%<b>45xyz</b>&%"):
...     print m.group()
...
<a>123Abc</a>
<b>45xyz</b>
>>> for m in re.finditer(r"<(?P<tag>\w)>\w+</(?P=tag)>","%<a>123Abc</a>%<b>45xyz</b>&%"):
...     print m.group()
...
<a>123Abc</a>
<b>45xyz</b>
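Another common use of backreferences is spotting accidentally repeated words. This is a small sketch with an invented sample sentence; it shows that the numbered and the named form behave the same way.

```python
import re

text = "this is is a test of of the the parser"

# \1 must match exactly the same text that group 1 captured.
doubled = [m.group(1) for m in re.finditer(r"\b(\w+) \1\b", text)]

# The same check written with a named group and (?P=name).
doubled_named = [
    m.group("word")
    for m in re.finditer(r"\b(?P<word>\w+) (?P=word)\b", text)
]

print(doubled)        # ['is', 'of', 'the']
print(doubled_named)  # ['is', 'of', 'the']
```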
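Positive lookahead (?=...): the group's content must appear immediately to the right; it is not part of the returned match
Negative lookahead (?!...): the group's content must not appear immediately to the right; it is not part of the returned match
Positive lookbehind (?<=...): the group's content must appear immediately to the left; it is not part of the returned match
Negative lookbehind (?<!...): the group's content must not appear immediately to the left; it is not part of the returned match
>>> for m in re.finditer(r"\d+(?=[ab])","%123Abc%45xyz%780b&",re.I):
...     print m.group()
...
123
780
>>> for m in re.finditer(r"(?<!\d)[a-z]{3,}","%123Abc%45xyz%bysc&",re.I):
...     print m.group()
...
bysc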
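The transcripts above cover positive lookahead and negative lookbehind. The remaining two forms can be sketched as follows; the sample string is invented. Note the extra \d alternative in the negative lookahead: without it the engine could backtrack to a shorter run of digits ("20" in "200kg") and still report a match.

```python
import re

text = "price: $100, weight: 200kg, count: 300"

# Positive lookbehind: digits that directly follow a "$".
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookahead: runs of digits not followed by "kg".
# The \d alternative stops a shorter prefix such as "20" from matching.
not_kg = re.findall(r"\d+(?!\d|kg)", text)

print(after_dollar)  # ['100']
print(not_kg)        # ['100', '300']
```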
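Modification
split: split the string using pattern as the separator. If the pattern is written as "(pattern)", the separators are returned as well.
>>> re.split(r"\W","abc,123,x")
['abc', '123', 'x']
>>> re.split(r"(\W)","abc,123,x")
['abc', ',', '123', ',', 'x']
# Wrapping the pattern in parentheses returns the separators as well.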
split(pattern, string, maxsplit=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
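The maxsplit parameter in the signature above limits how many splits are made; whatever is left stays in the last element. A brief sketch with an invented sample string:

```python
import re

text = "abc,123;x y"

# Split at every non-word character.
parts_all = re.split(r"\W", text)
# Split only once; the remainder is kept intact as the last element.
parts_limited = re.split(r"\W", text, maxsplit=1)

print(parts_all)      # ['abc', '123', 'x', 'y']
print(parts_limited)  # ['abc', '123;x y']
```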
sub: replace substrings; the number of replacements can be limited:
>>> re.sub(r"[a-z]+","*","abc,123,x")
'*,123,*'
>>>
>>> re.sub(r"[a-z]+","*","abc,123,x",1)
'*,123,x'
sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
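"Backslash escapes are processed" also means that the replacement string can refer to groups of the match, using \1 for numbered groups or \g<name> for named ones. A minimal sketch that reorders a date; the date string and group names are made up for illustration.

```python
import re

text = "2024-01-31"

# \1, \2, \3 in the replacement refer to the numbered groups.
swapped = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", text)

# \g<name> refers to a named group.
swapped_named = re.sub(
    r"(?P<y>\d+)-(?P<m>\d+)-(?P<d>\d+)",
    r"\g<d>/\g<m>/\g<y>",
    text,
)

print(swapped)        # 31/01/2024
print(swapped_named)  # 31/01/2024
```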
subn() works like sub(), but returns "(new string, number of replacements)":
>>> re.subn(r"\W","*","abc,123,x")
('abc*123*x', 2)
The replacement can also be a function, so that different matches are replaced with different results:
>>> def repl(m):
...     print m.group()
...     return "*" * len(m.group())
...
>>> re.subn(r"[a-z]+",repl,"abc,123,x")
abc
x
('***,123,*', 2)
>>>
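A replacement function can also compute its result from the matched text instead of just masking it. The sketch below doubles every number in a string; the helper name double and the input are hypothetical.

```python
import re

def double(m):
    # m.group() is the matched run of digits; return the doubled value as text.
    return str(int(m.group()) * 2)

result, count = re.subn(r"\d+", double, "abc,123,x,7")

print(result)  # abc,246,x,14
print(count)   # 2
```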