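5.2.2 re Module Functions and Regular Expression Objects

The Python standard library module re provides everything needed for regular expression operations. You can either call the functions of the re module directly, or compile a pattern into a regular expression object and use that object's methods.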

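Function	Description
compile(pattern[,flags])	Compiles a pattern into a regular expression object
search(pattern,string[,flags])	Searches the whole string for the pattern; returns a match object or None
match(pattern,string[,flags])	Matches the pattern at the start of the string; returns a match object or None
findall(pattern,string[,flags])	Returns a list of all matches of the pattern in the string
split(pattern,string[,maxsplit=0])	Splits the string at the matches of the pattern
sub(pat,repl,string[,count=0])	Replaces the matches of pat in the string with repl (all of them when count is 0)
escape(string)	Escapes all regular expression special characters in the string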

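The flags argument can be any combination (joined with "|") of re.I (ignore case), re.L (locale-dependent matching), re.M (multi-line mode), re.S (make the metacharacter "." match any character, including a newline), re.U (Unicode matching) and re.X (ignore whitespace in the pattern and allow # comments).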
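
As a minimal sketch of combining flags with "|", the following snippet compares a search without flags against one using re.I | re.M, and shows a commented pattern under re.X. The sample text and variable names are illustrative assumptions, not part of the original examples.

```python
import re

# Hypothetical sample text chosen for illustration.
text = 'Python\npython rocks\nPYTHON'

# No flags: matching is case-sensitive and ^ matches only at the start of the string.
plain = re.findall(r'^python', text)

# re.I | re.M: ignore case and let ^ match at the start of every line.
combined = re.findall(r'^python', text, re.I | re.M)

# re.X: whitespace in the pattern is ignored and # starts a comment.
version = re.compile(r'''
    \d+   # major version
    \.    # literal dot
    \d+   # minor version
''', re.X)
versions = version.findall('Python 3.5 and 2.7')

print(plain)     # []
print(combined)  # ['Python', 'python', 'PYTHON']
print(versions)  # ['3.5', '2.7']
```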

1 Using the re module directly

>>> import re
>>> text = 'alpha.beta...gamma delta'   #test string
>>> re.split(r'[\. ]+',text)   #split the string on the given separator characters
['alpha', 'beta', 'gamma', 'delta']
>>> 
>>> re.split(r'[\. ]+',text,maxsplit=2)  #split at most twice
['alpha', 'beta', 'gamma delta']
>>> 
>>> re.split(r'[\. ]+',text,maxsplit=1)  #split at most once
['alpha', 'beta...gamma delta']
>>> 
>>> pat = '[a-zA-Z]+'
>>> re.findall(pat,text)     #find all words
['alpha', 'beta', 'gamma', 'delta']
>>> 
>>> pat = '{name}'
>>> text = 'Dear {name}...'
>>> re.sub(pat,'Mr.Dong',text)  #string replacement
'Dear Mr.Dong...'
>>> 
>>> s = 'a s d'
>>> re.sub('a|s|d','goog',s)
'goog goog goog'
>>> 
>>> re.escape('http://www.python.org')  #escape the string (output from Python 3.6 or earlier; 3.7+ escapes only the dots)
'http\\:\\/\\/www\\.python\\.org'
>>> 
>>> print(re.match('done|quit','done')) #match succeeds, returns a match object
<_sre.SRE_Match object; span=(0, 4), match='done'>
>>> 
>>> print(re.match('done|quit','done!')) #match() only requires a match at the start of the string
<_sre.SRE_Match object; span=(0, 4), match='done'>
>>> 
>>> print(re.match('done|quit','doe')) #no match, returns None
None
>>> 
>>> print(re.search('done|quit','d!one!done')) #search() finds a match anywhere in the string
<_sre.SRE_Match object; span=(6, 10), match='done'>
>>>
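
The transcript above shows what escape() returns but not where it is useful. As a minimal sketch, assume a piece of literal text that happens to contain regular expression metacharacters; the template and placeholder below are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical template; '$(price)' is an illustrative placeholder.
template = 'Dear {name}, the total is $(price).'
placeholder = '$(price)'

# '$', '(' and ')' are metacharacters, so escape the literal text
# before using it as a pattern.
safe_pattern = re.escape(placeholder)
result = re.sub(safe_pattern, '42', template)

print(result)  # Dear {name}, the total is 42.
```

Used unescaped, the same placeholder would be read as an end-of-string anchor followed by a group, and the substitution would not find the literal text.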

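The following code uses several different approaches to remove redundant spaces from a string: runs of consecutive spaces are collapsed into a single space, and all whitespace at both ends of the string is removed.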

>>> import re
>>> s = 'aaa      bb   c d e      fff       '
>>> ' '.join(s.split())   #no regular expression, just string methods
'aaa bb c d e fff'
>>> 
>>> re.split(r'[\s]+',s)   #the trailing whitespace leaves an empty string at the end
['aaa', 'bb', 'c', 'd', 'e', 'fff', '']
>>> 
>>> re.split(r'[\s]+',s.strip())  #combine an re function with a string method
['aaa', 'bb', 'c', 'd', 'e', 'fff']
>>> 
>>> ' '.join(re.split(r'[\s]+',s.strip()))
'aaa bb c d e fff'
>>> 
>>> ' '.join(re.split(r'\s+',s.strip()))
'aaa bb c d e fff'
>>> 
>>> #use the re module's substitution function directly
>>> re.sub(r'\s+',' ',s.strip())
'aaa bb c d e fff'
>>>

The following code uses metacharacters that begin with "\" to perform specific kinds of searches.

>>> import re
>>> example = 'ShanDong Institute of Business and Technology is a very beautiful school.'
>>> 
>>> re.findall('\\ba.+?\\b',example) #whole words starting with the letter a; "?" makes the match non-greedy
['and', 'a ']
>>> 
>>> re.findall('\\ba.+\\b',example)  #result of greedy matching
['and Technology is a very beautiful school']
>>> 
>>> re.findall('\\ba\\w* \\b',example)
['and ', 'a ']
>>> 
>>> re.findall('\\Bo.+?\\b',example) #rest of each word, starting from an o that is not its first letter
['ong', 'ology', 'ool']
>>> 
>>> re.findall('\\b\\w.+?\\b',example)  #all words (".+?" needs at least one character, hence 'a ')
['ShanDong', 'Institute', 'of', 'Business', 'and', 'Technology', 'is', 'a ', 'very', 'beautiful', 'school']
>>> 
>>> re.findall('\\w+',example)   #all words
['ShanDong', 'Institute', 'of', 'Business', 'and', 'Technology', 'is', 'a', 'very', 'beautiful', 'school']
>>> 
>>> re.findall(r'\b\w.+?\b',example)   #the same pattern written as a raw string
['ShanDong', 'Institute', 'of', 'Business', 'and', 'Technology', 'is', 'a ', 'very', 'beautiful', 'school']
>>> 
>>> re.split(r'\s',example)  #split the string on any whitespace character
['ShanDong', 'Institute', 'of', 'Business', 'and', 'Technology', 'is', 'a', 'very', 'beautiful', 'school.']
>>> 
>>> re.findall(r'\d+\.\d+\.\d+','Python 2.7.11')  #find numbers of the form x.x.x
['2.7.11']
>>> 
>>> re.findall(r'\d+\.\d+\.\d+','Python 2.7.11,Python 3.5.1')
['2.7.11', '3.5.1']
>>>

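2 Using regular expression objects

First use the compile() function of the re module to compile a regular expression into a regular expression object, then process strings with the methods that object provides. A compiled regular expression object can speed up string processing when the same pattern is applied many times, and its methods offer more control, such as the optional pos and endpos arguments.

The match(string[,pos[,endpos]]) method of a regular expression object looks for a match at the start of the string or at the specified position; the pattern must occur exactly there. The search(string[,pos[,endpos]]) method searches the whole string or the specified range. The findall(string[,pos[,endpos]]) method finds all substrings that match the regular expression and returns them as a list.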

>>> import re
>>> example = 'ShanDong Institute of Business and Technology'
>>> 
>>> #compile a regular expression object that finds words starting with B
>>> pattern = re.compile(r'\bB\w+\b')
>>> 
>>> #use the findall() method of the regular expression object
>>> pattern.findall(example)
['Business']
>>> 
>>> #find words ending with the letter g
>>> pattern = re.compile(r'\w+g\b')
>>> pattern.findall(example)
['ShanDong']
>>> 
>>> #find words that are 3 letters long
>>> pattern = re.compile(r'\b[a-zA-Z]{3}\b')
>>> pattern.findall(example)
['and']
>>> 
>>> #match from the start of the string; fails and returns None, so nothing is printed
>>> pattern.match(example)
>>> 
>>> #search the whole string; succeeds
>>> pattern.search(example)
<_sre.SRE_Match object; span=(31, 34), match='and'>
>>> 
>>> #find all words containing the letter a (the pattern also consumes the trailing space)
>>> pattern = re.compile(r'\b\w*a\w* \b')
>>> pattern.findall(example)
['ShanDong ', 'and ']
>>> 
>>> text = 'He was carefully disguised but captured quickly by police.'
>>> 
>>> #find all words ending with the letters ly
>>> re.findall(r'\w+ly\b',text)
['carefully', 'quickly']
>>>
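
The optional pos and endpos arguments are not exercised in the transcript above. The following is a minimal sketch of how they restrict where match(), search() and findall() look. It reuses the same example string; the pattern and variable names are illustrative assumptions.

```python
import re

example = 'ShanDong Institute of Business and Technology'
pattern = re.compile(r'[A-Z]\w+')  # a capital letter followed by word characters

# match() succeeds only if the pattern matches exactly at pos (default 0).
at_start = pattern.match(example)        # matches 'ShanDong'
at_9 = pattern.match(example, 9)         # matches 'Institute'
at_19 = pattern.match(example, 19)       # 'of' starts in lowercase, so None

# search() scans forward from pos; endpos limits how far it may look.
after_19 = pattern.search(example, 19)     # finds 'Business'
limited = pattern.search(example, 19, 22)  # only 'of ' is visible, so None

# findall() accepts the same optional range.
in_range = pattern.findall(example, 9, 30)

print(at_start.group(), at_9.group(), at_19)  # ShanDong Institute None
print(after_19.span(), limited)               # (22, 30) None
print(in_range)                               # ['Institute', 'Business']
```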

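The sub(repl,string[,count=0]) and subn(repl,string[,count=0]) methods of a regular expression object implement string replacement; subn() additionally returns the number of replacements made.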

>>> import re
>>> example = '''Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.'''
>>> #regular expression object that matches words starting with b or B
>>> pattern = re.compile(r'\bb\w* \b',re.I)
>>> 
>>> #replace every matching word with *
>>> pattern.sub('*',example)
'*is *than ugly.\nExplicit is *than implicit.\nSimple is *than complex.\nComplex is *than complicated.\nFlat is *than nested.\nSparse is *than dense.\nReadability counts.'
>>> 
>>> #replace only the first match
>>> pattern.sub('*',example,1)
'*is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\nFlat is better than nested.\nSparse is better than dense.\nReadability counts.'
>>> 
>>> #match only words starting with a lowercase b
>>> pattern = re.compile(r'\bb\w* \b')
>>> #replace every matching word with *
>>> pattern.sub('*',example)
'Beautiful is *than ugly.\nExplicit is *than implicit.\nSimple is *than complex.\nComplex is *than complicated.\nFlat is *than nested.\nSparse is *than dense.\nReadability counts.'
>>>
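
The transcript above only calls sub(). As a minimal sketch of subn(), which returns a tuple of the new string and the number of replacements, consider the following. The shortened text and the variable names are illustrative assumptions.

```python
import re

# Two lines borrowed from the example text above, shortened for illustration.
text = 'Flat is better than nested.\nSparse is better than dense.'
pattern = re.compile(r'\bb\w*\b', re.I)

# subn() returns (new_string, number_of_replacements).
new_text, count = pattern.subn('*', text)

# count limits the number of replacements, just as it does for sub().
first_only, n = pattern.subn('*', text, count=1)

print(count)       # 2
print(new_text)    # both occurrences of 'better' become '*'
print(n)           # 1
```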

The split(string[,maxsplit=0]) method of a regular expression object splits a string.

>>> import re
>>> 
>>> example = r'one,two,three.four/five\six?seven[eight]nine|ten'
>>> 
>>> pattern = re.compile(r'[,./\\?[\]\|]')
>>> 
>>> pattern.split(example)
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
>>> 
>>> example = r'one1two2three3four4five5six6seven7eight8nine9ten'
>>> #use digits as separators
>>> pattern = re.compile(r'\d+')
>>> pattern.split(example)
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
>>> 
>>> pattern = re.compile(r'[\s,.\d]+')  #allow repeated separators
>>> pattern.split(example)
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
>>>
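
The maxsplit argument of the split() method is not used in the transcript above. A minimal sketch, assuming a shortened version of the digit-separated string:

```python
import re

# Shortened, illustrative version of the digit-separated example string.
example = 'one1two2three3four'
pattern = re.compile(r'\d')

# maxsplit=0 (the default) means no limit on the number of splits.
all_parts = pattern.split(example)

# With maxsplit=2 the rest of the string is returned unsplit as the last item.
two_splits = pattern.split(example, maxsplit=2)

print(all_parts)   # ['one', 'two', 'three', 'four']
print(two_splits)  # ['one', 'two', 'three3four']
```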

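3 Match objects

When a match succeeds, the match() and search() functions of the re module, and the methods of the same names on regular expression objects, return a match object.

The main methods of a match object are:

group() (returns the content matched by one or more groups),

groups() (returns a tuple containing the content matched by all groups),

groupdict() (returns a dictionary containing the content matched by all named groups),

start() (returns the start position of the content matched by the specified group),

end() (returns the end position of the content matched by the specified group, that is, the index just past its last character),

span() (returns a tuple of the start and end positions of the content matched by the specified group).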
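
As a minimal sketch of how these positions relate to the matched text, the snippet below inspects a single match object. The sample string and variable names are illustrative assumptions.

```python
import re

text = 'Python 3.5'
m = re.search(r'(\d+)\.(\d+)', text)

whole = m.group(0)        # text matched by the whole pattern
parts = m.groups()        # text matched by groups 1 and 2
whole_span = m.span()     # (start, end) of the whole match
major_span = m.span(1)    # (start, end) of group 1
minor_span = m.span(2)    # (start, end) of group 2

# end() is the index just past the last matched character,
# so slicing with start() and end() recovers the matched text.
sliced = text[m.start():m.end()]

print(whole, parts)                        # 3.5 ('3', '5')
print(whole_span, major_span, minor_span)  # (7, 10) (7, 8) (9, 10)
print(sliced == whole)                     # True
```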

The following code uses several different approaches to delete specified content from a string.

>>> import re
>>> email = "tony@tiremove_thisger.net"
>>> 
>>> #use the match object returned by search()
>>> m = re.search("remove_this",email)
>>> 
>>> #string slicing
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
>>> 
>>> #use the re module's sub() function directly
>>> re.sub('remove_this','',email)
'tony@tiger.net'
>>> 
>>> #the string replace() method also works
>>> email.replace('remove_this','')
'tony@tiger.net'
>>>

The following code demonstrates how to use group(), groups(), groupdict() and other methods of a match object:

>>> import re
>>> #without a space between the groups, both groups match inside the first word
>>> m = re.match(r'(\w+)(\w+)','Isaac Newton,physicist')
>>> m.group(0)
'Isaac'
>>> m = re.match(r'(\w+) (\w+)','Isaac Newton,physicist')
>>> #group 0 is the content matched by the whole pattern
>>> m.group(0)
'Isaac Newton'
>>> 
>>> #content matched by group 1
>>> m.group(1)
'Isaac'
>>> 
>>> #content matched by group 2
>>> m.group(2)
'Newton'
>>> 
>>> #content matched by several specified groups
>>> m.group(1,2)
('Isaac', 'Newton')
>>>

The following code demonstrates the extension syntax for groups:

>>> import re
>>> 
>>> #use named groups
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)","Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> 
>>> m.group('last_name')
'Reynolds'
>>> 
>>> m = re.match(r'(\d+)\.(\d+)','24.1632')
>>> m.groupdict()   #named groups as a dictionary; empty here because no group is named
{}
>>> m.groups()    #all matched groups (group 0 is not included)
('24', '1632')
>>> 
>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)","Malcolm Reynolds")
>>> #return the match result as a dictionary
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
>>> 
>>> exampleString = '''There should be one--and preferabley only one --obvious way to do it.
Although taht way not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.'''
>>> pattern = re.compile(r'(?<=\w\s)never(?=\s\w)')   #find a never that is neither at the start nor at the end of a sentence
>>> matchResult = pattern.search(exampleString)
>>> matchResult.span()
(168, 173)
>>> 
>>> #find a never preceded by a word character and whitespace; the first such match ends a sentence
>>> pattern = re.compile(r'(?<=\w\s)never')
>>> matchResult = pattern.search(exampleString)
>>> matchResult.span()
(152, 157)
>>> 
>>> #find the combination better than preceded by is
>>> pattern = re.compile(r'(?:is\s)better(\sthan)')
>>> matchResult = pattern.search(exampleString)
>>> matchResult.span()
(137, 151)
>>> 
>>> #group 0 is the whole pattern
>>> matchResult.group(0)
'is better than'
>>> 
>>> #(?:...) does not capture, so group 1 is (\sthan)
>>> matchResult.group(1)
' than'
>>> 
>>> #find all words starting with the letter n or N
>>> pattern = re.compile(r'\bn\w+\b',re.I)
>>> index = 0
>>> while True:
    matchResult = pattern.search(exampleString,index)
    if not matchResult:
        break
    print(matchResult.group(0),':',matchResult.span(0))
    index = matchResult.end(0)

    
not : (88, 91)
Now : (133, 136)
never : (152, 157)
never : (168, 173)
now : (201, 204)
>>> 
>>> pattern = re.compile(r'(?<!not\s)be\b')   #find the word be when it is not preceded by the word not
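
The transcript stops after compiling this last pattern. As a minimal sketch of how a negative lookbehind such as (?<!not\s) behaves, the snippet below applies the same pattern to two short strings. Both strings are hypothetical, chosen only for illustration, and are not the continuation of the original session.

```python
import re

pattern = re.compile(r'(?<!not\s)be\b')

# Hypothetical inputs, not taken from the transcript above.
accepted = pattern.search('To be or not be')  # the first 'be' is not preceded by 'not '
rejected = pattern.search('not be')           # the only 'be' follows 'not ', so no match

print(accepted.span())  # (3, 5)
print(rejected)         # None
```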

posted @ 2018-04-08 11:18 Avention