Python: the re regular expression module
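A regular expression is a text-matching pattern described in a formal syntax. The pattern is interpreted as a set of instructions, which are then executed with a string as input to produce a matching subset or a modified version of the original string. The term "regular expression" is often shortened to "regex" or "regexp" in discussion. Expressions can include literal text matching, repetition, pattern groups, branching, and other sophisticated rules. Many parsing problems are easier to solve with a regular expression than by writing a special-purpose lexer and parser.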
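Regular expressions are typically used in applications that process a lot of text. For example, they are commonly used as search patterns in the text editors that developers work with. They are also an integral part of UNIX command-line tools such as sed, grep, and awk. Many programming languages support regular expressions in the language syntax itself, such as Perl, Ruby, Awk, and Tcl. Other languages (such as C, C++, and Python) support regular expressions through libraries. The syntax used by Python's re module is based on the regular expression syntax used in Perl, with a few Python-specific enhancements.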
1. Finding patterns in text (re.search(p, text))
- import re
- pattern = 'this'
- text = 'Does this text match the pattern?'
- match = re.search(pattern, text)
- s = match.start()
- e = match.end()
- print 'Found "%s" in:\n"%s"\nfrom %d to %d ("%s")' % \
-     (match.re.pattern, match.string, s, e, text[s:e])
Found "this" in:
"Does this text match the pattern?"
from 5 to 9 ("this")
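The Match object
Attributes:
- string: The text that was passed to match() or search().
- re: The Pattern object that produced the match.
- pos: The index in the text at which the regular expression started searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
- endpos: The index in the text at which the regular expression stopped searching. It has the same value as the parameter of the same name of Pattern.match() and Pattern.search().
- lastindex: The integer index (group number) of the last captured group, or None if no group was captured.
- lastgroup: The name of the last captured group, or None if that group has no name or no group was captured.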
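Methods:
- group([group1, …]): Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a group number or a group name; number 0 stands for the entire match; with no arguments, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns the last captured substring.
- groups([default]): Returns a tuple of the strings captured by all groups, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing, and it is None unless specified.
- groupdict([default]): Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included. default has the same meaning as above.
- start([group]): Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
- end([group]): Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
- span([group]): Returns (start(group), end(group)).
- expand(template): Substitutes the captured groups into template and returns the result. template can refer to groups with \id, \g<id>, or \g<name>. \id and \g<id> are equivalent for groups 1 and up, but \0 is not a group reference, so the whole match has to be written as \g<0>. \10 is read as group 10; to express \1 followed by the character '0', write \g<1>0.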
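The attributes and methods listed above can be tried out on a single match. The following is a minimal sketch written for Python 3 (an assumption, since the other listings here use Python 2 print statements); the pattern and the sample text are made up for illustration.

```python
import re

# Hypothetical sample: one unnamed group and one named group.
text = 'x=10, width=250'
m = re.search(r'(\w+)=(?P<value>\d+)', text)

print(m.string)                   # the text that was searched
print(m.re.pattern)               # pattern string of the compiled Pattern
print(m.pos, m.endpos)            # search bounds: 0 and len(text) by default
print(m.group())                  # whole match, same as m.group(0)
print(m.group(1, 'value'))        # several groups at once, as a tuple
print(m.groups())                 # all groups from 1 upwards
print(m.groupdict())              # named groups only
print(m.start(2), m.end(2), m.span())
print(m.lastindex, m.lastgroup)   # number and name of the last matched group
# \g<1>0 is group 1 followed by a literal '0'; \10 would mean group 10.
print(m.expand(r'\g<value> \g<1>0'))
```

Because the last matched group is the named one, lastgroup is 'value' here; with an unnamed final group it would be None.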
2. Compiling expressions (re.compile(p))
- import re
- regexes = [re.compile(p) for p in ['this', 'that']]
- text = 'Does this text match the pattern?'
- print 'Text: %r\n' % text
- for regex in regexes:
-     print 'Seeking "%s" ->' % regex.pattern,
-     if regex.search(text):
-         print 'match!'
-     else:
-         print 'no match'
Text: 'Does this text match the pattern?'

Seeking "this" -> match!
Seeking "that" -> no match
3. Multiple matches (re.findall(p, text))
- import re
- text = 'abbaaabbbaaaa'
- pattern = 'ab*'
- print re.findall(pattern, text)
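finditer() returns an iterator that yields Match objects instead of the strings returned by findall():
- import re
- text = 'abbaaabbbbaaaa'
- pattern = 'ab'
- for match in re.finditer(pattern, text):
-     s = match.start()
-     e = match.end()
-     print "Found '%s' at %d:%d" % (text[s:e], s, e)
Found 'ab' at 0:2
Found 'ab' at 5:7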
4. Pattern syntax
Repetition and non-greedy matching
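As a rough illustration of repetition, the sketch below compares the greedy quantifiers with their non-greedy counterparts (the same quantifier followed by ?). It is written for Python 3, and the sample text is an assumption chosen only to make the difference visible.

```python
import re

text = 'abbaabbba'

# Greedy quantifiers consume as much of the input as they can.
greedy = {p: re.findall(p, text)
          for p in ['ab*', 'ab+', 'ab?', 'ab{2,3}']}

# Adding '?' makes the same quantifiers stop as early as possible.
lazy = {p: re.findall(p, text)
        for p in ['ab*?', 'ab+?', 'ab??', 'ab{2,3}?']}

for table in (greedy, lazy):
    for pattern, found in table.items():
        print('%-9s %r' % (pattern, found))
```

For example, 'ab*' finds 'abb' and 'abbb' where 'ab*?' only ever finds 'a', because zero repetitions already satisfy the non-greedy form.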
Character sets
Escape codes
- pattern_list=[r'\d+', r'\D+', r'\s+', r'\w+',]
- pattern = r'\\.\+'
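Anchoring
- pattern = r'^\w+' # word at the start of the string
- pattern = r'\A\w+' # word at the start of the string
- pattern = r'\w+\S*$' # word near the end of the string, skipping trailing punctuation
- pattern = r'\w+\S*\Z' # word near the end of the string, skipping trailing punctuation
- pattern = r'\w*t\w*' # word containing the letter t
- pattern = r'\bt\w+' # word starting with t
- pattern = r'\w+t\b' # word ending with t
- pattern = r'\Bt\B' # t that is neither at the start nor at the end of a word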
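Note that the pattern inside a look-behind assertion must match text of a fixed length, so variable-length quantifiers such as * and + cannot be used there. Look-ahead assertions have no such restriction.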
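The length restriction can be checked directly. The following sketch, written for Python 3 with a made-up sample string, uses a fixed-width look-behind, a look-ahead that contains a quantifier, and a look-behind with + that the re module refuses to compile.

```python
import re

text = 'price: $100, cost: $2500'

# Fixed-width look-behind: digits that directly follow a literal '$'.
amounts = re.findall(r'(?<=\$)\d+', text)

# A look-ahead may contain variable-length quantifiers such as \s*.
labels = re.findall(r'\w+(?=:\s*\$)', text)

# A look-behind containing '+' is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=\$\d+),')
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(amounts)
print(labels)
print(lookbehind_error)
```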
5. Constraining the search
- import re
- text = 'This is some text -- with punctuation.'
- pattern = 'is'
- print 'Text :', text
- print 'Pattern :', pattern
- m = re.match(pattern, text)
- print "Match :", m
- s = re.search(pattern, text)
- print 'Search :', s
Text : This is some text -- with punctuation.
Pattern : is
Match : None
Search : <_sre.SRE_Match object at 0xb7626090>
- import re
- text = 'This is some his bist -- with punctuation.'
- pattern = re.compile(r'\b\w*is\w*\b')
- print "Text: ", text
- begin = 5
- end = 7
- match = pattern.search(text, begin, end)
- if match:
-     s = match.start()
-     e = match.end()
-     print ' %2d :%2d = "%s"' % (s, e-1, text[s:e])
Text:  This is some his bist -- with punctuation.
5 : 6 = "is"
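The two listings in this section contrast match(), which only looks at the start of the text, with search(), and then narrow a compiled pattern's search with start and end positions. A compact Python 3 sketch of the same ideas follows; fullmatch() is not used in the listings above and is included only as a related call.

```python
import re

text = 'This is some text -- with punctuation.'

# match() only succeeds if the pattern matches at the start of the text.
at_start = re.match('is', text)

# search() scans forward and finds the first occurrence anywhere.
anywhere = re.search('is', text)

# fullmatch() (Python 3.4+) requires the whole text to match.
whole = re.fullmatch(r'[\w\s.-]+', text)

# A compiled pattern's search() also accepts pos and endpos, which
# restrict the scan to roughly text[pos:endpos].
word = re.compile(r'\b\w*is\w*\b')
limited = word.search(text, 5, 7)

print(at_start)
print(anywhere.span())
print(whole is not None)
print(limited.group(), limited.span())
```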
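6. Dissecting matches with groups (groups())
Matching groups
- pattern = 'a(ab)' # a followed by ab
- pattern = 'a(a*b*)' # a followed by 0-n a and 0-n b
- pattern = 'a(ab)*' # a followed by 0-n ab
- pattern = 'a(ab)+' # a followed by 1-n ab
Any complete expression can be turned into a group and nested inside a larger expression. All of the repetition modifiers can be applied to a group as a whole, which requires the entire group pattern to repeat.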
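A short sketch of what nesting and repeating a group means for the captured values, written for Python 3 with made-up input strings:

```python
import re

# Nested groups are numbered by the position of their opening parenthesis.
m = re.search(r'a((ab)+)', 'aababab')
print(m.group(0))   # whole match
print(m.group(1))   # outer group: every repetition of 'ab'
print(m.group(2))   # inner group: only the last repetition

# A repeated group keeps only what its final repetition captured.
digits = re.search(r'(\d)+', 'year 2024')
print(digits.group(0), digits.group(1))
```

To keep every repetition, wrap the repeated group in an outer group, as the first pattern does.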
- import re
- text = "This is some text -- with punctuation."
- print text
- patterns = [
-     (r'^(\w+)', 'word at start of string'),
-     (r'(\w+)\S*$', 'word at end, with optional punctuation'),
-     (r'(\bt\w+)\W+(\w+)', 'word starting with t, another word'),
-     (r'(\w+t)\b', 'word ending with t'),
- ]
- for pattern, desc in patterns:
-     regex = re.compile(pattern)
-     match = regex.search(text)
-     print 'Pattern %r (%s)' % (pattern, desc)
-     print ' ', match.groups()
This is some text -- with punctuation.
Pattern '^(\\w+)' (word at start of string)
('This',)
Pattern '(\\w+)\\S*$' (word at end, with optional punctuation)
('punctuation',)
Pattern '(\\bt\\w+)\\W+(\\w+)' (word starting with t, another word)
('text', 'with')
Pattern '(\\w+t)\\b' (word ending with t)
('text',)
Matching a single group (Match.group(n))
- import re
- text = 'This is some text -- with punctuation'
- print 'Input text :', text
- # word starting with 't' then another word
- regex = re.compile(r'(\bt\w+)\W+(\w+)')
- print 'Pattern :', regex.pattern
- match = regex.search(text)
- print 'Entire match :', match.group(0)
- print 'Word starting with "t":', match.group(1)
- print 'Word after "t" word :', match.group(2)
Input text : This is some text -- with punctuation
Pattern : (\bt\w+)\W+(\w+)
Entire match : text -- with
Word starting with "t": text
Word after "t" word : with
Named groups ((?P<name>pattern))
- import re
- text = 'Text is some text -- with punctuation.'
- print text
- for pattern in [
-     r'^(?P<first_word>\w+)',
-     r'(?P<last_word>\w+)\S*$',
-     r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
-     r'(?P<ends_with_t>\w+t)\b',
- ]:
-     regex = re.compile(pattern)
-     match = regex.search(text)
-     print 'Matching "%s"' % pattern
-     print ' ', match.groups()
-     print ' ', match.groupdict()
Text is some text -- with punctuation.
Matching "^(?P<first_word>\w+)"
('Text',)
{'first_word': 'Text'}
Matching "(?P<last_word>\w+)\S*$"
('punctuation',)
{'last_word': 'punctuation'}
Matching "(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)"
('text', 'with')
{'other_word': 'with', 't_word': 'text'}
Matching "(?P<ends_with_t>\w+t)\b"
('Text',)
{'ends_with_t': 'Text'}
Back-references
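As a minimal sketch of a back-reference (Python 3, with a made-up sentence), the pattern below uses \1 to require that the text captured by group 1 appears again, which finds accidentally doubled words. The second pattern expresses the same idea with a named group and (?P=name).

```python
import re

text = 'Paris in the the spring is is nice'

# \1 must match exactly the text that group 1 matched.
doubled = re.compile(r'\b(\w+)\s+\1\b')
repeated_words = doubled.findall(text)
cleaned = doubled.sub(r'\1', text)

# The same idea with a named group and (?P=name).
named = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
first = named.search(text)

print(repeated_words)
print(cleaned)
print(first.group('word'))
```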
Alternatives ((pattern1)|(pattern2))
Non-capturing groups ((?:pattern))
- pattern = r'(\d+)(?:\.?)(?:\d+)([¥$])$'
This way M.group(1) and M.group(2) can still be used to read the results, and they are again "8000" and "¥".
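The effect on group numbering can be seen by comparing the pattern above with a fully capturing variant. This is a sketch for Python 3; the input string '8000.00¥' is an assumption, since the text being matched is not shown here.

```python
import re

# Assumed sample input; the surrounding text does not show the original.
text = '8000.00¥'

capturing = re.search(r'(\d+)(\.?)(\d+)([¥$])$', text)
non_capturing = re.search(r'(\d+)(?:\.?)(?:\d+)([¥$])$', text)

print(capturing.groups())      # every parenthesis contributes a group
print(non_capturing.groups())  # (?:...) groups are left out
```

With the non-capturing form the currency symbol moves from group 4 to group 2, so code that only needs the amount and the symbol does not have to skip over unused groups.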
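7. Search options
Case-insensitive matching (re.IGNORECASE)
IGNORECASE makes literal characters and character ranges in the pattern match both uppercase and lowercase characters.
- with_case = re.compile(pattern, re.IGNORECASE)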
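A small Python 3 sketch of the difference the flag makes, using the sample sentence from the earlier sections; the pattern itself is only an illustration.

```python
import re

text = 'This is some text -- with punctuation.'
pattern = r'\bT\w+'

# Without the flag only the capitalized word matches.
case_sensitive = re.findall(pattern, text)

# With IGNORECASE the literal 'T' also matches 't'.
case_insensitive = re.findall(pattern, text, re.IGNORECASE)

# Character ranges are affected too: [a-z] also matches 'T' here.
ranges = re.findall(r'\b[a-z]his\b', text, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
print(ranges)
```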
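Multiline input (re.MULTILINE)
- import re
- text = 'This is some text -- with punctuation.\nA second line.'
- pattern = r'(^\w+)|(\w+\S*$)'
- single_line = re.compile(pattern)
- multiline = re.compile(pattern, re.MULTILINE)
- print 'Text:\n %r' % text
- print 'Pattern:\n %s' % pattern
- print 'Single Line:'
- for match in single_line.findall(text):
-     print ' %r' % (match,)
- print 'MULTILINE :'
- for match in multiline.findall(text):
-     print ' %r' % (match,)
With MULTILINE, the anchors ^ and $ also match at the start and end of each line, next to every "\n"; by default they match only at the start and end of the whole string. The output is:
Text:
'This is some text -- with punctuation.\nA second line.'
Pattern:
(^\w+)|(\w+\S*$)
Single Line:
('This', '')
('', 'line.')
MULTILINE :
('This', '')
('', 'punctuation.')
('A', '')
('', 'line.')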
The DOTALL flag lets the "." character match a newline as well, which it does not do by default.
- import re
- text = 'This is some text -- with punctuation.\nA second line.'
- pattern = r'.+'
- no_newlines = re.compile(pattern)
- dotall = re.compile(pattern, re.DOTALL)
- print 'Text:\n %r' % text
- print 'Pattern:\n %r' % pattern
- print 'No newlines :'
- for match in no_newlines.findall(text):
-     print ' %r' % match
- print 'Dotall :'
- for match in dotall.findall(text):
-     print ' %r' % match
Text:
'This is some text -- with punctuation.\nA second line.'
Pattern:
'.+'
No newlines :
'This is some text -- with punctuation.'
'A second line.'
Dotall :
'This is some text -- with punctuation.\nA second line.'
Unicode
Verbose expressions (re.VERBOSE)
- import re
- address = re.compile(
-     r"""
-     # The named group name; it may include '.'
-     # for title abbreviations and middle initials.
-     (
-       (?P<name>([\w.,]+\s+)*[\w.,]+)
-       \s*
-       # Email addresses are wrapped in angle
-       # brackets: <> but only if a name is found,
-       # so keep the opening bracket in this group
-       <
-     )? # the entire name is optional
-     # Match the email address: username@domain.tld
-     (?P<email>
-       [\w\d.+-]+ # username
-       @
-       ([\w\d.]+\.)+ # domain name prefix
-       (com|org|edu) # limit the allowed top-level domains
-     )
-     >? # optional closing bracket
-     """,
-     re.UNICODE | re.VERBOSE
- )
- candidates = [
-     u'first.last@example.com',
-     u'first.last+category@gmail.com',
-     u'valid-address@mail.example.com',
-     u'not-valid@example.foo',
-     u'First Last <first.last@example.com>',
-     u'No Brackets first.last@example.com',
-     u'First Last',
-     u'First Middle Last <first.last@example.com>',
-     u'Fist M. Last <first.last@example.com>',
-     u'<first.last@example.com>',
- ]
- for cd in candidates:
-     print 'Candidate:', cd
-     match = address.search(cd)
-     if match:
-         print ' Name:', match.groupdict()['name']
-         print ' Email:', match.groupdict()['email']
-     else:
-         print ' No match'
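As in other programming languages, being able to insert comments into a verbose regular expression helps readability and maintainability. The output is:
Candidate: first.last@example.com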
Name: None
Email: first.last@example.com
Candidate: first.last+category@gmail.com
Name: None
Email: first.last+category@gmail.com
Candidate: valid-address@mail.example.com
Name: None
Email: valid-address@mail.example.com
Candidate: not-valid@example.foo
No match
Candidate: First Last <first.last@example.com>
Name: First Last
Email: first.last@example.com
Candidate: No Brackets first.last@example.com
Name: None
Email: first.last@example.com
Candidate: First Last
No match
Candidate: First Middle Last <first.last@example.com>
Name: First Middle Last
Email: first.last@example.com
Candidate: Fist M. Last <first.last@example.com>
Name: Fist M. Last
Email: first.last@example.com
Candidate: <first.last@example.com>
Name: None
Email: first.last@example.com
Embedding flags in patterns
- pattern = r'(?i)\bT\w+'
Flag | Abbreviation |
IGNORECASE | i |
MULTILINE | m |
DOTALL | s |
UNICODE | u |
VERBOSE | x |
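The abbreviations in the table can be tried as follows. This is a Python 3 sketch with a made-up two-line text; note that from Python 3.11 on, a global inline flag group such as (?i) is only accepted at the very start of the expression.

```python
import re

text = 'This is some text -- with punctuation.\nA second line.'

# (?i) at the start of the pattern behaves like passing re.IGNORECASE.
inline = re.findall(r'(?i)\bT\w+', text)
explicit = re.findall(r'\bT\w+', text, re.IGNORECASE)

# Several abbreviations can be combined in one group, here i and m.
line_starts = re.findall(r'(?im)^[a-z]+', text)

print(inline)
print(explicit)
print(line_starts)
```

Embedding the flag is mainly useful when the pattern is passed to code that does not let the caller supply flags separately.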
8. Self-referencing expressions
A back-reference such as \1 matches the same text that the numbered group matched earlier in the expression:
- import re
- address = re.compile(
-     r"""
-     # Match the name
-     (\w+) # first name
-     \s+
-     (([\w.]+)\s+)? # optional middle name or initial
-     (\w+) # last name
-     \s+
-     <
-     # The address: first_name.last_name@domain.tld
-     (?P<email>
-       \1 # first name
-       \.
-       \4 # last name
-       @
-       ([\w\d.]+\.)+
-       (com|org|edu)
-     )
-     >
-     """,
-     re.UNICODE | re.VERBOSE | re.IGNORECASE
- )
Named groups can be referenced with (?P=name) instead of a group number:
- address = re.compile(
-     r"""
-     # The regular name
-     (?P<first_name>\w+) # first name
-     \s+
-     (([\w.]+)\s+)? # optional middle name or initial
-     (?P<last_name>\w+) # last name
-     \s+
-     <
-     # The address: first_name.last_name@domain.tld
-     (?P<email>
-       (?P=first_name) # first name
-       \.
-       (?P=last_name) # last name
-       @
-       ([\w\d.]+\.)+
-       (com|org|edu)
-     )
-     >
-     """,
-     re.UNICODE | re.VERBOSE | re.IGNORECASE
- )
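A conditional pattern, (?(id)yes|no), changes how the rest of the expression matches depending on whether an earlier group matched:
- address = re.compile(
-     r"""
-     ^
-     # Match the name first; it may include "."
-     (?P<name>
-       ([\w.]+\s+)*[\w.]+
-     )?
-     \s* # 0-n whitespace characters
-     # Only if the name group matched, look ahead for the angle brackets
-     (?(name)
-       # zero-width positive look-ahead for <.*>,
-       # wrapped in a group named brackets
-       (?P<brackets>
-         (?=(<.*>$))
-       )
-       |
-       # if the name group did not match, the remainder must not
-       # start with < or end with >, again checked with a look-ahead
-       (?=([^<].*[^>]$))
-     )
-     # If the brackets group matched, match <, otherwise match whitespace
-     (?(brackets)< | \s*)
-     # Match the email address: username@domain.tld
-     (?P<email>
-       [\w\d.+-]+ # username
-       @
-       ([\w\d.]+\.)+ # domain name prefix
-       (com | org | edu) # top-level domain
-     )
-     # If the brackets group matched, match >, otherwise match whitespace
-     (?(brackets)>|\s*)
-     $
-     """,
-     re.UNICODE | re.VERBOSE | re.IGNORECASE
- )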
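The conditional syntax is easier to follow on a much smaller pattern. The sketch below (Python 3) is not the expression above but a reduced stand-in: group 1 records whether an opening < was seen, and (?(1)>|$) then requires a closing > only in that case. The candidate strings are made up.

```python
import re

# Group 1 records whether an opening '<' was seen; (?(1)>|$) then
# requires '>' if it was, and the end of the string if it was not.
address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',
    'user@host.com',
    '<user@host.com',
    'user@host.com>',
]
results = {c: bool(address.match(c)) for c in candidates}

for candidate, ok in results.items():
    print('%-16s %s' % (candidate, 'match' if ok else 'no match'))
```

Only the balanced forms match: an address with both brackets or with neither.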
9. Modifying strings with patterns (sub())
- import re
- bold = re.compile(r'\*{2}(.*?)\*{2}')
- text = 'Make this **bold**. This **too**.'
- print 'Text:', text
- print 'Bold:',bold.sub(r'<b>\1</b>', text)
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This <b>too</b>.
- bold2 = re.compile(r'\*{2}(?P<bold_text>.*?)\*{2}',re.UNICODE)
- print 'Text:', text
- print 'Bold:',bold2.sub(r'<b>\g<bold_text></b>', text)
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This <b>too</b>.
- import re
- bold = re.compile(r'\*{2}(.*?)\*{2}', re.UNICODE)
- text = 'Make this **bold**. This **too**.'
- print 'Text:', text
- print 'Bold:',bold.sub(r'<b>\1</b>', text, count = 1)
Text: Make this **bold**. This **too**.
Bold: Make this <b>bold</b>. This **too**.
10. Splitting with patterns
- import re
- text = """Paragraph one
- on two lines.
-
- Paragraph two.
-
-
- Paragraph three.
- """
- for num, para in enumerate(re.findall(r'(.+?)\n{2,}',
-                                       text,
-                                       flags=re.DOTALL)):
-     print num, repr(para)
0 'Paragraph one\non two lines.'
1 'Paragraph two.'
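The pattern can be extended to say that a paragraph ends with two or more newlines or at the end of the input, or the text can be split on the separators with re.split() instead.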
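One possible form of that extended findall() pattern is sketched below for Python 3. The terminator is written as a non-capturing group so that findall() returns only the paragraph text; the sample text is an assumption and, unlike the listing above, has no trailing newline.

```python
import re

# Assumed sample text without a trailing newline.
text = ('Paragraph one\non two lines.\n\n'
        'Paragraph two.\n\n\n'
        'Paragraph three.')

# A paragraph ends at two or more newlines or at the end of the input.
paragraphs = re.findall(r'(.+?)(?:\n{2,}|$)', text, flags=re.DOTALL)

for num, para in enumerate(paragraphs):
    print(num, repr(para))
```

The last paragraph is now found as well, at the cost of a more complicated pattern.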
- import re
- text = """Paragraph one
- on two lines.
-
- Paragraph two.
-
-
- Paragraph three.
- """
- text2 = 'one1two2three3four4'
- for num, para in enumerate(re.split(r'\n{2,}', text)):
-     print num, repr(para)
- for num, para in enumerate(re.split(r'\d+', text2)):
-     print num, repr(para)
0 'Paragraph one\non two lines.'
1 'Paragraph two.'
2 'Paragraph three.\n'
0 'one'
1 'two'
2 'three'
3 'four'
4 ''
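Enclosing the pattern in parentheses makes it a capturing group, so the separators are returned along with the pieces:
- import re
- text = """Paragraph one
- on two lines.
-
- Paragraph two.
-
-
- Paragraph three.
- """
- text2 = 'one1two2three3four4'
- for num, para in enumerate(re.split(r'(\n{2,})', text)):
-     print num, repr(para)
- for num, para in enumerate(re.split(r'(\d+)', text2)):
-     print num, repr(para)
0 'Paragraph one\non two lines.'
1 '\n\n'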
2 'Paragraph two.'
3 '\n\n\n'
4 'Paragraph three.\n'
0 'one'
1 '1'
2 'two'
3 '2'
4 'three'
5 '3'
6 'four'
7 '4'
8 ''
To use parentheses for grouping without having the separators show up in the result, use a non-capturing group (?:...):
- import re
- line = 'one two; three, four, five,six, seven'
- result1 = re.split(r'[;,\s]\s*', line)
- print result1
- result2 = re.split(r'(;|,|\s)\s*', line)
- print result2
- values = result2[::2]
- delimiters = result2[1::2]+['']
- print values
- print delimiters
- print ''.join(v+d for v,d in zip(values, delimiters))
- result3 = re.split(r'(?:;|,|\s)\s*', line)
- print result3
result1:
['one', 'two', 'three', 'four', 'five', 'six', 'seven']
result2:
['one', ' ', 'two', ';', 'three', ',', 'four', ',', 'five', ',', 'six', ',', 'seven']
['one', 'two', 'three', 'four', 'five', 'six', 'seven']
[' ', ';', ',', ',', ',', ',', '']
one two;three,four,five,six,seven
result3:
['one', 'two', 'three', 'four', 'five', 'six', 'seven']