Regular Expressions

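Regular expressions, regex for short, are a way of describing text patterns. For example, \d is a regular expression that stands for a single digit character, that is, any one digit from 0 to 9.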

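Python can use the regular expression \d\d\d-\d\d\d-\d\d\d\d to match this sequence of characters: three digits, a hyphen, three digits, another hyphen, and four digits. No other string matches this expression.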

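Regular expressions can also be more sophisticated. For example, adding a 3 in curly braces ({3}) after a pattern means "match this pattern three times". So the shorter regular expression \d{3}-\d{3}-\d{4} matches the same phone number format.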
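As a quick check of that claim, here is a minimal sketch that compiles both spellings with Python's standard re module and compares them on a few strings. The names long_form, short_form, and samples are made up for this illustration.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

samples = ['415-555-4242', '415-555-424', '4155554242', 'abc-def-ghij']
for text in samples:
    # fullmatch() succeeds only if the whole string fits the pattern
    long_ok = long_form.fullmatch(text) is not None
    short_ok = short_form.fullmatch(text) is not None
    print(text, long_ok, short_ok)
```

Only the first sample is accepted, and the two patterns agree on every string.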

colou?r matches color or colour. The ? means the preceding character may appear at most once (zero times or one time).

runoo*b matches runob, runoob, runoooooob, and so on. The * means the preceding character may be absent, or may appear once or many times (zero or more times).

runoo+b matches runoob, runooob, runoooooob, and so on. The + means the preceding character must appear at least once (one or more times).
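The three quantifiers can be checked side by side. The sketch below is only an illustration: the cases table and the extra non-matching strings (colouur, runb) are my own additions, and re.fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

# Each entry: (pattern, strings that should match, strings that should not)
cases = [
    (r'colou?r', ['color', 'colour'], ['colouur']),
    (r'runoo*b', ['runob', 'runoob', 'runoooooob'], ['runb']),
    (r'runoo+b', ['runoob', 'runooob'], ['runob']),
]

for pattern, should_match, should_not_match in cases:
    for text in should_match:
        assert re.fullmatch(pattern, text) is not None
    for text in should_not_match:
        assert re.fullmatch(pattern, text) is None
    print(pattern, 'behaves as described')
```

Note the one difference between * and +: runob is accepted by runoo*b but rejected by runoo+b.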

All of Python's regular expression functions live in the re module. Import it at the top of your code with the following line:

　　import re

Passing a string value that represents a regular expression to re.compile() returns a Regex pattern object (a Regex object for short).

For example:

　　phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

　　The phoneNumRegex variable now contains a Regex object.

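A Regex object's search() method scans the string it is given and looks for the first place where the regular expression matches. If the pattern is not found in the string, search() returns None. If the pattern is found, search() returns a Match object. That object has a group() method, which returns the text that was actually matched in the searched string.

Example: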

　　import re
　　phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
　　mo = phoneNumRegex.search('My number is 415-555-4342.')
　　print('Phone number found: ' + mo.group())

Its output is:

　　Phone number found: 415-555-4342
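Because search() returns None when nothing matches, calling group() on the result without checking would raise an AttributeError. One way to guard against that is sketched below; find_phone is a hypothetical helper name, not something from the re module.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    # search() returns None on failure, so test before calling group()
    if mo is None:
        return None
    return mo.group()

print(find_phone('My number is 415-555-4342.'))  # 415-555-4342
print(find_phone('No number here.'))             # None
```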

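Grouping with parentheses:

　　Suppose you want to separate the area code from the rest of the phone number. Adding parentheses creates groups in the regular expression: (\d\d\d)-(\d\d\d-\d\d\d\d). You can then use the group() method of the Match object to get the matched text of a single group. Passing 1 or 2 returns the first or second group, while passing 0 or nothing returns the entire matched text.

Example: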

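　　import re

　　phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
　　mo = phoneNumRegex.search('My number is 415-555-4242.')

　　print(mo.group(1))    # first group: 415
　　print(mo.group(2))    # second group: 555-4242
　　print(mo.group(0))    # whole match: 415-555-4242
　　print(mo.group())     # same as group(0): 415-555-4242
　　print(mo.groups())    # all groups as a tuple: ('415', '555-4242')

　　# groups() returns a tuple, so it can be unpacked into two variables
　　areaCode, mainNumber = mo.groups()
　　print(areaCode)       # 415
　　print(mainNumber)     # 555-4242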

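Parentheses have a special meaning in regular expressions, so what if you need to match literal parentheses in the text? Escape them with the escape character \ (backslash).

Example: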

　　import re

　　phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')

　　mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

　　print(mo.group(1))

　　print(mo.group(2))
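To make the difference between grouping parentheses and escaped parentheses concrete, here is a small sketch. The variable names are illustrative, and the last line uses re.escape(), a standard re function not covered in this post, which adds the backslashes for you when the text to match is a fixed string.

```python
import re

text = 'My phone number is (415) 555-4242.'

# Unescaped parentheses only create a group; they match no literal characters
grouped = re.search(r'(\d\d\d)', text)
print(grouped.group())   # 415

# Escaped parentheses match the literal '(' and ')' characters in the text
literal = re.search(r'\(\d\d\d\)', text)
print(literal.group())   # (415)

# re.escape() inserts the backslashes needed to match a string literally
escaped = re.escape('(415)')
print(escaped)           # \(415\)
```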

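Matching multiple groups with the pipe

The | character is called a pipe. Use it wherever you want to match one of several expressions. For example, the expression r'Batman|Ironman' matches either 'Batman' or 'Ironman'.

Example: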

　　import re

　　heroRegex = re.compile(r'Batman|Ironman')

　　mo1 = heroRegex.search('Batman and Ironman')
　　print(mo1.group())

　　mo2 = heroRegex.search('Ironman and Batman')
　　print(mo2.group())

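　　When both Batman and Ironman occur in the searched string, the text of the first occurrence is returned as the Match object. Here mo1.group() is Batman and mo2.group() is Ironman.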
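One subtlety is worth a short sketch: "first occurrence" refers to position in the searched text, not to the order of the alternatives in the pattern. The order in the pattern only matters when two alternatives could start at the same position. The variable names below are illustrative.

```python
import re

# Position in the text decides: Batman comes first in the string,
# even though Ironman is listed first in the pattern
by_position = re.search(r'Ironman|Batman', 'Batman and Ironman').group()
print(by_position)   # Batman

# When two alternatives can start at the same position, the leftmost
# alternative in the pattern wins, even if a later one would match more text
short_first = re.search(r'Bat|Batman', 'Batman').group()
long_first = re.search(r'Batman|Bat', 'Batman').group()
print(short_first)   # Bat
print(long_first)    # Batman
```

So when one alternative is a prefix of another, listing the longer one first is usually what you want.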

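You can also use the pipe to match one of several patterns as part of a larger regular expression.

Suppose we want to match any one of 'Batman', 'Batmobile', 'Batcopter', or 'Batbat'. Since all of them start with Bat, parentheses let us write that prefix only once: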

　　import re
　　batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
　　mo = batRegex.search('Batmobile lost a wheel')
　　mo1 = batRegex.search('Batbat is not Catcat.')
　　print(mo.group())
　　print(mo.group(1))
　　print(mo1.group())
　　print(mo1.group(1))

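Optional matching with the question mark

Sometimes the pattern you want to match is optional. That is, the expression should match whether or not that piece of text is present. The ? character marks the group before it as optional in the pattern. Example: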

　　import re
　　batRegex = re.compile(r'Bat(wo)?man')
　　mo1 = batRegex.search('The Adventures of Batman')
　　print(mo1.group())
　　mo2 = batRegex.search('The Adventures of Batwoman')
　　print(mo2.group())

　　The output is:

　　Batman
　　Batwoman

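　　This shows that the expression matched both strings. As mentioned earlier, the ? means the preceding character (here, the group (wo)) may appear at most once (zero times or one time).

Matching zero or more times with the star. Let's go straight to an example: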
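The same idea applies to the phone number pattern from earlier. The following is a sketch of my own, not from the original examples: it makes the area code optional by wrapping it, together with its hyphen, in a group followed by ?.

```python
import re

# (\d{3}-)? makes the whole "area code plus hyphen" group optional
phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}')

mo1 = phoneRegex.search('My number is 415-555-4242')
print(mo1.group())    # 415-555-4242
print(mo1.group(1))   # 415-

mo2 = phoneRegex.search('My number is 555-4242')
print(mo2.group())    # 555-4242
print(mo2.group(1))   # None
```

When the optional group takes no part in the match, group(1) returns None rather than an empty string.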

　　import re
　　batRegex = re.compile(r'Bat(wo)*man')
　　mo1 = batRegex.search('The Adventures of Batman')
　　print(mo1.group())
　　mo2 = batRegex.search('The Adventures of Batwoman')
　　print(mo2.group())
　　mo3 = batRegex.search('The Adventures of Batwowowowoman')
　　print(mo3.group())

The output is:

Batman
Batwoman
Batwowowowoman

Matching one or more times with the plus. Example:

　　import re
　　batRegex = re.compile(r'Bat(wo)+man')
　　mo1 = batRegex.search('The Adventures of Batman')
　　print(mo1 is None)
　　mo2 = batRegex.search('The Adventures of Batwoman')
　　print(mo2.group())
　　mo3 = batRegex.search('The Adventures of Batwowowowoman')
　　print(mo3.group())

The output is:

True
Batwoman
Batwowowowoman

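The first variable, mo1, is None because search() could not find the pattern: (wo)+ requires at least one wo, and Batman contains none.

Matching a specific number of repetitions with curly braces

　　As mentioned at the beginning, if you want a group or a character to repeat a specific number of times, follow it in the expression with a number in curly braces {}. For example, the expression (Ha){3} matches 'HaHaHa' but does not match 'HaHa'.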

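　　Instead of a single number you can specify a range by writing a minimum, a comma, and a maximum inside the curly braces. For example, the expression (Ha){3,5} matches 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.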

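　　You can also leave out the first or the second number in the curly braces to leave the minimum or the maximum unbounded. For example, (Ha){3,} matches three or more repetitions, while (Ha){,5} matches zero to five repetitions.

Example: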
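The three range forms can be compared with a small sketch. count_ha is a hypothetical helper written for this illustration; it reports how many Ha repetitions the first match contains.

```python
import re

def count_ha(pattern, text):
    """Return the number of 'Ha' repetitions in the first match, or None."""
    mo = re.search(pattern, text)
    return None if mo is None else len(mo.group()) // 2

six = 'Ha' * 6   # 'HaHaHaHaHaHa'
print(count_ha(r'(Ha){3,5}', six))      # 5
print(count_ha(r'(Ha){3,}', six))       # 6
print(count_ha(r'(Ha){,5}', six))       # 5
print(count_ha(r'(Ha){,5}', 'Hi'))      # 0
print(count_ha(r'(Ha){3,5}', 'HaHa'))   # None
```

The fourth call returns 0 rather than None: with a minimum of zero, the pattern matches the empty string at the start of 'Hi'.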

　　import re
　　haRegex = re.compile(r'(Ha){3}')
　　mo1 = haRegex.search('HaHaHa')
　　print(mo1.group())

　　mo2 = haRegex.search('HaHa')
　　print(mo2 is None)

　　The output is:

　　HaHaHa
　　True

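　　True shows that mo2 is None: the pattern did not match, so search() returned None.

Greedy and non-greedy matching with curly braces: in the string 'HaHaHaHaHa', (Ha){3,5} could match three, four, or five instances of Ha. Python's regular expressions are greedy by default, which means that in ambiguous situations they match the longest string possible. The non-greedy version matches the shortest string possible and is written by putting a question mark after the closing curly brace. Example (compare the output of the two code snippets below):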

import re
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())

************************************************************

import re
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())
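For convenience, here are both variants in a single sketch with the outputs shown as comments. The variable names greedy and nongreedy are mine.

```python
import re

text = 'HaHaHaHaHa'

# Greedy: take as many repetitions as the range allows (five here)
greedy = re.search(r'(Ha){3,5}', text)
print(greedy.group())      # HaHaHaHaHa

# Non-greedy: stop as soon as the minimum of three is satisfied
nongreedy = re.search(r'(Ha){3,5}?', text)
print(nongreedy.group())   # HaHaHa
```

Note that the question mark has two unrelated meanings in regular expressions: after a group or character it marks it as optional, and after a quantifier it switches that quantifier to non-greedy matching.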

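The findall() method

The search() method only finds the first match in the string, whereas findall() returns the strings of every match in the searched string. findall() returns not a Match object but a list of strings, as long as the regular expression contains no groups. Example: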

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # has no groups
print(phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000'))
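The "has no group" comment matters: when the pattern does contain groups, findall() returns a list of tuples, one tuple per match with one string per group. A sketch comparing the two cases (no_groups and with_groups are illustrative names):

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Without groups: a list of the matched strings
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(no_groups.findall(text))
# ['415-555-9999', '212-555-0000']

# With groups: a list of tuples holding each group's text
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(with_groups.findall(text))
# [('415', '555', '9999'), ('212', '555', '0000')]
```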

posted @ 2018-10-13 23:13 ITdafei
