Python Automation 0: Regular Expressions

1. The re module: compile, search, group, groups, findall, sub

Matching method 1: search. search() returns a Match object containing the first match found in the searched string, or None if nothing matches.

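import re

xxxRe = re.compile(r'(\d+)-(\d+)')   # re.compile(r'regular expression'); xxxRe is a placeholder name
mo = xxxRe.search('pages 10-20')     # search('string to search')
# If the regular expression has no groups:
res = mo.group()
# If the regular expression has groups:
res = mo.group()    # same as mo.group(0): returns the whole matched text, here '10-20'
res = mo.group(1)   # mo.group(1), mo.group(2), ... return the text matched by each group, here '10'
res = mo.groups()   # returns a tuple of strings, here ('10', '20')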
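
Because search() returns None when nothing matches, calling group() on the result without a check can raise an AttributeError. A minimal sketch of a guard follows; the helper name first_match and the sample strings are made up for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if nothing matches."""
    mo = re.compile(pattern).search(text)
    # search() returns None when there is no match, so check before calling group()
    return mo.group() if mo is not None else None

found = first_match(r'\d{3}-\d{4}', 'Call 555-0199 today')
missing = first_match(r'\d{3}-\d{4}', 'no number here')
print(found)    # 555-0199
print(missing)  # None
```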

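Matching method 2: findall. findall() returns a list of strings containing every match in the searched string. If the pattern has groups, it returns a list of tuples instead.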

'''search'''
phoneNumRe = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRe.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'''findall: run both in the interactive shell to see the difference'''
phoneNumRe = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phoneNumRe.findall('Cell: 415-555-9999 Work: 212-555-0000')
mo

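Method 3: sub. sub() takes two arguments. The first is the replacement string that takes the place of each match. The second is the string to search, that is, the text the regular expression is applied to. sub() returns the string with the replacements made.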

namesRe = re.compile(r'Agent \w+')
mo = namesRe.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
mo

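Sometimes you need to use the matched text itself as part of the replacement. In the first argument to sub(), you can write \1, \2, \3, and so on, meaning "insert the text of group 1, 2, 3, ... into the replacement". For example, suppose you want to hide the agents' names and show only the first letter of each. To do this, use the regular expression Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string is replaced by the text matched by group 1, that is, the (\w) group of the regular expression.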

agentNamesRe = re.compile(r'Agent (\w)\w*')
mo = agentNamesRe.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
mo
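
Backreferences are not limited to \1. As a small sketch under the same idea, the two-group phone pattern used elsewhere in this post can be used to reformat every number in one call; the output format chosen here is only an illustration.

```python
import re

phoneNumRe = re.compile(r'(\d{3})-(\d{3}-\d{4})')
text = 'Cell: 415-555-9999 Work: 212-555-0000'
# \1 is the area code (group 1), \2 is the rest of the number (group 2)
reformatted = phoneNumRe.sub(r'(\1) \2', text)
print(reformatted)  # Cell: (415) 555-9999 Work: (212) 555-0000
```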

2. Groups (), group numbering, ranges [], and repetition counts {}

'''Groups ()'''
phoneNumRe = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phoneNumRe.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()
mo.group(0)
mo.group(1)
mo.group(2)
mo.groups()
mo = phoneNumRe.findall('Cell: 415-555-9999 Work: 212-555-0000')
mo
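
When the pattern has groups, findall() returns one tuple per match, and nested groups each get their own number according to the position of their opening parenthesis. A small sketch with a nested variant of the phone pattern; the outer pair of parentheses is added here purely for illustration.

```python
import re

# Group 1 is the outer pair, group 2 the area code, group 3 the rest of the number
nestedRe = re.compile(r'((\d{3})-(\d{3}-\d{4}))')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

first = nestedRe.search(text).groups()
every = nestedRe.findall(text)
print(first)  # ('415-555-9999', '415', '555-9999')
print(every)  # [('415-555-9999', '415', '555-9999'), ('212-555-0000', '212', '555-0000')]
```
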
'''Group numbering: read the regular expression from the start and add one to the count at every opening parenthesis.'''
datePattern = re.compile(r"""^(.*?) # all text before the date
                              ((0|1)?\d)- # one or two digits for the month
                              ((0|1|2|3)?\d)- # one or two digits for the day
                              ((19|20)\d\d) # four digits for the year
                              (.*?)$ # all text after the date
                              """, re.VERBOSE)
'''The same pattern with the content of each group replaced by its number (a schematic, not a pattern to run):
^(1)        # all text before the date
(2 (3) )-   # one or two digits for the month
(4 (5) )-   # one or two digits for the day
(6 (7) )    # four digits for the year
(8)$        # all text after the date
'''
'''Ranges []: [a-z] matches any single character in the given range; [^a-z] matches any single character that is not in the given range'''
phoneNumRe = re.compile(r'[0-9]{3}-[0-9]{3}-[0-9]{4}')
'''Repetition counts {}: {n,m} matches at least n and at most m times, {n,} matches at least n times, {n} matches exactly n times'''
haRe = re.compile(r'(Ha){3}')
mo = haRe.search('HaHaHaHaHa')
mo.group()
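
To make the range and repetition rules above concrete, here is a small sketch that runs each form on a short sample string. The sample strings and variable names are made up for illustration.

```python
import re

text = 'abc123DEF'
lower = re.compile(r'[a-z]+').findall(text)       # runs of characters inside the range
not_lower = re.compile(r'[^a-z]+').findall(text)  # runs of characters outside the range
print(lower)      # ['abc']
print(not_lower)  # ['123DEF']

digits = '1 22 333 4444'
exactly_two = re.compile(r'\d{2}').findall(digits)     # {n}: exactly 2 digits per match
two_or_three = re.compile(r'\d{2,3}').findall(digits)  # {n,m}: 2 to 3 digits per match
two_or_more = re.compile(r'\d{2,}').findall(digits)    # {n,}: at least 2 digits per match
print(exactly_two)   # ['22', '33', '44', '44']
print(two_or_three)  # ['22', '333', '444']
print(two_or_more)   # ['22', '333', '4444']
```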

3. The pipe character |

'''Pipe character |: use it when you want to match one of several expressions. search returns the first occurrence found, findall returns all occurrences'''
heroRe = re.compile(r'Batman|Tina Fey')
mo = heroRe.search('Batman and Tina Fey.')
mo.group()

heroRe = re.compile(r'Batman|Tina Fey')
mo = heroRe.findall('Batman and Tina Fey.')
mo
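
The pipe can also be placed inside a group, so that several alternatives share a common prefix. A minimal sketch; the sample sentence is only an illustration.

```python
import re

# The prefix 'Bat' is written once; the group lists the possible endings
batRe = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRe.search('Batmobile lost a wheel')
whole = mo.group()    # the full matched text
suffix = mo.group(1)  # only the alternative that matched inside the group
print(whole)   # Batmobile
print(suffix)  # mobile
```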

4. ? + *

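'''
?  matches the preceding subexpression zero or one time
+  matches the preceding subexpression one or more times
*  matches the preceding subexpression zero or more times
'''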
batRe = re.compile(r'Bat(wo)?man')
mo1 = batRe.search('The Adventures of Batman')
mo1.group()

batRe = re.compile(r'Bat(wo)+man')
mo2 = batRe.search('The Adventures of Batwowowowoman')
mo2.group()

batRe = re.compile(r'Bat(wo)*man')
mo1 = batRe.search('The Adventures of Batman')
mo1.group()
mo2 = batRe.search('The Adventures of Batwowowowoman')
mo2.group()
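
The difference between the three quantifiers is easiest to see by running all of them against the same inputs. The following sketch records, for each quantifier, whether the pattern matches each name; the dictionary layout and variable names are assumptions made for this example.

```python
import re

patterns = {'?': r'Bat(wo)?man', '+': r'Bat(wo)+man', '*': r'Bat(wo)*man'}
names = ['Batman', 'Batwoman', 'Batwowoman']

results = {}
for symbol, pattern in patterns.items():
    # True where the pattern matches the name, False where search() returns None
    results[symbol] = [re.search(pattern, name) is not None for name in names]

print(results['?'])  # [True, True, False]
print(results['+'])  # [False, True, True]
print(results['*'])  # [True, True, True]
```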

5. Greedy and non-greedy matching ?

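'''
Python's regular expressions are "greedy" by default: when a match is ambiguous, they match the longest string possible.
The "non-greedy" version of the braces, written with a question mark right after the closing brace, matches the shortest string possible.
'''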
nongreedyHaRe = re.compile(r'(Ha){3,5}?')
mo = nongreedyHaRe.search('HaHaHaHaHa')
mo.group()
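
Putting the greedy and non-greedy forms side by side shows the effect of the trailing question mark. The second half of this sketch applies the same idea to .* and .*?, which goes slightly beyond the braces discussed above; the sample strings are made up for illustration.

```python
import re

text = 'HaHaHaHaHa'
greedy = re.compile(r'(Ha){3,5}').search(text).group()      # takes as many repeats as allowed
nongreedy = re.compile(r'(Ha){3,5}?').search(text).group()  # stops at the minimum of 3
print(greedy)     # HaHaHaHaHa
print(nongreedy)  # HaHaHa

tagged = '<To serve man> for dinner.>'
greedy_tag = re.compile(r'<.*>').search(tagged).group()      # runs to the last '>'
nongreedy_tag = re.compile(r'<.*?>').search(tagged).group()  # stops at the first '>'
print(greedy_tag)     # <To serve man> for dinner.>
print(nongreedy_tag)  # <To serve man>
```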

6. Character classes

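r'''
\d     any digit from 0 to 9
\D     any character that is not a digit from 0 to 9
\w     any letter, digit, or the underscore character (think of it as matching "word" characters)
\W     any character that is not a letter, digit, or underscore
\s     a space, tab, or newline character (think of it as matching "whitespace" characters)
\S     any character that is not a space, tab, or newline
.      the "wildcard": matches any character except a newline. Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines.
(.*)   any text (greedy; without re.DOTALL it does not cross a newline)
'''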
xmasRe = re.compile(r'\d+\s\w+')
mo = xmasRe.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
mo

nameRe = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRe.search('First Name: Al Last Name: Sweigart')
mo.groups()

7. ^ $

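'''
A caret (^) at the start of a regular expression means the match must occur at the beginning of the searched text.
A dollar sign ($) at the end of a regular expression means the string must end with this pattern.
Using ^ and $ together means the entire string must match the pattern; matching only part of the string is not enough.
'''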
beginsWithHelloRe = re.compile(r'^Hello')
mo = beginsWithHelloRe.search('Hello world!')
mo.group()
mo1 = beginsWithHelloRe.search('He said Hello.')
mo1 is None  # True: there is no match, so mo1.group() would raise AttributeError

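r'''The regular expression r'\d$' matches strings that end with a digit from 0 to 9'''
endsWithNumberRe = re.compile(r'\d$')
mo = endsWithNumberRe.search('Your number is 42')
mo.group()
mo1 = endsWithNumberRe.search('42myNumber')
mo1 is None  # True: there is no match, so mo1.group() would raise AttributeError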

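r'''The regular expression r'^\d+$' matches strings that consist of digits from beginning to end'''
wholeStringIsNumRe = re.compile(r'^\d+$')
mo = wholeStringIsNumRe.search('1234567890')
mo.group()
mo1 = wholeStringIsNumRe.search('12345xyz67890')
mo1 is None  # True: there is no match, so mo1.group() would raise AttributeError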
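
A common use of ^ and $ together is input validation, where only a yes/no answer is needed. A minimal sketch; the function name is_all_digits is hypothetical and not part of the re module.

```python
import re

wholeStringIsNumRe = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True if text consists only of digits from beginning to end."""
    # search() returns None on failure, so compare against None instead of calling group()
    return wholeStringIsNumRe.search(text) is not None

print(is_all_digits('1234567890'))     # True
print(is_all_digits('12345xyz67890'))  # False
print(is_all_digits(''))               # False, because \d+ needs at least one digit
```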

8. Flags for re.compile: re.DOTALL, re.IGNORECASE/re.I, re.VERBOSE

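'''
re.DOTALL    By default, dot-star matches every character except a newline. Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines.

re.IGNORECASE re.I    Use this when you only care about matching the letters and not whether they are uppercase or lowercase. To make a regular expression case-insensitive, pass re.IGNORECASE or re.I as the second argument to re.compile().

re.VERBOSE    Ignores whitespace and comments inside the regular expression string, so it can be used to annotate a regular expression
'''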
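
The effect of re.DOTALL and re.I is easy to check on a multi-line string. A short sketch; the sample sentences are only illustrations.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'
first_line = re.compile(r'.*').search(text).group()             # stops at the first newline
everything = re.compile(r'.*', re.DOTALL).search(text).group()  # the dot also matches newlines
print(first_line)          # Serve the public trust.
print(everything == text)  # True

robocop = re.compile(r'robocop', re.I)  # case-insensitive match
name = robocop.search('RoboCop is part man, part machine, all cop.').group()
print(name)  # RoboCop
```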

'''re.VERBOSE example'''
phoneRe = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')
'''You can spread the regular expression over several lines and add comments by writing it as a triple-quoted multiline string, which makes it more readable:'''
phoneRe = re.compile(r'''(
                        (\d{3}|\(\d{3}\))? # area code
                        (\s|-|\.)? # separator
                        \d{3} # first 3 digits
                        (\s|-|\.) # separator
                        \d{4} # last 4 digits
                        (\s*(ext|x|ext\.)\s*\d{2,5})? # extension
                        )''', re.VERBOSE)
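
As a usage sketch, the verbose phone pattern can be run over a piece of text with finditer() to collect the full text of every match. The block repeats the pattern, with the dot in ext\. escaped, so that it runs on its own; the sample text is made up for illustration.

```python
import re

phoneRe = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    \d{3}                          # first 3 digits
    (\s|-|\.)                      # separator
    \d{4}                          # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

text = 'Call 415-555-1011 or (415) 555-1212 ext 34.'  # sample text for this sketch
# group(0) is the whole match; findall() would return one tuple of groups per match instead
numbers = [m.group(0) for m in phoneRe.finditer(text)]
print(numbers)  # ['415-555-1011', '(415) 555-1212 ext 34']
```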

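'''re.compile() accepts only a single value as its second argument. To get around this limitation, combine the flags with the pipe character (|), which in this context is the "bitwise or" operator.'''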
someRe = re.compile('foo', re.IGNORECASE | re.DOTALL)
someRe = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
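
To see that combined flags really act together, the sketch below uses a pattern that only matches when both re.IGNORECASE and re.DOTALL are in effect. The pattern and sample text are made up for illustration.

```python
import re

text = 'FIRST line\nthe LAST line'
pattern = r'first.*last'

both = re.compile(pattern, re.IGNORECASE | re.DOTALL).search(text)
only_ignorecase = re.compile(pattern, re.IGNORECASE).search(text)  # the dot cannot cross the newline
only_dotall = re.compile(pattern, re.DOTALL).search(text)          # the letter case does not match

print(repr(both.group()))  # 'FIRST line\nthe LAST'
print(only_ignorecase)     # None
print(only_dotall)         # None
```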

posted @ 2021-10-21 22:08 tensor_zhang
