Python Regular Expressions: The re Module (Repost)

Python Regular Expressions

Regular Expressions

Regular expressions in Python are provided by the re module, which has to be imported before use.

Syntax

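# Import the module
import re

# Compile the pattern into a regex object. ^ anchors the match at the start, and [0-9] matches any single digit from 0 to 9, so this pattern matches any string whose first character is a digit.
p = re.compile(r"^[0-9]")

# Match the string against the compiled pattern. On success m is a match object; otherwise m is None.
m = p.match('14534Abc')

if m:
    # A match was found. m.group() returns the matched text, here '1', because only the first character is matched.
    print(m.group())
else:
    print("doesn't match.")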

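The compile step and the match step above can also be combined into a single call:

m = re.match(r"^[0-9]", '14534Abc')

The result is the same. The difference is:

The first form compiles the pattern (parses the expression) ahead of time, so later matches reuse the compiled object directly.
The shorthand form has to resolve the pattern on every call. The re module caches recently compiled patterns, so the pattern is not fully recompiled each time, but every call still pays for the cache lookup.
So if you need to find every line that starts with a digit in a file of 50,000 lines, compile the pattern first and reuse it; this is somewhat faster.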
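The compile-once idea can be sketched as follows. The list `lines` is hypothetical sample data standing in for the lines of a large file, and the name `starts_with_digit` is only illustrative.

```python
import re

# Hypothetical sample data standing in for the lines of a large file.
lines = ["14534Abc", "abc123", "9 lives", "", "x1"]

# Compile once, outside the loop, then reuse the pattern object.
starts_with_digit = re.compile(r"^[0-9]")

digit_lines = [line for line in lines if starts_with_digit.match(line)]
print(digit_lines)  # ['14534Abc', '9 lives']
```

For a real file, the same pattern object could be applied to each line while iterating over the open file handle.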

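Regular Expression Metacharacters

Character matching

.       : any single character except a newline
[]      : any one character in the given set
[^]     : any one character not in the given set

Quantifiers

*       : zero or more times
.*      : any character, any number of times
?       : zero or one time
+       : one or more times
{m}     : exactly m occurrences of the preceding item
{m,n}   : at least m and at most n occurrences of the preceding item
{m,}    : at least m occurrences of the preceding item
{,n}    : at most n occurrences of the preceding item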
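To make the quantifier table concrete, here is a small sketch that applies each quantifier to the letter b in the pattern "ab". The test string and the use of \b to force whole-word matches are choices made for this illustration only.

```python
import re

text = "a ab abb abbb abbbb"

# \b on both sides forces each pattern to cover a whole word,
# so the result shows exactly which repeat counts are allowed.
results = {}
for quantifier in ["*", "?", "+", "{2}", "{2,3}", "{2,}", "{,2}"]:
    pattern = r"\bab" + quantifier + r"\b"
    results[quantifier] = re.findall(pattern, text)
    print(f"ab{quantifier:<6}", results[quantifier])
```

Each printed list contains only the words whose number of b characters falls inside the range the quantifier allows.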

Position anchors

^       : matches the start of the string
$       : matches the end of the string (or just before a trailing newline)
Grouping and references

()      : grouping; the text matched inside the parentheses is captured by the regex engine
Backreferences : \1  \2  \3 ...
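A backreference matches the same text that an earlier group captured. One common use, sketched here with an invented sentence, is finding and collapsing accidentally doubled words:

```python
import re

text = "this is is a test test of of repeated words"

# (\w+) captures a word; \1 requires the very same word to follow after a space.
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)  # ['is', 'test', 'of']

# In the replacement string, \1 inserts the captured word once.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", text)
print(cleaned)  # this is a test of repeated words
```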

Alternation

a|b     : a or b
C|cat   : C or cat
(C|c)at : Cat or cat
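The difference between the last two rows comes from precedence: | splits the whole pattern unless parentheses limit its scope. A small sketch, with re.fullmatch used so that each word has to match in its entirety (the word list is made up for the demonstration):

```python
import re

words = ["C", "Cat", "cat", "at"]

# Without parentheses the alternatives are "C" and "cat".
ungrouped = [w for w in words if re.fullmatch(r"C|cat", w)]

# With parentheses only the first letter alternates.
grouped = [w for w in words if re.fullmatch(r"(C|c)at", w)]

print(ungrouped)  # ['C', 'cat']
print(grouped)    # ['Cat', 'cat']
```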

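Escape sequences

\w      : matches a word character (letter, digit, or underscore)
\W      : matches a non-word character
\s      : matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text
\S      : matches any non-whitespace character
\d      : matches any digit; equivalent to [0-9] for ASCII text
\D      : matches any non-digit
\A      : matches the start of the string
\Z      : matches the end of the string (in Python's re, only the very end, not before a trailing newline)
\z      : matches the very end of the string in Perl-style engines; Python's re has traditionally offered only \Z for this
\G      : matches the position where the previous match ended (Perl/PCRE; not supported by Python's re)
\b      : matches a word boundary, that is, the position between a word character and a non-word character or the start or end of the string. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B      : matches a non-word-boundary position. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n      : matches a newline character
\t      : matches a tab character
\1...\9 : matches the text captured by the nth group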
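The \b and \B examples from the table can be checked directly. Note that patterns containing backslashes are best written as raw strings (r"..."), because in an ordinary Python string literal "\b" is a backspace character rather than a regex escape.

```python
import re

# "\b" in a normal literal is one backspace character; r"\b" is backslash + b.
print(len("\b"), len(r"\b"))  # 1 2

never_b = re.search(r"er\b", "never")  # 'er' sits at the end of the word
verb_b = re.search(r"er\b", "verb")    # 'er' is followed by 'b'
verb_B = re.search(r"er\B", "verb")    # 'er' is inside the word
never_B = re.search(r"er\B", "never")  # 'er' ends the word

print(never_b is not None, verb_b is not None)  # True False
print(verb_B is not None, never_B is not None)  # True False
```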

Five Common Regular Expression Operations

re.match(pattern, string, flags=0)

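Tries to match the pattern at the start of the string only, and returns a single match object (or None if there is no match).

pattern: the regular expression
string: the string to match against
flags: flags that control how the pattern is matched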

import re

obj = re.match(r'\d+', '957evescn')
if obj:
    print(obj.group())

# Output
957

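# Flags, as defined in the re module source of older Python versions; in your own code use re.I, re.M, re.S and so on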
I = IGNORECASE = sre_compile.SRE_FLAG_IGNORECASE # ignore case
L = LOCALE = sre_compile.SRE_FLAG_LOCALE # assume current 8-bit locale
U = UNICODE = sre_compile.SRE_FLAG_UNICODE # assume unicode locale
M = MULTILINE = sre_compile.SRE_FLAG_MULTILINE # make anchors look for newline
S = DOTALL = sre_compile.SRE_FLAG_DOTALL # make dot match newline
X = VERBOSE = sre_compile.SRE_FLAG_VERBOSE # ignore whitespace and comment
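A short sketch of how some of these flags change matching. The sample text and the `number` pattern are invented for the illustration; flags are combined with the | operator.

```python
import re

text = "Hello\nWORLD"

# re.I: ignore case.
ignore_case = re.findall(r"world", text, re.I)

# re.S: let . match a newline as well.
dot_default = re.search(r"Hello.WORLD", text)
dot_all = re.search(r"Hello.WORLD", text, re.S)

# Flags can be combined: case-insensitive and line-by-line anchors.
both = re.findall(r"^[a-z]+$", text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+         # integer part
    (?:\.\d+)?  # optional fractional part
""", re.X)

print(ignore_case)                       # ['WORLD']
print(dot_default, dot_all is not None)  # None True
print(both)                              # ['Hello', 'WORLD']
print(bool(number.fullmatch("3.14")))    # True
```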

re.search(pattern, string, flags=0)

Scans through the whole string and returns the first match (or None if nothing matches)

import re

obj = re.search(r'\d+', 'gmkk957evescn')
if obj:
    print(obj.group())

# Output
957
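The difference between re.match and re.search is easiest to see on the same input. The sketch below also includes re.fullmatch, which is not covered above and requires the whole string to match.

```python
import re

s = "gmkk957evescn"

at_start = re.match(r"\d+", s)     # the string does not start with a digit
anywhere = re.search(r"\d+", s)    # the first run of digits anywhere
whole = re.fullmatch(r"\d+", s)    # the whole string is not digits only
whole_ok = re.fullmatch(r"\d+", "957")

print(at_start)          # None
print(anywhere.group())  # 957
print(whole)             # None
print(whole_ok.group())  # 957
```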

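group and groups

group() returns the whole match or the text of a specific group. If no group number is given, it returns the whole match.
groups() returns a tuple containing the text of every group. Example: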

import re
 
a = "123abc456"
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group())
 
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(0))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(1))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(2))
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).group(3))
 
print(re.search("([0-9]*)([a-z]*)([0-9]*)", a).groups())
 
# Output
123abc456
 
123abc456
123
abc
456
 
('123', 'abc', '456')
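Two details are worth a quick sketch: a group that did not take part in the match is reported as None, and a pattern built only from * quantifiers can succeed with an empty match. The input strings here are made up for the illustration.

```python
import re

# The second group is optional and does not take part in this match.
m = re.search(r"(\d+)([a-z]+)?", "123")
print(m.groups())            # ('123', None)
print(m.groups(default=""))  # ('123', '')
print(m.span(1))             # (0, 3)

# Every part of this pattern may match nothing, so on a string that starts
# with other characters the search succeeds immediately with an empty match.
empty = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "--123abc456")
print(repr(empty.group()), empty.span())  # '' (0, 0)
```

If at least one digit or letter is required, + is usually the safer quantifier than *.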

re.findall(pattern, string, flags=0)

Finds every non-overlapping match and returns them as a list

import re

obj = re.findall(r'\D+', 'evescn666gmkk')
print(obj)

# Output
['evescn', 'gmkk']
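When the pattern contains groups, re.findall returns the group contents instead of the whole match. If positions are needed as well, re.finditer yields full match objects. A minimal sketch with an invented input string:

```python
import re

s = "evescn666gmkk957"

# With two groups, each result is a tuple of the group contents.
pairs = re.findall(r"(\D+)(\d+)", s)
print(pairs)  # [('evescn', '666'), ('gmkk', '957')]

# finditer gives match objects, so the start offset is available too.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", s)]
print(positions)  # [('666', 6), ('957', 13)]
```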

re.sub(pattern, repl, string, count=0, flags=0)

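Replaces the text matched by the pattern

import re

content = "123abc456"
new_content = re.sub(r'\d+', 'sb', content)
# new_content = re.sub(r'\d+', 'sb', content, count=1)  # replace only the first match
print(new_content)

# Output
sbabcsb

re.sub is more powerful than str.replace, because it replaces by pattern rather than by a fixed substring.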
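Two things str.replace cannot do are sketched below: reusing captured groups in the replacement, and computing the replacement with a function. The date string and the helper name `double` are hypothetical; re.subn is the variant that also reports how many replacements were made.

```python
import re

# Reorder captured groups: YYYY-MM-DD becomes DD/MM/YYYY.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "posted 2023-06-23")
print(swapped)  # posted 23/06/2023


def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)


# The replacement can be a function that receives each match object.
doubled = re.sub(r"\d+", double, "123abc456")
print(doubled)  # 246abc912

# subn returns the new string together with the number of replacements.
result, count = re.subn(r"\d+", "sb", "123abc456")
print(result, count)  # sbabcsb 2
```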

re.split(pattern, string, maxsplit=0, flags=0)

Splits the string into a list, using each match of the pattern as a separator

import re

content = '1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'
new_content = re.split(r'\*', content)
# new_content = re.split(r'\*', content, maxsplit=1)
print(new_content)

###### Output (first line: the call above; second line: the commented-out maxsplit=1 variant)
['1 - 2 ', ' ((60-30+1', '(9-2', '5/3+7/3', '99/4', '2998+10', '568/14))-(-4', '3)/(16-3', '2) )']
['1 - 2 ', ' ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )']
######

content = '1 - 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )'
new_content = re.split(r'[\+\-\*\/]+', content)
# new_content = re.split(r'[\+\-\*\/]+', content, maxsplit=1)
print(new_content)

###### Output (first line: the call above; second line: the commented-out maxsplit=1 variant)
['1 ', ' 2 ', ' ((60', '30', '1', '(9', '2', '5', '3', '7', '3', '99', '4', '2998', '10', '568', '14))', '(', '4', '3)', '(16', '3', '2) )']
['1 ', ' 2 * ((60-30+1*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2) )']
######

inpp = '1-2*((60-30 +(-40-5)*(9-2*5/3 + 7 /3*99/4*2998 +10 * 568/14 )) - (-4*3)/ (16-3*2))'
# Remove all whitespace
inpp = re.sub(r'\s+', '', inpp)
print(inpp)

# Split once at the first parenthesized two-operand expression, such as (-40-5);
# because the pattern has a capture group, the captured text is kept in the result
new_content = re.split(r'\(([\+\-\*\/]?\d+[\+\-\*\/]?\d+){1}\)', inpp, maxsplit=1)
print(new_content)

###### Output
1-2*((60-30+(-40-5)*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))
['1-2*((60-30+', '-40-5', '*(9-2*5/3+7/3*99/4*2998+10*568/14))-(-4*3)/(16-3*2))']
######
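The last example relies on a rule that is easy to miss: if the separator pattern contains a capture group, re.split keeps the captured text in the result. A smaller sketch on a shorter, made-up expression shows the effect, with re.findall as one possible alternative for tokenizing:

```python
import re

expr = "1-2*(60-30)"

# Without a group the operators are dropped.
plain = re.split(r"[+\-*/]", expr)
print(plain)  # ['1', '2', '(60', '30)']

# With a capture group the operators are kept as separate items.
kept = re.split(r"([+\-*/])", expr)
print(kept)  # ['1', '-', '2', '*', '(60', '-', '30)']

# findall can list numbers, operators and parentheses as individual tokens.
tokens = re.findall(r"\d+|[+\-*/()]", expr)
print(tokens)  # ['1', '-', '2', '*', '(', '60', '-', '30', ')']
```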

A Few Common Regex Examples

Matching a mobile phone number

import re

phone_str = "my name is evescn, and my phone number is 18111555666"

m = re.search(r"(1)([358]\d{9})", phone_str)
if m:
    print(m.group())

# Output
18111555666

Matching an IPv4 address

import re

ip_addr = "inet 172.19.133.212 brd 172.19.143.255"

m = re.search(r"(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}", ip_addr)

print(m.group())

# Output
172.19.133.212
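Because re.search looks for the pattern anywhere in the text, it can also pick a valid-looking address out of an invalid one. When the goal is to validate a whole string, re.fullmatch is one option. The sketch below reuses the pattern above; the helper name `is_ipv4` is hypothetical.

```python
import re

IPV4 = re.compile(
    r"(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}"
)


def is_ipv4(text):
    """Hypothetical helper: True only if the whole string is a dotted quad."""
    return IPV4.fullmatch(text) is not None


# search finds a substring that fits, even though 999 is not a valid octet.
print(IPV4.search("999.1.1.1").group())  # 99.1.1.1

print(is_ipv4("172.19.133.212"))  # True
print(is_ipv4("999.1.1.1"))       # False
```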

Extracting contact details with groups

import re

contactInfo = 'Evescn, ChengDu: 028-8888888'

# Numbered groups
match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
"""
>>> match.group(1)
  'Evescn'
>>> match.group(2)
  'ChengDu'
>>> match.group(3)
  '028-8888888'
"""

# Named groups
match = re.search(r'(?P<name>\w+), (?P<addr>\w+): (?P<phone>\S+)', contactInfo)
"""
>>> match.group('name')
  'Evescn'
>>> match.group('addr')
  'ChengDu'
>>> match.group('phone')
  '028-8888888'
"""

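Matching an email address

import re

email = "evescn.gmkk@163.com   http://blog.evescn.com"

# The dot before the final domain part is escaped so that it matches a literal "."
m = re.search(r"[0-9.a-z]{0,26}@[0-9.a-z]{0,20}\.[0-9a-z]{0,8}", email)
print(m.group())

# Output
evescn.gmkk@163.com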

Reposted from

http://www.cnblogs.com/alex3714/articles/5143440.html

posted @ 2023-06-23 15:03 evescn
