Regular Expression Usage in Python
Regular expressions:
The re module
import re
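re.match(pattern, string): tries to match only at the beginning of the string; returns None if the pattern does not match there
re.search(pattern, string): scans the string from left to right and returns the first match, then stops
re.findall(pattern, string): scans the whole string from left to right and returns a list of all non-overlapping matches
re.sub(pattern, repl, string): replaces every match of pattern with repl (a string or a function)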
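The difference between these four functions is easiest to see on a single input. The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

text = 'id=42, id=7, id=100'

# re.match only tries at position 0, and the string starts with 'id', not a digit.
at_start = re.match(r'\d+', text)
print(at_start)                      # None

# re.search scans forward and stops at the first hit.
first = re.search(r'\d+', text)
print(first.group())                 # 42

# re.findall collects every non-overlapping hit.
all_ids = re.findall(r'\d+', text)
print(all_ids)                       # ['42', '7', '100']

# re.sub replaces every hit.
masked = re.sub(r'\d+', '#', text)
print(masked)                        # id=#, id=#, id=#
```

Note that match and search return a match object (or None), while findall returns plain strings.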
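Basics:
[]: character class (a set or range of characters, e.g. [a-z])
.: any character except a newline (unless re.DOTALL is set)
|: alternation (either side)
(): a group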
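Quantifiers:
*: zero or more times
+: one or more times
?: zero or one time
{m}: exactly m times
{m,}: at least m times
{m,n}: between m and n times, inclusive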
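The basic symbols and quantifiers above can be checked one by one with re.fullmatch, which succeeds only when the whole string matches. This is a small sketch; the helper is_full and the sample strings are hypothetical.

```python
import re

def is_full(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_full(r'ab*', 'a'))                # True: * allows zero b's
print(is_full(r'ab+', 'a'))                # False: + needs at least one b
print(is_full(r'colou?r', 'color'))        # True: ? makes the u optional
print(is_full(r'\d{3}', '1234'))           # False: exactly 3 digits
print(is_full(r'\d{2,}', '1234'))          # True: 2 or more digits
print(is_full(r'\d{2,3}', '1234'))         # False: at most 3 digits
print(is_full(r'(ab|cd)[xyz].', 'cdy!'))   # True: group, class, any character
```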
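Predefined character classes:
\s whitespace
\S non-whitespace
\d digit
\D non-digit
\w word character, [0-9a-zA-Z_] (for str patterns in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)
\W non-word character, [^0-9a-zA-Z_]
\b word boundary
\B not a word boundary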
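A short sketch of the predefined classes in action, with sample strings invented for illustration. It also shows how re.ASCII narrows \w to [0-9a-zA-Z_].

```python
import re

text = 'cat concat cat_1 42 cats'

# \b anchors at a word boundary, so only the standalone word is found.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)       # ['cat']

# \B requires that there is no boundary: 'cat' preceded by a word character.
inside_words = re.findall(r'\Bcat', text)
print(inside_words)      # ['cat'] (the one inside 'concat')

# \d+ picks out runs of digits, \s+ splits on runs of whitespace.
numbers = re.findall(r'\d+', text)
print(numbers)           # ['1', '42']
tokens = re.split(r'\s+', text)
print(tokens)            # ['cat', 'concat', 'cat_1', '42', 'cats']

# For str patterns, \w is Unicode-aware by default; re.ASCII restricts it.
unicode_words = re.findall(r'\w+', 'caf\u00e9 42')
ascii_words = re.findall(r'\w+', 'caf\u00e9 42', re.ASCII)
print(unicode_words)     # ['café', '42']
print(ascii_words)       # ['caf', '42']
```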
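Groups:
() ----> group(1)
By number:
(\w+)(\d) ----> group(1) group(2)
Backreferences:
(\w+)(\d) \1 \2: \1 and \2 refer back to the text captured by groups 1 and 2
By name:
(?P<name>\w+) (?P=name): (?P<name>...) defines a named group and (?P=name) refers back to it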
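Numbered and named backreferences can be sketched as follows. The sample strings and the group names name and body are chosen here for illustration only.

```python
import re

# \1 refers back to whatever group 1 captured: find a doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group(1))          # is

# (?P<name>...) names a group; (?P=name) must match the same text again.
tag_pattern = r'<(?P<name>\w+)>(?P<body>.*)</(?P=name)>'
tag = re.match(tag_pattern, '<b>bold</b>')
print(tag.group('name'))         # b
print(tag.group('body'))         # bold
print(tag.groupdict())           # {'name': 'b', 'body': 'bold'}

# A closing tag that differs from the opening tag does not match.
mismatch = re.match(tag_pattern, '<b>bold</i>')
print(mismatch)                  # None
```

Named groups are still numbered, so tag.group(1) and tag.group('name') return the same text.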
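Greedy matching:
In Python, quantifiers are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible;
non-greedy matching is the opposite: it always tries to match as few characters as possible.
Appending ? to "*", "?", "+", or "{m,n}" turns a greedy quantifier into a non-greedy one.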
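The effect is most visible when the text after the quantifier can match in more than one place. A minimal sketch, with an invented sample string:

```python
import re

html = '<a>one</a><a>two</a>'

# Greedy .* runs to the end and backtracks to the LAST '</a>'.
greedy = re.findall(r'<a>(.*)</a>', html)
print(greedy)        # ['one</a><a>two']

# Non-greedy .*? stops at the FIRST '</a>' it can reach.
lazy = re.findall(r'<a>(.*?)</a>', html)
print(lazy)          # ['one', 'two']

# The same applies to {m,n}: greedy takes up to n, non-greedy settles for m.
greedy_digits = re.match(r'\d{2,4}', '12345').group()
lazy_digits = re.match(r'\d{2,4}?', '12345').group()
print(greedy_digits, lazy_digits)    # 1234 12
```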
# Uppercase letters [A-Z]
msg = 'FKRITOFLSDKFWWPGVL'
result = re.match(r'[A-Z]+', msg)
print(result)
# Lowercase letters [a-z]
msg = 'sdfwsdfsfsf'
result = re.match(r'[a-z]+', msg)
print(result)
# Digits: [0-9] or \d
msg = '334322341098'
result = re.match(r'\d+', msg)
print(result)
# Phone number with area code: the number is 5 to 11 digits long and must not start with 0
msg = '020-43948574'
result = re.match(r'(\d{3}|\d{4})-([1-9]\d{4,10})', msg)
print(result)
area_num = result.group(1)
phone_num = result.group(2)
print('Area code: {}, phone: {}'.format(area_num, phone_num))
# Mobile number: starts with 1, second digit is 3, 5, 7, or 8, 11 digits in total
msg = '18665028070'
result = re.match(r'1[3578]\d{9}$', msg)
print(result)
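One detail worth knowing when validating input this way: $ also matches just before a trailing newline, so re.match with $ accepts a number followed by '\n'. A possible alternative, sketched here with a hypothetical helper is_mobile, is re.fullmatch, which requires the entire string to match.

```python
import re

# Same rule as above: 1, then one of 3/5/7/8, then 9 more digits.
MOBILE = re.compile(r'1[3578]\d{9}')

def is_mobile(s):
    """Return True only if the whole string is a valid mobile number."""
    return MOBILE.fullmatch(s) is not None

print(is_mobile('18665028070'))      # True
print(is_mobile('28665028070'))      # False: does not start with 1
print(is_mobile('186650280701'))     # False: 12 digits
print(is_mobile('18665028070\n'))    # False: trailing newline rejected

# With match and $, the trailing newline slips through.
loose = re.match(r'1[3578]\d{9}$', '18665028070\n')
print(loose is not None)             # True
```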
# Email: qq, 126, 163, or 139, e.g. 4lkjl2lj234l@qq.com
msg = '4223lsds2l_42@139.cn'
result = re.match(r'\w{5,15}@(qq|126|163|139)\.(com|cn)', msg)
print(result)
# HTML tags
# Named groups: (?P<name>...) defines a group, (?P=name) refers back to it
msg = '<html><div><a>百度一下就知道了</a></div></html>'
result = re.match(r'(<(?P<tag1>[0-9a-zA-Z]+)>(.*)</(?P=tag1)>)', msg)
print(result)
print(result.group())
print('0---', result.group(0))
print('1---', result.group(1))
print('2---', result.group(2))
print('3---', result.group(3))
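# sub: add 1 to every score
msg = '001:91,002:99,003:95'
def func(pattern):
    # re.sub passes a match object here (despite the parameter name)
    match = pattern.group(1)    # the whole matched piece, e.g. ':91,'
    temp1 = pattern.group(2)    # the score digits, e.g. '91'
    temp2 = int(temp1) + 1
    return match.replace(temp1, str(temp2))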
result = re.sub(r'(:(\d+),?)', func, msg)
print(result)
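The same replacement can be written more compactly. The sketch below is an alternative, not the approach used above: it assumes a lookbehind (?<=:) so that only the digits after a colon are matched, and a lambda that returns the incremented score.

```python
import re

msg = '001:91,002:99,003:95'

# (?<=:) asserts that a ':' precedes the digits without consuming it,
# so the IDs (001, 002, 003) are left alone.
bumped = re.sub(r'(?<=:)\d+', lambda m: str(int(m.group()) + 1), msg)
print(bumped)    # 001:92,002:100,003:96
```

Because the match contains only the score itself, there is no need to rebuild the surrounding ':' and ',' with str.replace.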
# split: split the string on ':' or ','
msg = '001:91,002:99,003:95'
result = re.split(r'[:,]', msg)
print(result)
# Greedy vs. non-greedy
msg = 'abc1234abc'
result = re.match(r'abc(\d+)', msg)  # greedy: matches 'abc1234'
result2 = re.match(r'abc(\d+?)', msg)  # non-greedy: matches 'abc1'
print(result)
print(result2)
------The value of learning lies in sharing, in keeping notes, and in summarizing.