Python Basics - 10: Regular Expressions and re

Posted on 2021-10-23 08:58 Kingdomer


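1. Regular Expressions

A regular expression is a logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of them.

A regular expression is made up of ordinary characters (such as the letters a to z) and special characters (metacharacters).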

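\A: Matches only at the start of the string.

\Z: Matches only at the end of the string. In Python, unlike $, it does not match before a trailing newline.

\b: Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, r'\bpy' matches the py in "python" but not the py in "openpyxl".

\B: Matches a non-word boundary. r'\Bpy' matches the py in "openpyxl" but not the py in "python".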

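\d: Matches any digit. Equivalent to [0-9] for ASCII text (with str patterns in Python 3 it also matches other Unicode decimal digits).

\D: Matches any non-digit character. Equivalent to [^\d].

\s: Matches any whitespace character. Equivalent to [ \t\n\r\f\v] for ASCII text.

\S: Matches any non-whitespace character. Equivalent to [^\s].

\w: Matches any letter, digit or underscore. Equivalent to [0-9a-zA-Z_] for ASCII text (with str patterns in Python 3 it also matches Unicode word characters such as Chinese characters).

\W: Matches any character that is not a letter, digit or underscore. Equivalent to [^\w].

\\: Matches a literal backslash \.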
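The escape sequences above can be tried out in a short script. This is only a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# \b vs \B: only the "py" that starts a word matches \bpy
word_start = re.findall(r'\bpy', 'python openpyxl')
inside_word = re.search(r'\Bpy', 'openpyxl')
no_inside = re.search(r'\Bpy', 'python')

print(word_start)           # ['py']
print(inside_word.span())   # (4, 6)
print(no_inside)            # None

# \d, \w, \s, \S on a small sample line (illustrative data)
sample = 'id_42 = 3.14\n'
digits = re.findall(r'\d+', sample)
words = re.findall(r'\w+', sample)
spaces = re.findall(r'\s', sample)
chunks = re.findall(r'\S+', sample)

print(digits)               # ['42', '3', '14']
print(words)                # ['id_42', '3', '14']
print(spaces)               # [' ', ' ', '\n']
print(chunks)               # ['id_42', '=', '3.14']
```

Writing the patterns as raw strings (r'...') keeps Python from interpreting \b as a backspace character before the re module sees it.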

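Quantifiers (re stands for the preceding regular expression):

re*: Matches 0 or more occurrences of the preceding expression.

re+: Matches 1 or more occurrences of the preceding expression.

re?: Matches 0 or 1 occurrence of the preceding expression.

re{m}: Matches exactly m occurrences of the preceding expression.

re{m,}: Matches m or more occurrences of the preceding expression.

re{m,n}: Matches between m and n occurrences of the preceding expression.

Non-greedy forms: *? +? ?? {m,n}?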
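As a quick illustration of the quantifiers, the sketch below applies each one to the same made-up string. The \b anchors are added only so that each result is a whole word.

```python
import re

text = 'a ab abb abbb abbbb'   # illustrative sample

star = re.findall(r'\bab*\b', text)         # b repeated 0 or more times
plus = re.findall(r'\bab+\b', text)         # b repeated 1 or more times
optional = re.findall(r'\bab?\b', text)     # b repeated 0 or 1 time
exact = re.findall(r'\bab{2}\b', text)      # exactly 2 b's
at_least = re.findall(r'\bab{2,}\b', text)  # 2 or more b's
between = re.findall(r'\bab{2,3}\b', text)  # 2 to 3 b's

print(star)       # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)       # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)   # ['a', 'ab']
print(exact)      # ['abb']
print(at_least)   # ['abb', 'abbb', 'abbbb']
print(between)    # ['abb', 'abbb']
```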

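^: Matches the start of the string.

$: Matches the end of the string, or just before a newline at the end of the string.

.: Matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline.

[...]: A set of characters, listed individually. For example, [amk] matches 'a', 'm' or 'k'.

(re): Matches the expression inside the parentheses and also defines a group.

a|b: Matches a or b.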
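The anchors and metacharacters listed above might be exercised as follows. This is a sketch with made-up inputs; it also shows how $ and \Z differ in Python when the string ends with a newline.

```python
import re

# ^ and $ together force the whole string to match
anchored = re.search(r'^\d+$', '2021')
not_anchored = re.search(r'^\d+$', '2021-10')
print(anchored.group())      # 2021
print(not_anchored)          # None

# $ also matches just before a trailing newline; \Z does not
dollar = re.search(r'\d+$', '2021\n')
end_z = re.search(r'\d+\Z', '2021\n')
print(dollar.group())        # 2021
print(end_z)                 # None

# . skips newlines unless re.DOTALL is given
no_flag = re.search(r'a.b', 'a\nb')
with_flag = re.search(r'a.b', 'a\nb', re.DOTALL)
print(no_flag)               # None
print(with_flag.span())      # (0, 3)

# character set, alternation and groups
set_hits = re.findall(r'[amk]', 'make')
alternatives = re.findall(r'cat|dog', 'cat, dog, cow')
page_range = re.search(r'(\d+)-(\d+)', 'pages 10-23').groups()
print(set_hits)              # ['m', 'a', 'k']
print(alternatives)          # ['cat', 'dog']
print(page_range)            # ('10', '23')
```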

2. The re Module

2.1 The match method

import re

msg = '娜扎热巴佟丽娅'
pattern = re.compile('佟丽娅')
result = pattern.match(msg)
print(result)                   # None

msg = '佟丽娅娜扎热巴'
result = pattern.match(msg)     # run the match again on the new string
print(result)                   # <re.Match object; span=(0, 3), match='佟丽娅'>

# match only matches at the start of the string; it returns None if that fails
s = '娜扎热巴佟丽娅'
result = re.match('佟丽娅', s)
print(result)                   # None

2.2 The search method

result = re.search('佟丽娅', s)
print(result)               # <re.Match object; span=(4, 7), match='佟丽娅'>
print(result.span())        # (4, 7)
print(result.group())       # 佟丽娅
print(result.groups())      # ()


msg = 'abcd8ddaa3dooridosr'
result = re.search('[a-z][0-9][a-z]', msg)
print(result.group())       # d8d
print(result.groups())      # ()

s = '原因3'
result = re.search('[0-9]', s)
print(result)               # <re.Match object; span=(2, 3), match='3'>
print(result.group())       # 3

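2.3 The findall method

findall scans the whole string: after each match it keeps searching until the end of the string, and returns all matches as a list.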

msg = 'abcd8ddaa3doorid55osr'
result = re.findall('[a-z][0-9]+[a-z]', msg)
print(result)               # ['d8d', 'a3d', 'd55o']
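One detail worth knowing: when the pattern contains groups, findall returns the group contents instead of the whole match. The sketch below reuses the same sample string; the variable names are illustrative.

```python
import re

msg = 'abcd8ddaa3doorid55osr'

# No groups: each item is the full match
whole = re.findall(r'[a-z][0-9]+[a-z]', msg)
print(whole)       # ['d8d', 'a3d', 'd55o']

# One group: each item is what that group matched
numbers = re.findall(r'[a-z]([0-9]+)[a-z]', msg)
print(numbers)     # ['8', '3', '55']

# Several groups: each item is a tuple of group contents
pairs = re.findall(r'([a-z])([0-9]+)', msg)
print(pairs)       # [('d', '8'), ('a', '3'), ('d', '55')]
```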

2.4 Matching with regular expressions

qq = '129499444'            # QQ number validation: must not start with 0, 5-11 digits
result = re.match('^[1-9][0-9]{4,10}$', qq)
print(result)               # <re.Match object; span=(0, 9), match='129499444'>

username = 'admina'         # username starts with a letter, at least 6 characters long
result = re.match('^[a-zA-Z][0-9A-Za-z]{5,}', username)
print(result)               # <re.Match object; span=(0, 6), match='admina'>

username = 'admina11#$'     # must consist of letters and digits only
result = re.match('^[a-zA-Z][0-9A-Za-z]{5,}$', username)
print(result)               # None

username = 'admin100'
result = re.match(r'^[a-zA-Z]\w{5,}$', username)   # raw string so \w is not treated as a string escape
print(result)               # <re.Match object; span=(0, 8), match='admin100'>

msg = 'a*py ab.txt bb.py kk.png uu.py apyb.txt'
result = re.findall(r'\w*\.py\b', msg)
print(result)               #  ['bb.py', 'uu.py']

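# Match a number from 0 to 100
num = '1'
result = re.match(r'[1-9]?\d$', num)  # matches
print(result)               # 0-9 and 10-99 match; 09, 100 and 1000 do not

# Group the alternatives so that 100 is accepted and the empty string is not
result = re.match(r'(?:[1-9]?\d|100)$', num)
print(result)               # 0-100 match; 09 still fails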

# (word|word|word) matches one of several whole words; [abc] matches a single character
email = '15704393432@163.com'
result = re.match(r'\w{5,20}@(163|126|qq)\.com', email)
result = re.match(r'\w{5,20}@(163|126|qq)\.(com|cn)$', email)
print(result)

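2.5 Groups

# Naming syntax: (?P<name>regex) defines a named group, (?P=name) refers back to it
# Grouping: () --> result.group(1) returns what the group matched
# Referring back to what a group matched:
# 1. By number: \number refers to the content of group number       2. By name: (?P<name>regex)   (?P=name)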

msg = '<html><h1>text</h1></html>'
result = re.match(r'<(?P<name1>\w+)><(?P<name2>\w+)>(.+)</(?P=name2)></(?P=name1)>', msg)
print(result)           # <re.Match object; span=(0, 26), match='<html><h1>text</h1></html>'>
print(result.group(1))  # html
print(result.group(2))  # h1
print(result.group(3))  # text

# Backreference by number
result = re.match(r'<([0-9a-zA-Z]+)><([0-9a-zA-Z]+)>(.+)</\2></\1>$', msg)
print(result)           # <re.Match object; span=(0, 26), match='<html><h1>text</h1></html>'>
print(result.group(1))  # html


# When the group content does not need to be referenced
result = re.match(r'<[0-9a-zA-Z]+>(.+)</[0-9a-zA-Z]+>', msg)
print(result)           # <re.Match object; span=(0, 26), match='<html><h1>text</h1></html>'>
print(result.group(1))  # <h1>text</h1>
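Named groups can also be read back by name rather than by position. The following is a minimal sketch; the group names outer, inner and body are chosen here for illustration and are not from the examples above.

```python
import re

msg = '<html><h1>text</h1></html>'
tag_pattern = r'<(?P<outer>\w+)><(?P<inner>\w+)>(?P<body>.+)</(?P=inner)></(?P=outer)>'

found = re.match(tag_pattern, msg)
outer_tag = found.group('outer')     # look a group up by its name
all_groups = found.groupdict()       # every named group as a dict

print(outer_tag)     # html
print(all_groups)    # {'outer': 'html', 'inner': 'h1', 'body': 'text'}

# The backreference requires the closing tag to repeat the opening one
mismatch = re.match(tag_pattern, '<html><h1>text</h2></html>')
print(mismatch)      # None
```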

2.6 The sub and split methods

# sub(regex, 'new content', string) performs a replacement
result = re.sub(r'\d+', '90', 'java:99,python:100')
print(result)               # java:90,python:90

def func(temp):
    num = temp.group()
    new_num = int(num) + 1
    return str(new_num)

result = re.sub(r'\d+', func, 'java:99,python:100')
print(result)               # java:100,python:101

result = re.split(r'[,:]', 'java:99,python:100')
print(result)               # ['java', '99', 'python', '100']
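Both functions take an optional limit, and the replacement string passed to sub may refer back to groups. A short sketch on the same sample data follows; the variable names are illustrative.

```python
import re

data = 'java:99,python:100'

# \1 and \2 in the replacement refer to the groups of the match
swapped = re.sub(r'(\w+):(\d+)', r'\2:\1', data)
print(swapped)        # 99:java,100:python

# count limits how many replacements are made
first_only = re.sub(r'\d+', '90', data, count=1)
print(first_only)     # java:90,python:100

# maxsplit limits how many splits are made
head_tail = re.split(r'[,:]', data, maxsplit=1)
print(head_tail)      # ['java', '99,python:100']
```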

# Phone numbers (11 digits) that do not end in 4 or 7
phone = '14578890867'
result = re.match(r'1\d{9}[0-35-689]$', phone)
print(result)               # None

phone = '010-12345678'
result = re.match(r'(\d{3}|\d{4})-(\d{8})$', phone)
print(result)               # <re.Match object; span=(0, 12), match='010-12345678'>
print(result.group())       # 010-12345678
print(result.group(1))      # 010
print(result.group(2))      # 12345678
#print(result.group(3))     # IndexError: no such group

msg = '<html>abc</html>'
result = re.match(r'<[0-9a-zA-Z]+>(.+)</[0-9a-zA-Z]+>', msg)
print(result)               # <re.Match object; span=(0, 16), match='<html>abc</html>'>
print(result.group(1))      # abc

msg = '<html>abc</html>'
result = re.match(r'<([0-9a-zA-Z]+)>(.+)</\1>$', msg)
print(result)  # <re.Match object; span=(0, 16), match='<html>abc</html>'>
print(result.group(1))      # html
print(result.group(2))      # abc

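2.7 Greedy mode

In Python, quantifiers are greedy by default: they always try to match as many characters as possible. Non-greedy quantifiers try to match as few characters as possible.

Append ? to "*", "?", "+" or "{m,n}" to turn a greedy quantifier into a non-greedy one.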

import re

msg = 'abc123abc'
result = re.match(r'abc(\d+)',msg)
print(result)            # <re.Match object; span=(0, 6), match='abc123'>

result = re.match(r'abc(\d+?)',msg)
print(result)            # <re.Match object; span=(0, 4), match='abc1'>
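The difference is easiest to see on markup, where a greedy pattern runs to the last closing bracket. The sketch below uses a made-up HTML fragment. It also shows that a non-greedy quantifier still takes more characters when the rest of the pattern needs them.

```python
import re

html = '<h1>title</h1><p>body</p>'   # illustrative sample

greedy = re.findall(r'<.+>', html)    # runs to the last '>'
lazy = re.findall(r'<.+?>', html)     # stops at the first '>' each time

print(greedy)    # ['<h1>title</h1><p>body</p>']
print(lazy)      # ['<h1>', '</h1>', '<p>', '</p>']

# \d+? alone would take one digit, but the trailing 'abc' forces it to take all three
forced = re.match(r'abc(\d+?)abc', 'abc123abc')
print(forced.group(1))   # 123
```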
