Regular Expressions in Python

(1) Components of a Regular Expression

A regular expression is built from two kinds of elements:

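  • Literals

    • Ordinary characters, and
    • Characters that must be escaped (\, ^, $, ., |, ?, *, +, (), [], {})
  • Metacharacters (characters with a special meaning)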

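    .: any character except \n

    \d: a digit; equivalent to [0-9] for ASCII text

    \D: any non-digit; equivalent to [^0-9]

    \s: a whitespace character: space, \t, \r, \n, \f, \v

    \S: any non-whitespace character; equivalent to [^ \t\r\n\f\v]

    \w: a word character (letter, digit, or underscore); equivalent to [A-Za-z0-9_]

    \W: any non-word character; equivalent to [^A-Za-z0-9_]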

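    |: alternation, e.g. yes|no matches either yes or no

    +: one or more times

    ?: zero or one time

    *: zero or more times

    {3,5}: three to five times

    {m}: exactly m times

    {m,}: at least m times

    {,n}: at most n times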

  • Greedy and non-greedy matching

    • Non-greedy (append ? to a quantifier)

      .*?

    • Greedy (the default)

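  • Boundary matching

    ^: start of a line (only the start of the string by default; every line start with re.M)

    $: end of a line (only the end of the string by default; every line end with re.M)

    \b: word boundary

    \B: non-word boundary

    \A: start of the input

    \Z: end of the input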
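
The elements listed above can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original post.

```python
import re

# Quantifier with a range: \d{3,5} needs three to five consecutive digits.
runs = re.findall(r'\d{3,5}', '12 123 123456')
print(runs)    # ['123', '12345']

# Greedy vs. non-greedy on a made-up HTML fragment.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.*>', html)    # one match spanning the whole string
lazy = re.findall(r'<.*?>', html)     # four short matches, one per tag
print(greedy)
print(lazy)

# \b only matches where a word starts or ends.
whole_words = re.findall(r'\bcat\b', 'cat concat cat. category')
print(whole_words)    # ['cat', 'cat']

# With re.M, ^ matches at every line start, while \A stays at the input start.
lines = 'one\ntwo'
print(re.findall(r'^\w+', lines, re.M))     # ['one', 'two']
print(re.findall(r'\A\w+', lines, re.M))    # ['one']
```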

(2) The Python re Module: RegexObject

Module: import re

RegexObject: a compiled regular expression object (re.compile compiles the pattern to bytecode and caches it), which makes the pattern cheap to reuse
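
As a rough illustration of reuse, the sketch below compiles a pattern once and applies it to several strings. The list `lines` and the name `number` are hypothetical sample data, not part of the original post.

```python
import re

# Hypothetical sample input.
lines = ['order 17', 'no digits here', 'items 4 and 52']

# Compile once, then reuse the same object for every line.
number = re.compile(r'\d+')
counts = [len(number.findall(line)) for line in lines]
print(counts)    # [1, 0, 2]

# The module-level function gives the same result; it looks the pattern
# up in the module's internal cache on each call.
same = [len(re.findall(r'\d+', line)) for line in lines]
print(same == counts)    # True

# The compiled object remembers its source pattern.
print(number.pattern)
```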

The findall method

import re
text = "Tom is 8 years old. Mike is 23 years old"
pattern = re.compile(r'\d+')
pattern.findall(text)
['8', '23']
pattern = re.compile(r'[A-Z]\w+')
pattern.findall(text)
['Tom', 'Mike']
----------------------------------------------------
s = '\\author:Kobe'
pattern = re.compile('\\author')
pattern.findall(s)
[]  # no match: the regex engine reads \a in \author as the bell character

pattern = re.compile('\\\\author')
pattern.findall(s)
['\\author']
pattern = re.compile(r'\\author')
pattern.findall(s)
['\\author']
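
To see why the raw string helps, it can be useful to compare the literals themselves. This is a small sketch built on the same `s` as above; the names `plain` and `raw` are introduced here for illustration only.

```python
import re

s = '\\author:Kobe'      # the actual text is \author:Kobe (one backslash)

plain = '\\\\author'     # ordinary literal: four backslashes typed
raw = r'\\author'        # raw literal: two backslashes typed

# Both literals produce the same 8-character pattern: \\author
print(plain == raw)      # True
print(len(raw))          # 8

# The regex engine then reads \\ as one literal backslash.
result = re.findall(raw, s)
print(result == ['\\author'])    # True
```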

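The match(string[, pos[, endpos]]) method returns a MatchObject (or None if there is no match): it matches at the start of the string, or starts at position pos and stops at position endpos when these are given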

pattern = re.compile(r'<html>')
text = '<html><head></head><body></body></html>'
pattern.match(text)
<_sre.SRE_Match object; span=(0, 6), match='<html>'>

text1 = ' <html><head></head><body></body></html>'
pattern.match(text1)  # None: text1 starts with a space
pattern.match(text1,1)
<_sre.SRE_Match object; span=(1, 7), match='<html>'>
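
The example above uses pos only. The sketch below adds endpos and also shows that pos does not move the position where ^ matches. The variable names are chosen here for illustration.

```python
import re

pattern = re.compile(r'<html>')
text1 = ' <html><head></head><body></body></html>'

m_start = pattern.match(text1)        # None: index 0 is a space
m_pos = pattern.match(text1, 1)       # matches at index 1
m_cut = pattern.match(text1, 1, 5)    # None: only text1[1:5] is considered

print(m_start, m_pos.span(), m_cut)

# pos is not the same as slicing: ^ still refers to the real start of the string.
anchored = re.compile(r'^<html>')
print(anchored.match(text1, 1))       # None
print(anchored.match(text1[1:]))      # a match object
```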

The search(string[, pos[, endpos]]) method looks for a match at any position and returns a MatchObject

text = "Tom is 8 years old. Mike is 23 years old"
p1 = re.compile(r'\d+')
p2 = re.compile(r'[A-Z]\w+')
p1.match(text)  # None: the text does not start with a digit
p2.match(text)
<_sre.SRE_Match object; span=(0, 3), match='Tom'>
p1.search(text)
<_sre.SRE_Match object; span=(7, 8), match='8'>
p2.search(text)
<_sre.SRE_Match object; span=(0, 3), match='Tom'>
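
Because both match and search return None when nothing is found, calling code usually checks the result before using it. A minimal sketch, where the helper `first_number` is a hypothetical name introduced for this example:

```python
import re

text = "Tom is 8 years old. Mike is 23 years old"
digits = re.compile(r'\d+')

def first_number(s):
    """Return the first run of digits in s as an int, or None if there is none."""
    m = digits.search(s)
    return int(m.group()) if m else None

print(first_number(text))            # 8
print(first_number('no numbers'))    # None

# match only succeeds at the start of the string, so it finds nothing here.
print(digits.match(text))            # None
```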

The finditer method is similar to findall: it finds every match, but returns an iterator of MatchObjects

it = p1.finditer(text)
for m in it:
    print(m)
    
<_sre.SRE_Match object; span=(7, 8), match='8'>
<_sre.SRE_Match object; span=(28, 30), match='23'>
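
Since finditer yields match objects rather than plain strings, the position of each match is available as well. A short sketch using the same text:

```python
import re

text = "Tom is 8 years old. Mike is 23 years old"
p1 = re.compile(r'\d+')

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in p1.finditer(text)]
print(found)    # [('8', (7, 8)), ('23', (28, 30))]
```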

(3) The Python re Module: MatchObject

text
'Tom is 8 years old. Mike is 23 years old'
pattern = re.compile(r'(\d+).*?(\d+)')
m = pattern.search(text)
m
<_sre.SRE_Match object; span=(7, 30), match='8 years old. Mike is 23'>
m.group()
'8 years old. Mike is 23'
m.group(0)
'8 years old. Mike is 23'
# the first captured group
m.group(1)
'8'
# the second captured group
m.group(2)
'23'
# start index of the first group
m.start(1)
7
# end index of the first group (exclusive)
m.end(1)
8
# start and end index of the first group
m.span(1)
(7, 8)
# start index of the second group
m.start(2)
28
# end index of the second group (exclusive)
m.end(2)
30

m.groups()
('8', '23')
type(m.groups())
<class 'tuple'>

-----------------------------------------------
text = 'i am a good teacher'
pattern = re.compile(r'(\w+) (\w+)')
pattern.findall(text)
[('i', 'am'), ('a', 'good')]
it = pattern.finditer(text)
for m in it:
    print(m.group())
    
i am
a good
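
The indices reported by start, end, and span can be used to slice the original string, and group accepts several group numbers at once. A brief sketch on the first example text:

```python
import re

text = 'Tom is 8 years old. Mike is 23 years old'
m = re.search(r'(\d+).*?(\d+)', text)

# Passing several group numbers returns a tuple.
pair = m.group(1, 2)
print(pair)    # ('8', '23')

# start/end are ordinary string indices, with end exclusive.
second = text[m.start(2):m.end(2)]
print(second)    # 23

# span(0) covers the whole match, the same text as group(0).
lo, hi = m.span(0)
print(text[lo:hi] == m.group(0))    # True
```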


(4) Groups

  1. Extracting information from a match (grouping also lets a quantifier apply to a whole sub-pattern)
re.search(r'ab+c','ababc')
<_sre.SRE_Match object; span=(2, 5), match='abc'>
re.search(r'(ab)+c','ababc')
<_sre.SRE_Match object; span=(0, 5), match='ababc'>
  2. Limiting the scope of alternation

    re.search(r'Cent(er|re)','Center')
    <_sre.SRE_Match object; span=(0, 6), match='Center'>
    re.search(r'Cent(er|re)','Centre')
    <_sre.SRE_Match object; span=(0, 6), match='Centre'>
    
  3. Reusing captured content inside the pattern (backreferences)

re.search(r'(\w+) \1','hello world')
re.search(r'(\w+) \1','hello hello world')
<_sre.SRE_Match object; span=(0, 11), match='hello hello'>
  4. Named groups
text = 'tom:98'
pattern = re.compile(r'(\w+):(\d+)')
m = pattern.search(text)
m.group()
'tom:98'
m.groups()
('tom', '98')
------------------------------------------------------
pattern = re.compile(r'(?P<name>\w+):(?P<score>\d+)')
p = pattern.search(text)
p.group()
'tom:98'
p.group('name')
'tom'
p.group('score')
'98'
--------------------------------------------------------
re.search(r'(?P<name>\w+) (?P=name)','hello hello world')
<_sre.SRE_Match object; span=(0, 11), match='hello hello'>
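
Named groups combine naturally with finditer and groupdict. The sketch below assumes a made-up input string `records` holding several name:score pairs; it is an illustration, not part of the original post.

```python
import re

pattern = re.compile(r'(?P<name>\w+):(?P<score>\d+)')

# Hypothetical input with more than one record.
records = 'tom:98 ann:87'

# Build a name -> score mapping from every match.
scores = {m.group('name'): int(m.group('score'))
          for m in pattern.finditer(records)}
print(scores)    # {'tom': 98, 'ann': 87}

# groupdict() returns all named groups of one match as a dict.
first = pattern.search(records).groupdict()
print(first)     # {'name': 'tom', 'score': '98'}
```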

(5) String Operations

split

text = 'Beautiful is better than ugly.\nExplicit is better than implicit'
p = re.compile(r'\n')
p.split(text)
['Beautiful is better than ugly.', 'Explicit is better than implicit']
-----------------------------------------------------------------------
re.split(r'\W','good morning')
['good', 'morning']
re.split(r'-','good-morning')
['good', 'morning']
re.split(r'(-)','good-morning-hello')
['good', '-', 'morning', '-', 'hello']
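
Two details of split are easy to miss: adjacent separators produce empty strings, and maxsplit limits the number of splits. A small sketch with made-up input strings:

```python
import re

# \W splits on every single non-word character, so ', ' yields an empty item.
single = re.split(r'\W', 'good, morning')
print(single)    # ['good', '', 'morning']

# \W+ treats a run of separators as one.
runs = re.split(r'\W+', 'good, morning; hello')
print(runs)      # ['good', 'morning', 'hello']

# maxsplit=1 stops after the first split and keeps the rest intact.
limited = re.split(r'-', 'good-morning-hello', maxsplit=1)
print(limited)   # ['good', 'morning-hello']
```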

sub

ords = 'ord000\nord001\nord003'
re.sub(r'\d+','-',ords)
'ord-\nord-\nord-'
---------------------------------------------
text = 'Beautiful is *better* than ugly.'
re.sub(r'\*(.*?)\*',r'<strong>\g<1></strong>',text)
'Beautiful is <strong>better</strong> than ugly.'
re.sub(r'\*(?P<name>.*?)\*',r'<strong>\g<name></strong>',text)
'Beautiful is <strong>better</strong> than ugly.'
-----------------------------------------------
ords = 'ord000\nord001\nord003'
re.sub(r'([a-z]+)(\d+)',r'\g<2>-\g<1>',ords)
'000-ord\n001-ord\n003-ord'

re.subn(r'([a-z]+)(\d+)',r'\g<2>-\g<1>',ords)
('000-ord\n001-ord\n003-ord', 3)
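
Besides a replacement string, re.sub also accepts a function that receives each match object, and a count argument that limits how many replacements are made. The sketch below uses the same `ords` string; the helper `bump` is a hypothetical name introduced for this example.

```python
import re

ords = 'ord000\nord001\nord003'

def bump(m):
    """Return the same prefix with the number increased by one, zero-padded to 3 digits."""
    return '{}{:03d}'.format(m.group(1), int(m.group(2)) + 1)

bumped = re.sub(r'([a-z]+)(\d+)', bump, ords)
print(repr(bumped))    # 'ord001\nord002\nord004'

# count=1 replaces only the first occurrence.
once = re.sub(r'\d+', '-', ords, count=1)
print(repr(once))      # 'ord-\nord001\nord003'
```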

(6) Compilation Flags

re.I: ignore case

text = 'python Python PYTHON'
re.findall(r'python',text)
['python']
re.findall(r'python',text,re.I)
['python', 'Python', 'PYTHON']

re.M: multi-line mode (^ and $ also match at the start and end of each line)

re.findall(r'^<html>','\n<html>')
[]
re.findall(r'^<html>','\n<html>',re.M)
['<html>']

re.S: . matches any character, including \n

re.findall(r'\d(.)','1\ne',re.S)
['\n']
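
Flags can be combined with the | operator. The following sketch, on a made-up three-line string, shows how the result changes as flags are added:

```python
import re

text = 'Python\npython\nPYTHON'

# No flags: ^ and $ refer to the whole string, so nothing matches.
none = re.findall(r'^python$', text)
print(none)        # []

# re.M alone: each line is tested, but case still matters.
multi = re.findall(r'^python$', text, re.M)
print(multi)       # ['python']

# re.I | re.M: every line matches regardless of case.
both = re.findall(r'^python$', text, re.I | re.M)
print(both)        # ['Python', 'python', 'PYTHON']

# re.S: the dot also matches the newline between 1 and 2.
print(re.findall(r'1.2', '1\n2'))          # []
print(re.findall(r'1.2', '1\n2', re.S))    # ['1\n2']
```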

(7) Module-Level Operations

Clearing the regular expression cache

re.purge()

Escaping special characters

re.escape()

re.findall(r'^','^python^')
['']
re.findall(re.escape('^p'),'^python^')
['^p']
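
A typical use of re.escape is searching for text that should be taken literally, such as a string supplied by a user. The sketch below assumes hypothetical `needle` and `haystack` values to show the difference:

```python
import re

# Hypothetical search term and text.
needle = '1+1=2'
haystack = 'we know 1+1=2, but 11=2 is wrong'

# Unescaped, + is a quantifier, so the pattern means "one or more 1, then 1=2".
unescaped = re.findall(needle, haystack)
print(unescaped)    # ['11=2']

# Escaped, every character of needle is matched literally.
escaped = re.findall(re.escape(needle), haystack)
print(escaped)      # ['1+1=2']
```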
posted @ 2020-01-08 23:10  sowhat1943