Basic usage examples of regular expressions with Python's re module

Regular Expressions

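A regular expression is a small, self-contained pattern language for advanced string operations such as matching, extraction, and substitution. Any language that supports regular expressions can use them for these tasks. In Python they are available through the re module of the standard library.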

import re

text = 'the man whose name is written in this note shall die'
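# note: re.match(pattern, string, flags=0)
# match tries the pattern at the start of string only and returns as soon as the pattern is satisfied; it does not require the whole string to match. To require that, end the pattern with $, e.g.
print(re.match('the', text).group()) # [out] : the
# group() (the same as group(0)) returns the whole match. Each pair of parentheses in the pattern defines a numbered group; group(n) returns the n-th one and groups() returns all of them.
print(re.match('the$', text)) # [out] : None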

the
None
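
The difference between a prefix match and a full match can also be shown with re.fullmatch. This is a minimal sketch, assuming Python 3.4 or later; re.fullmatch and the variable names below are not part of the original example.

```python
import re

text = 'the man whose name is written in this note shall die'

# re.match only requires the pattern to match at the start of the string.
prefix_hit = re.match(r'the', text)

# re.fullmatch requires the pattern to cover the entire string.
whole_miss = re.fullmatch(r'the', text)
whole_hit = re.fullmatch(r'the.*die', text)

print(prefix_hit.group())     # the
print(whole_miss)             # None
print(whole_hit is not None)  # True
```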

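# The dot matches any character except a newline (with re.DOTALL it matches \n as well)
pattern = 'n.me'
print(re.match(pattern, text)) # None
# match only tries at the start of the string, so nothing is found; use another function
# findall scans the whole string and returns all non-overlapping matches as a list
# note: re.findall(pattern, string, flags=0)
print(type(re.findall(pattern, text))) # list
print(re.findall(pattern, text)[0])

None
<class 'list'>
name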
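
The remark about DOTALL can be checked with a short sketch. The two-line sample string is an assumption made up for illustration.

```python
import re

two_lines = 'first line\nsecond line'

# By default the dot does not match the newline between the two lines.
default_hit = re.search(r'line.second', two_lines)

# With re.DOTALL (alias re.S) the dot matches '\n' as well.
dotall_hit = re.search(r'line.second', two_lines, flags=re.DOTALL)

print(default_hit)               # None
print(repr(dotall_hit.group()))  # 'line\nsecond'
```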

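# A backslash escapes a metacharacter so that it is matched literally. Write the pattern as a raw string (r'...') so that Python itself leaves the backslash alone
txt1 = 'this+is+a*test'
ptn1 = r'is\+'
print(re.findall(ptn1, txt1))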

['is+', 'is+']
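
When the literal text comes from a variable, escaping by hand is error-prone. A possible alternative is re.escape, sketched here; the variable names are illustrative only.

```python
import re

txt1 = 'this+is+a*test'
literal = 'a*'  # text we want to find exactly as written

# re.escape puts a backslash in front of the regex metacharacter '*'.
escaped = re.escape(literal)
literal_hits = re.findall(escaped, txt1)

print(escaped)       # a\*
print(literal_hits)  # ['a*']
```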

# * and +: the asterisk repeats the preceding character 0 or more times, the plus sign 1 or more times
txt2 = 'aababbabbb'
ptn21 = 'ab*'
ptn22 = 'ab+'
print(re.findall(ptn21, txt2)) # ['a', 'ab', 'abb', 'abbb']
print(re.findall(ptn22, txt2)) # ['ab', 'abb', 'abbb']

['a', 'ab', 'abb', 'abbb']
['ab', 'abb', 'abbb']

# ? matches the preceding character 0 or 1 times
txt3 = 'abcmnabcdpqabqabcdd'
ptn3 = 'abcd?'
print(re.findall(ptn3, txt3)) # ['abc', 'abcd', 'abcd']
# A ? placed after a quantifier makes it non-greedy: a greedy quantifier takes as many repetitions as possible, a non-greedy one as few as possible
ptn3greedy = 'abcd*'
ptn3nogreedy = 'abcd*?'
print(re.findall(ptn3greedy, txt3)) # ['abc', 'abcd', 'abcdd'] takes as many d's as possible
print(re.findall(ptn3nogreedy, txt3)) # ['abc', 'abc', 'abc'] takes as few d's as possible (zero here)

['abc', 'abcd', 'abcd']
['abc', 'abcd', 'abcdd']
['abc', 'abc', 'abc']
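
Greedy versus non-greedy matching matters most when a delimiter occurs several times. The following sketch uses a made-up tag-like string to show the effect.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string, so there is a single match.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```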

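# ^ and $: by default the caret matches at the start of the string and the dollar sign at the end (or just before a trailing newline); with re.MULTILINE they match at the start and end of every line
txt4 = 'machine learning is fun \ndeep learning is not\nbut'
print(txt4)
ptn41 = '^mac'
ptn42 = '.t' # no anchor here: any character followed by t
print(re.findall(ptn41, txt4)) # ['mac']
print(re.findall(ptn42, txt4)) # ['ot', 'ut']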

machine learning is fun 
deep learning is not
but
['mac']
['ot', 'ut']
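
How the anchors behave with and without re.MULTILINE can be sketched on the same string. The patterns below are illustrative and not taken from the original.

```python
import re

txt4 = 'machine learning is fun \ndeep learning is not\nbut'

# Without flags, ^ only matches at the very start of the string.
first_word = re.findall(r'^\w+', txt4)
# With re.MULTILINE (alias re.M), ^ matches at the start of every line.
line_starts = re.findall(r'^\w+', txt4, flags=re.MULTILINE)

# Without flags, $ only matches at the end of the string.
string_end = re.findall(r'.t$', txt4)
# With re.MULTILINE, $ matches at the end of every line.
line_ends = re.findall(r'.t$', txt4, flags=re.MULTILINE)

print(first_word)   # ['machine']
print(line_starts)  # ['machine', 'deep', 'but']
print(string_end)   # ['ut']
print(line_ends)    # ['ot', 'ut']
```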

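# | means alternation (logical or): it matches either the expression on its left or the one on its right. Inside parentheses it is limited to that group; without parentheses it splits the whole pattern
ptn5 = 'ee|ea'
print(re.findall(ptn5, txt4)) # ['ea', 'ee', 'ea']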

['ea', 'ee', 'ea']
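
One detail worth knowing is that alternatives are tried from left to right, and the first one that matches wins even if a later one would match more text. A small sketch with a made-up sample string:

```python
import re

sample = 'deeper learning'

# 'deep' is listed first and matches, so 'deeper' is never tried.
short_first = re.findall(r'deep|deeper', sample)

# Listing the longer alternative first lets it match.
long_first = re.findall(r'deeper|deep', sample)

print(short_first)  # ['deep']
print(long_first)   # ['deeper']
```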

# {m,n} repeats the preceding character between m and n times, {m} exactly m times, {m,} at least m times
txt6 = 'stop!stoop!stooop!stoooop!!!'
ptn61 = 'sto{2,3}p'
ptn62 = 'sto{1}p'
ptn63 = 'sto{2,}p'
print(re.findall(ptn61, txt6)) # ['stoop', 'stooop']
print(re.findall(ptn62, txt6)) # ['stop']
print(re.findall(ptn63, txt6)) # ['stoop', 'stooop', 'stoooop']

['stoop', 'stooop']
['stop']
['stoop', 'stooop', 'stoooop']

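# Parentheses () form a group, which is then treated as a single unit
txt7 = 'sherlocklecklackwatsonon'
ptn71 = '(on){2}'
ptn72 = 'l(e|o)ck'
print(re.findall(ptn71, txt7)) # ['on']
print(re.findall(ptn72, txt7)) # ['o', 'e']
# Note: when the pattern contains a group, findall returns the group contents instead of the whole match.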

['on']
['o', 'e']
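
If the whole match is wanted instead of the group contents, two common options are a non-capturing group (?:...) or re.finditer with group(0). Both are sketched below on the same string; neither appears in the original example.

```python
import re

txt7 = 'sherlocklecklackwatsonon'

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'l(e|o)ck', txt7)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'l(?:e|o)ck', txt7)

# finditer yields match objects, so group(0) is always the whole match.
via_iter = [m.group(0) for m in re.finditer(r'l(e|o)ck', txt7)]

print(captured)  # ['o', 'e']
print(whole)     # ['lock', 'leck']
print(via_iter)  # ['lock', 'leck']
```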

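Next, some uses of the backslash. In general, a backslash removes the special meaning of a metacharacter and gives a special meaning to certain ordinary characters. A backslash followed by a number refers back to the text matched by the group with that number. Here is an example of such a backreference first: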

txt8 = 'howtothinkaboutthinkthatisaquestion'
ptn8 = r'(thin.)(about)\1'
print(re.findall(ptn8, txt8)) # [('think', 'about')]
print(re.search(ptn8, txt8).group()) # thinkaboutthink

[('think', 'about')]
thinkaboutthink
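
A typical practical use of a backreference is finding an accidentally doubled word. This is a sketch on a made-up sentence; re.sub, used for the replacement, is not covered in the examples above.

```python
import re

sentence = 'this is is a test test sentence'

# (\w+) captures a word; \1 requires the same word again after whitespace.
# The \b anchors keep the pattern from matching inside a longer word.
doubled_word = r'\b(\w+)\s+\1\b'

doubled = re.findall(doubled_word, sentence)
collapsed = re.sub(doubled_word, r'\1', sentence)

print(doubled)    # ['is', 'test']
print(collapsed)  # this is a test sentence
```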

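Escapes that give ordinary characters a special meaning include the predefined character classes (\d, \s, \w and their negations) and zero-width anchors (\A, \Z, \b). The common ones are: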

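# \d a digit, equivalent to [0-9] for ASCII text
# \D a non-digit, [^0-9]
# \s a whitespace character: space, tab, newline, etc.
# \S a non-whitespace character
# \w a letter, digit, or underscore, [A-Za-z0-9_] for ASCII text
# \W the negation of \w
# \A start of the string; the same as ^ unless re.MULTILINE is set
# \Z end of the string; unlike $, it does not match before a trailing newline
# \b a word boundary (zero-width: the position between a \w character and a non-\w character)
txt9 = ' abc babc abcd mabcn '
eptn1 = r'\babc\b'
eptn3 = r'\babc' # word boundary (here a space) on the left
eptn4 = r'abc\b' # word boundary (here a space) on the right
print(re.findall(eptn1, txt9))
print(re.findall(eptn3, txt9))
print(re.findall(eptn4, txt9))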

['abc']
['abc', 'abc']
['abc', 'abc']
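
The r prefix on these patterns matters because Python processes escape sequences in ordinary string literals before re ever sees them. A short sketch of what happens without it:

```python
import re

txt9 = ' abc babc abcd mabcn '

# In an ordinary string '\b' is one character (backspace, \x08);
# in a raw string it stays as the two characters backslash and b.
plain_len = len('\b')
raw_len = len(r'\b')

# The non-raw pattern looks for literal backspace characters and finds nothing.
without_raw = re.findall('\babc\b', txt9)
with_raw = re.findall(r'\babc\b', txt9)

print(plain_len, raw_len)  # 1 2
print(without_raw)         # []
print(with_raw)            # ['abc']
```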

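# compile: re.compile(pattern, flags=0)
# returns a compiled pattern object. Commonly used flags:
# re.I (IGNORECASE) case-insensitive matching
# re.S (DOTALL) the dot also matches a newline
# re.U (UNICODE) makes \w, \b, etc. follow Unicode rules (already the default for str patterns in Python 3)
search_sb = re.compile(r'\bsb\b', flags=re.IGNORECASE) # NOTICE: patterns containing escapes such as \b or \d should be raw strings (r'...'); in an ordinary string '\b' is a backspace character
TEXT = 'HE IS SB, that man is a big sb man'
print(search_sb.findall(TEXT))
print(search_sb.search(TEXT))
# search returns the same kind of value as match: a match object (or None if nothing matches). The four calls below are methods of that object
print(search_sb.search(TEXT).group())
print(search_sb.search(TEXT).end())
print(search_sb.search(TEXT).start())
print(search_sb.search(TEXT).span())

['SB', 'sb']
<re.Match object; span=(6, 8), match='SB'>
SB
8
6
(6, 8)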
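
A compiled pattern can also carry groups, and the match object then exposes them by number or by name. The sketch below uses a hypothetical HH:MM pattern and sample string; named groups (?P<name>...) are an addition not covered in the original text.

```python
import re

# Hypothetical pattern: two digits, a colon, two digits, with named groups.
time_re = re.compile(r'(?P<hour>\d{2}):(?P<minute>\d{2})')

m = time_re.search('posted at 22:54')

whole = m.group()          # the entire match
hour = m.group('hour')     # a group looked up by name
parts = m.groups()         # all groups as a tuple
where = m.span()           # (start, end) indices of the match

print(whole)  # 22:54
print(hour)   # 22
print(parts)  # ('22', '54')
print(where)  # (10, 15)
```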

# Some commonly used regular expressions:
# Email address: ^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
# Domain name: [a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+\.?
# Internet URL: [a-zA-Z]+://[^\s]* or ^http://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?$
# Mobile phone number (mainland China): ^(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}$
# Date format: ^\d{4}-\d{1,2}-\d{1,2}
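
As a usage sketch, the email pattern from the list can be compiled once and wrapped in a small helper. The helper name and the sample inputs are hypothetical, and the pattern is only a rough format check, not a full validator.

```python
import re

# Email pattern copied from the list above.
EMAIL_RE = re.compile(r'^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$')

def looks_like_email(candidate):
    """Return True if candidate has the rough shape of an email address."""
    return EMAIL_RE.match(candidate) is not None

print(looks_like_email('first.last@example.com'))  # True
print(looks_like_email('not-an-email'))            # False
```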

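Summary:
The re module has a few more functions (for example sub, split, and finditer); together with the ones above, here is a brief recap. match, as described earlier, matches from the start of the string without requiring a full match. search scans the string and returns a match object for the first match it finds. findall also scans the string but returns all matches as a list. compile builds a reusable pattern object, which can improve efficiency when the same pattern is used many times.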
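
For the functions not shown earlier, here is a brief sketch using the sentence from the first example. The patterns are illustrative choices, not from the original.

```python
import re

text = 'the man whose name is written in this note shall die'

# split: cut the string wherever the pattern matches.
words = re.split(r'\s+', text)

# sub: replace every match with the given text.
rewritten = re.sub(r'\bdie\b', 'live', text)

# finditer: like findall, but yields match objects with position information.
n_words = [(m.group(), m.start()) for m in re.finditer(r'\bn\w+', text)]

print(words[:3])   # ['the', 'man', 'whose']
print(rewritten)   # the man whose name is written in this note shall live
print(n_words)     # [('name', 14), ('note', 38)]
```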

posted @ 2018-02-27 22:54 毛利小九郎
