The Road to Python: Regular Expressions

Python 3 Regular Expressions

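Preface:
(1) Text processing has become one of the main tasks of computers.
(2) Searching text for content that follows a fixed pattern is a common text-processing job.
(3) Regular expressions were developed to handle this quickly and conveniently, and have grown into a standalone technique used by many languages.

1. Definition:

A regular expression is an advanced text-matching pattern that supports searching, replacing and similar operations. In essence it is a string made up of ordinary characters and special symbols. The string describes characters and how they repeat, so it can match a whole set of strings that share some feature.

2. Goals

(1) Be fluent in the regular expression symbols and how they are used

(2) Be able to read regular expressions correctly and use them for simple matching

(3) Be able to work with regular expressions through Python's re module

3. Features of regular expressions:

(1) Convenient for searching and modifying text

(2) Supported by many languages

(3) Flexible, with many ways to write a pattern

(4) Typical uses: text processing, storing a class of strings in mongo, URL routing in django and tornado, text matching in crawlers

Regex rules and usage

Import the re module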

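findall(pattern, string, flags=0)

Purpose: match a string against a regular expression

Parameters: pattern: the regular expression

    string: the target string

Return value: a list of the matched content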
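As a minimal sketch of the call described above (the sample text and variable names are made up for illustration), findall scans left to right and collects every non-overlapping match:

```python
import re

text = "cat bat rat"

# Every non-overlapping occurrence of the pattern is returned, in order.
matches = re.findall("at", text)
print(matches)  # ['at', 'at', 'at']

# When nothing matches, the result is an empty list rather than None.
no_matches = re.findall("dog", text)
print(no_matches)  # []
```

Because the result is always a list, it can be tested with a plain `if matches:` without checking for None first.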

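Metacharacters (characters that have a special meaning in a regular expression)

*Ordinary characters

Metacharacter: abc

Rule: matches the corresponding ordinary characters

eg: ab ---> abcdef : ab

*Alternation: match any of several regexes

Metacharacter: |

Rule: matches the regex on either side of the symbol

eg: ab|fg ---> absrgerfg : ab fg

*Matching any single character

Metacharacter: .

Rule: matches any one character except '\n'

eg: f.o ---> foo fuo fao f@o

*Matching the start of a string

Metacharacter: ^

Rule: matches the position at the start of a string

eg: ^Hello ---> Hello world : Hello

*Matching the end of a string

Metacharacter: $

Rule: matches the position at the end of a string

eg: py$ ---> Hello.py : py

*Matching zero or more repetitions

Metacharacter: *

Rule: matches the preceding regex zero or more times

eg: ab* ---> a ab abbbb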

>>> re.findall('ab','absgerewrgabsgre')
['ab', 'ab']
>>> re.findall('ab|fg','gwrgergabrgefg')
['ab', 'fg']
>>> re.findall('f.o','foofaoagref@o')
['foo', 'fao', 'f@o']
>>> re.findall('^H','Hello world')
['H']
>>> re.findall('^Hello','Hello world')
['Hello']
>>> re.findall('py$','hello.py')
['py']
>>> re.findall('py$','python')
[]
>>> re.findall('ab*','absgewraggerweabbbbbgrergbgreeab')
['ab', 'a', 'abbbbb', 'ab']
>>> re.findall('.*py$','hello.py')
['hello.py']
>>> re.findall('.*py$','hellopy')
['hellopy']
>>> re.findall('.*py$','hello.py')
['hello.py']
>>>


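*Matching one or more repetitions

Metacharacter: +

Rule: matches the preceding regex at least once

eg: >>> re.findall('ab+','absgweweabagwerabbbbb')
['ab', 'ab', 'abbbbb']

*Matching zero or one repetition

Metacharacter: ?

Rule: matches the preceding regex zero times or once

eg: >>> re.findall('ab?','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'ab']

*Matching an exact number of repetitions

Metacharacter: {N}

Rule: matches the preceding regex exactly N times

eg: ab{3} ---> abbb

*Matching a range of repetitions

Metacharacter: {M,N}

Rule: matches the preceding regex between M and N times, inclusive; write it without a space after the comma

eg: >>> re.findall('ab{3,5}','abbsgweabbbweabbbbagwerabbbbb')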
['abbb', 'abbbb', 'abbbbb']

>>> import re
>>> re.findall('ab*','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'abbbbb']
>>> re.findall('ab+','absgweweabagwerabbbbb')
['ab', 'ab', 'abbbbb']
>>> re.findall('.+\.py$','hello.py')
['hello.py']
>>> re.findall('.+\.py$','h.py')
['h.py']
>>> re.findall('ab?','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'ab']
>>> re.findall('ab{3}','absgweweabagwerabbbbb')
['abbb']

>>> re.findall('.{8}','absgweweabagwerabbbbb')
['absgwewe', 'abagwera']
>>>
>>> re.findall('ab{3,5}','abbsgweabbbweabbbbagwerabbbbb')
['abbb', 'abbbb', 'abbbbb']
>>> re.findall('.{4,6}','absgweweabagwerabbbbb')
['absgwe', 'weabag', 'werabb']
>>>
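The brace quantifier has two open-ended forms that the transcript does not exercise. The sketch below uses made-up sample text and also shows what happens, in CPython's re module, when a space slips in after the comma:

```python
import re

text = "a ab abb abbbb"

# {M,} means "at least M"; {,N} means "at most N", counting from zero.
at_least_two = re.findall(r"ab{2,}", text)
at_most_two = re.findall(r"ab{,2}", text)
print(at_least_two)  # ['abb', 'abbbb']
print(at_most_two)   # ['a', 'ab', 'abb', 'abb']

# With a space after the comma the braces are no longer a quantifier,
# so the pattern only matches the literal text "ab{3, 5}".
spaced = re.findall(r"ab{3, 5}", "abbbb ab{3, 5}")
print(spaced)  # ['ab{3, 5}']
```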


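*Character sets

Metacharacter: [abcd]

Rule: matches any one character from the set, or from a range, given inside the brackets

eg: [abcd] ---> a b c d

    [0-9] ---> 1, 2, 3, ... matches any one digit

    [A-Z] ---> A, B, C, ... matches any one uppercase letter

    [a-z] ---> a, b, c, ... matches any one lowercase letter

Several sets and ranges can be written together. A literal '-' must come first or last in the set, or be escaped, otherwise it is read as a range:

[-+*/0-9a-g]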

>>> re.findall('^[A-Z][0-9a-z]{5}','Hello1 Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1 Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1join')
['Hello1join']
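Where the hyphen sits inside a character set matters. The following sketch (variable names are illustrative) contrasts a set with the hyphen placed first against one where it sits between '+' and '*', which the re module rejects as a reversed range:

```python
import re

# Placing '-' first (or escaping it) makes it a literal hyphen.
operators_and_digits = re.compile(r"[-+*/0-9a-g]")
found = operators_and_digits.findall("3+4-b*z")
print(found)  # ['3', '+', '4', '-', 'b', '*']

# An unescaped '-' between '+' and '*' is read as the range '+' to '*',
# which runs backwards in character order, so compilation fails.
try:
    re.compile(r"[+-*/0-9a-g]")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True
print(bad_range_rejected)  # True
```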

*Negated character sets

Metacharacter: [^...]

Rule: matches any one character that is not in the set

eg: [^abcd] ---> e f & #

>>> re.findall('[^_0-9a-zA-Z]','helo@163.com')
['@', '.']

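*Matching any digit (or non-digit) character

Metacharacter: \d (same as [0-9] for ASCII text), \D (same as [^0-9])

Rule: \d matches any one digit character; \D matches any one non-digit character

eg: >>> re.findall('1\d{10}','13523538796')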
['13523538796']

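*Matching any word (or non-word) character

Metacharacter: \w (roughly [_0-9a-zA-Z]), \W (roughly [^_0-9a-zA-Z])

Rule: \w matches a digit, letter or underscore; \W matches anything else. For str patterns in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.

eg: >>> re.findall('[A-Z]\w*','Hello World')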
['Hello', 'World']
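In Python 3, \w on a str pattern is wider than [_0-9a-zA-Z]. A small sketch with made-up text (the accented letters are written as escapes so the source stays ASCII):

```python
import re

text = "na\u00efve_user42 caf\u00e9"  # "naïve_user42 café"

# By default \w follows Unicode, so accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_] and the words break apart.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)  # ['na', 've_user42', 'caf']
```

If a pattern is meant to accept only ASCII identifiers, passing re.ASCII or spelling the set out explicitly avoids surprises.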

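*Matching any whitespace (or non-whitespace) character

Metacharacter: \s \S

Rule: \s matches any whitespace character, [ \t\n\r\f\v]: space, tab, newline, carriage return, form feed, vertical tab

      \S matches any non-whitespace character

eg: >>> re.findall('hello\s+\S+','hello l&#y hello lucy helloksge')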
['hello l&#y', 'hello lucy']

>>> re.findall('1\d{10}','13523538796')
['13523538796']
>>> re.findall('\w*','Hello World')
['Hello', '', 'World', '']
>>> re.findall('[A-Z]\w*','Hello World')
['Hello', 'World']
>>> re.findall('[a-z]*-[0-9]{2}','wangming-20')
['wangming-20']
>>> re.findall('\w+-\d+','wangming-20')
['wangming-20']
>>> re.findall('\w+.\d+','wangming-20')
['wangming-20']
>>> re.findall('hello \w+','hello lily hello lucy helloksge')
['hello lily', 'hello lucy']
>>> re.findall('hello \w+','hello lily hello   lucy helloksge')
['hello lily']
>>> re.findall('hello\s+\w+','hello lily hello   lucy helloksge')
['hello lily', 'hello   lucy']
>>> re.findall('hello\s+\S','hello l&#y hello   lucy helloksge')
['hello l', 'hello   l']
>>> re.findall('hello\s+\S+','hello l&#y hello   lucy helloksge')
['hello l&#y', 'hello   lucy']
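To check which characters \s really covers, the sketch below tests each ASCII whitespace character individually, and also confirms that the null character is not one of them (the list name is illustrative):

```python
import re

whitespace_samples = [" ", "\t", "\n", "\r", "\f", "\v"]

# Each of the six ASCII whitespace characters is matched by \s.
all_match = all(re.fullmatch(r"\s", ch) is not None for ch in whitespace_samples)

# The null character '\0' is not whitespace, so \s does not match it.
null_matches = re.fullmatch(r"\s", "\0") is not None

print(all_match)     # True
print(null_matches)  # False
```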


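*Matching the start and end of the string

Metacharacter: \A (like ^), \Z (like $)

Rule: \A matches the position at the start of the string

      \Z matches the position at the very end of the string

eg: \Aabc\Z ---> abc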
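\Z and $ are close but not identical in Python: $ also matches just before a newline that ends the string, while \Z only matches at the very end. A short sketch with a made-up input line:

```python
import re

line = "abc\n"  # e.g. a line read from a file, newline still attached

# '$' tolerates one trailing newline.
with_dollar = re.findall(r"abc$", line)

# '\Z' insists on the true end of the string.
with_z = re.findall(r"abc\Z", line)

print(with_dollar)  # ['abc']
print(with_z)       # []
```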

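*Matching a word boundary (or non-boundary)

Metacharacter: \b \B

Rule: \b matches at a word boundary

      \B matches at a position that is not a word boundary

A word boundary is the junction between a word character (digit, letter or underscore) and any other character, or the start or end of the string.

eg: >>> re.findall(r'\bis\b','This is a test')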
['is']

>>> re.findall('\Aabc\Z','abcabc')
[]
>>> re.findall('\Aabc\Z','abbc')
[]
>>> re.findall('\Aabc\Z','abc')
['abc']
>>> re.findall('\Aabc\Z','abc abc')
[]
>>> re.findall('\Aabc','abc abc')
['abc']
>>> re.findall('abc\Z','abc abc')
['abc']
>>> re.findall('abc\Z','abcsgeraqabc')
['abc']
>>> re.findall('\Aabc\Z','abcsgaberaqabc')
[]
>>> re.findall('is','This is a test')
['is', 'is']
>>> re.findall('\bis\b','This is a test')
[]
>>> re.findall(r'\bis\b','This is a test')
['is']
>>> re.findall(r'\b86\b','10086 1008612')
[]
>>> re.findall(r'\b10086\b','10086 1008612')
['10086']
>>> re.findall(r'\Bis','This is a test')
['is']
>>>


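Metacharacter summary:

Ordinary characters: match the literal characters

Matching a single character: . \d \D \w \W \s \S [...] [^...]

Matching repetition: * + ? {N} {M,N}

Matching a position in the string: ^ $ \A \Z \b \B

Other: |

Raw strings and escaping

Characters that need escaping to be matched literally: . * + ? ^ $ | " " ' ' [ ] ( ) { } \

r ----> turns a string literal into a raw string,

in which Python does not process backslash escapes

Two equivalent ways of writing the same pattern: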

>>> re.findall('\\? \\* \\\\','what? * \\')
['? * \\']
>>> re.findall(r'\? \* \\','what? * \\')
['? * \\']
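The reason raw strings matter can be seen by looking at what Python hands to the re module. In this sketch (variable names are illustrative), the non-raw literal turns each \b into a single backspace character before the regex engine ever sees it:

```python
import re

plain_pattern = "\bis\b"   # Python converts each \b to a backspace character
raw_pattern = r"\bis\b"    # backslashes are kept for the regex engine

print(len(plain_pattern))  # 4: backspace, 'i', 's', backspace
print(len(raw_pattern))    # 6: '\', 'b', 'i', 's', '\', 'b'

plain_result = re.findall(plain_pattern, "This is a test")
raw_result = re.findall(raw_pattern, "This is a test")
print(plain_result)  # []
print(raw_result)    # ['is']
```

Writing every pattern as a raw string is a simple habit that avoids this class of mistake.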

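Greedy and non-greedy matching

This concerns the repetition metacharacters:

* + ? {m,n}

Greedy mode:

    When a repetition metacharacter (* + ? {m,n}) is used, it always matches as much as possible going forward. This is greedy mode, and it is the default.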

>>> re.findall('ab*','abbbbasgerab')
['abbbb', 'a', 'ab']
>>> re.findall('ab+','abbbbasgerab')
['abbbb', 'ab']
>>> re.findall('ab?','abbbbasgerab')
['ab', 'a', 'ab']
>>> re.findall('ab{3,5}','abbbbasgerab')
['abbbb']

Non-greedy mode:

    Match as little as possible, just enough to satisfy the regex.

Greedy ---> non-greedy: *? ?? +? {m,n}?

>>> re.findall('ab??','abbbbasgerab')
['a', 'a', 'a']
>>> re.findall('ab{3,5}?','abbbbasgerabb')
['abbb']
>>> re.findall('ab+?','abbbbasgerab')
['ab', 'ab']
>>> re.findall('ab*?','abbbbasgerab')
['a', 'a', 'a']
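The difference between the two modes shows most clearly when a pattern has a closing delimiter. A minimal sketch on a made-up snippet of markup:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```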

Regular expression groups

Use () to group part of a regular expression

(ab)cde : makes ab a subgroup

>>> re.match('(ab)cdef','abcdefghig').group()

>>> re.match('(ab)cdef','abcdefghig')
<_sre.SRE_Match object; span=(0, 6), match='abcdef'>
>>> re.match('(ab)cdef','abcdefghig').group()
'abcdef'
>>> re.match('(ab)cdef','cabcdefghig').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>>
(match only succeeds when the pattern matches at the start of the string)


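1. Subgroups are written with (); adding a subgroup does not change what the whole pattern matches.

2. A regular expression can contain several subgroups. They are numbered first, second, third, ... from outside to inside and from left to right, that is, by the position of the opening parenthesis.

((ab)cd(ef)) has 3 subgroups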

>>> re.match('(ab)cdef','abcdefghig').group()
'abcdef'
>>> re.match('(ab)cdef','abcdefghig').group(1)
'ab'
>>> re.match('(ab)cd(ef)','abcdefghig').group(1)
'ab'
>>> re.match('(ab)cd(ef)','abcdefghig').group(2)
'ef'
>>> re.match('((ab)cd(ef))','abcdefghig').group(2)
'ab'
>>> re.match('((ab)cd(ef))','abcdefghig').group(1)
'abcdef'
>>> re.match('((ab)cd(ef))','abcdefghig').group(2)
'ab'
>>> re.match('((ab)cd(ef))','abcdefghig').group(3)
'ef'
>>> re.match('((ab)cd(ef))','abcdefghig').group()
'abcdef'
>>>


3. A subgroup is a unit in its own right; many functions can extract the value of a subgroup separately.

>>> re.match('(ab)cdef','abcdefghig').group(1)
'ab'
4. A subgroup can change how repetition applies: the whole subgroup is repeated as a unit.

>>> re.match('(ab)*','ababababab').group()
'ababababab'
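To see what the parentheses change, the sketch below runs the same input through a grouped and an ungrouped pattern (variable names are illustrative):

```python
import re

text = "ababab"

# With the group, * repeats the whole unit "ab".
grouped = re.match(r"(ab)*", text).group()

# Without it, * only repeats the single character "b".
ungrouped = re.match(r"ab*", text).group()

# A repeated group keeps only the text of its last repetition.
last_repetition = re.match(r"(ab)*", text).group(1)

print(grouped)          # ababab
print(ungrouped)        # ab
print(last_repetition)  # ab
```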

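Named and unnamed groups

Format: (?P<name>regex)

(?P<word>ab)cdef

(1) Some functions can extract a subgroup's content by name, or build key-value pairs from the names.

    >>> re.match('(?P<word>ab)cdef','abcdefghi').group()
        'abcdef'
(2) A named subgroup can be referred to again by its name, which matches the same text the group matched:

(?P=name)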

　　>>> re.match('(?P<word>ab)cdef(?P=word)','abcdefabghi').group()
　　　　'abcdefab'
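Extracting a group by name looks like this. The sketch uses a pattern with a group named word and a backreference to it; a named group can still be reached by its number as well:

```python
import re

m = re.match(r"(?P<word>ab)cdef(?P=word)", "abcdefabghi")

by_name = m.group("word")   # look the group up by its name
by_number = m.group(1)      # the same group, by position
as_dict = m.groupdict()     # all named groups as a dict

print(by_name)    # ab
print(by_number)  # ab
print(as_dict)    # {'word': 'ab'}
```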

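Exercises:
Match a password 8 to 10 characters long that starts with a letter and consists of digits, letters and underscores
^[a-zA-Z]\w{7,9}$

Match an ID card number (17 digits followed by a digit or x)
\d{17}(\d|x)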
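One way to try the two exercise patterns is to wrap them in small helper functions. The helper names and the sample inputs below are invented for illustration, and the ID number is a made-up string, not a real one:

```python
import re

PASSWORD = re.compile(r"^[a-zA-Z]\w{7,9}$")
ID_NUMBER = re.compile(r"\d{17}(\d|x)")


def is_valid_password(candidate):
    """Return True if candidate is 8-10 word characters starting with a letter."""
    return PASSWORD.match(candidate) is not None


def is_valid_id(candidate):
    """Return True if the whole string is 17 digits plus a digit or 'x'."""
    # fullmatch is used because the ID pattern has no ^ and $ anchors.
    return ID_NUMBER.fullmatch(candidate) is not None


print(is_valid_password("abc_12345"))        # True
print(is_valid_password("9lives_ok"))        # False: starts with a digit
print(is_valid_password("short1"))           # False: too short
print(is_valid_id("11010519491231002x"))     # True
print(is_valid_id("1101051949123100"))       # False: only 16 digits
```

Note that the ID pattern as written accepts only a lowercase x; whether an uppercase X should also pass depends on the data being checked.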

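The re module

compile(pattern, flags=0)

Purpose: build a regular expression object

Parameters: pattern: the regular expression

    flags: flag bits that adjust how the regular expression matches

Return value: the corresponding regex object

Note: the methods of the object returned by compile overlap with the module-level functions of re.

(1) Similarities

    They do the same job

(2) Differences

    The methods of the compiled object take no pattern or flags parameters, because both were fixed when compile built the object; the module-level re functions need them passed in each time.

    The matching methods of the compiled object (such as findall, finditer, match and search) accept pos and endpos parameters that set where in the target string matching starts and ends; the module-level re functions do not have these two parameters.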

>>> obj = re.compile('abc')
>>> obj.findall('abcdef')
['abc']
>>>
>>> re.findall('abc','abcdef')
['abc']
>>>
>>> obj.findall('abcdef',pos=0, endpos=20)
['abc']
>>> obj.findall('abcdef',pos=4, endpos=20)
[]
>>>
>>>
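The pos and endpos parameters restrict the region that is searched, but they are not quite the same as slicing the string first. A small sketch (pattern and variable names are illustrative):

```python
import re

pat = re.compile(r"abc")

from_three = pat.findall("abcabc", 3)      # search from index 3 onward
first_half = pat.findall("abcabc", 0, 3)   # search only indices 0..2
middle = pat.findall("abcabc", 1, 5)       # region is "bcab", no full "abc"

# '^' still refers to the real start of the string, not to pos,
# so this differs from slicing the string before matching.
anchored = re.compile(r"^abc")
with_pos = anchored.findall("xabc", 1)
with_slice = anchored.findall("xabc"[1:])

print(from_three, first_half, middle)  # ['abc'] ['abc'] []
print(with_pos, with_slice)            # [] ['abc']
```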


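findall(string, pos, endpos)

Purpose: return everything the regular expression matches as a list

Parameters: string: the target string; pos, endpos: optional start and end positions

Return value: a list of the matched content

Note: if the regular expression contains subgroups, the list holds the subgroup contents instead of the whole match (tuples when there is more than one subgroup).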

# python3 regex.py
['hello', 'world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['Hello', 'world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['Hello_world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['_Hello_world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# cat regex.py
#!/usr/local/bin/python3

import re

pattern = r'\w+'
obj = re.compile(pattern)

l = obj.findall('_Hello_world')
print(l)
[root@shenzhen re]#
###########
[root@shenzhen re]# python3 regex1.py
[('ab', 'ef'), ('ab', 'ef')]
[root@shenzhen re]# cat regex1.py
#!/usr/local/bin/python3

import re

pattern = r'(ab)cd(ef)'
obj = re.compile(pattern)
l = obj.findall('abcdefaaagabcdef')
print(l)
[root@shenzhen re]#


With pattern = r'((ab)cd(ef))' the result is [('abcdef', 'ab', 'ef'), ('abcdef', 'ab', 'ef')]
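When the grouping is needed only for structure and findall should still return whole matches, the re module offers non-capturing groups written (?:...). A sketch on the same target string:

```python
import re

text = "abcdefaaagabcdef"

# Two capturing groups: findall returns a tuple per match.
capturing = re.findall(r"(ab)cd(ef)", text)

# Non-capturing groups: no groups are recorded, so whole matches come back.
non_capturing = re.findall(r"(?:ab)cd(?:ef)", text)

# One capturing group: findall returns just that group's text.
mixed = re.findall(r"(?:ab)cd(ef)", text)

print(capturing)      # [('ab', 'ef'), ('ab', 'ef')]
print(non_capturing)  # ['abcdef', 'abcdef']
print(mixed)          # ['ef', 'ef']
```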

split()

Purpose: split a string wherever the regular expression matches

Return value: a list of the pieces

eg: l1 = re.split(r'\s+','hello world nihao China')

---> ['hello', 'world', 'nihao', 'China']
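re.split has a few behaviours beyond the basic case. The sketch below, with made-up inputs, limits the number of splits with maxsplit, keeps the separators by capturing them, and splits on more than one delimiter:

```python
import re

# maxsplit=1 stops after the first split.
limited = re.split(r"\s+", "hello world nihao China", maxsplit=1)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"(,)", "a,b,c")

# A character set splits on several delimiters at once.
multi = re.split(r"[,;]\s*", "a, b;c,  d")

print(limited)          # ['hello', 'world nihao China']
print(with_separators)  # ['a', ',', 'b', ',', 'c']
print(multi)            # ['a', 'b', 'c', 'd']
```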

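sub(pattern, repl, string, count=0)

Purpose: replace whatever the regular expression matches in the target string with a replacement string

Parameters: repl: what to replace with

    string: the target string to match against

    count: the maximum number of replacements (0, the default, means replace all)

Return value: the string after replacement

eg: s = re.sub(r'[A-Z]','##','Hi,Tom, It is a fine day')

    ---> ##i,##om, ##t is a fine day

    s = re.sub(r'[A-Z]','##','Hi,Tom, It is a fine day', count=2)

    ---> ##i,##om, It is a fine day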
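The replacement passed to re.sub does not have to be a fixed string. In this sketch (the inputs and the helper double are invented for illustration), the replacement refers back to captured groups, and then is computed by a function:

```python
import re

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")


def double(match):
    """Replacement function: receives a match object, returns the new text."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits the number of replacements; passing it by keyword is clearest.
limited = re.sub(r"[A-Z]", "##", "Hi,Tom, It is a fine day", count=2)

print(swapped)  # host@user
print(doubled)  # 6 apples and 20 pears
print(limited)  # ##i,##om, It is a fine day
```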

subn()

Purpose: same as sub

Parameters: same as sub

Return value: a tuple of the new string and the number of replacements actually made

eg: s = re.subn(r'[A-Z]','##','Hi,Tom, It is a fine day', count=2)

('##i,##om, It is a fine day', 2)

s = re.subn(r'[A-Z]','##','Hi,Tom, It is a fine day')

('##i,##om, ##t is a fine day', 3)

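groupindex: attribute of a compiled pattern object; a mapping from each group name to its group number

groups: attribute of a compiled pattern object; the total number of subgroups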

{'word': 2, 'test': 3}
3
[root@shenzhen re]# cat regex1.py
#!/usr/local/bin/python3

import re

pattern = r'((?P<word>ab)cd(?P<test>ef))'
obj = re.compile(pattern)

print(obj.groupindex)
print(obj.groups)


finditer()

Purpose: like findall, finds everything the regex matches

Parameters: same as findall

Return value: an iterator; each item it yields is a match object
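Because finditer yields match objects rather than strings, it can report where each match was found, which findall cannot. A minimal sketch:

```python
import re

text = "foo,food on the table"

# Collect the matched text together with its (start, end) position.
positions = [(m.group(), m.span()) for m in re.finditer(r"foo", text)]

print(positions)  # [('foo', (0, 3)), ('foo', (4, 7))]
```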

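match(pattern, string, flags=0)

Purpose: match at the start of a string

Parameters: string: the target string

Return value: a match object if the pattern matches

        None if it does not

search(pattern, string, flags=0)

Purpose: like match, but the match may be anywhere in the string; only the first occurrence is returned

Parameters: string: the target string

Return value: a match object if the pattern matches

        None if it does not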
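Since match and search return None on failure, a common way to use them is to test the result before calling group(). The sketch below compares match, search and fullmatch on sample strings (the helper first_foo is invented for illustration):

```python
import re

text = "Foo,food on the table"

at_start = re.match(r"foo", text)         # None: the text starts with "Foo"
anywhere = re.search(r"foo", text)        # finds "foo" inside "food"
whole_bad = re.fullmatch(r"foo", "food")  # None: "d" is left over
whole_ok = re.fullmatch(r"foo", "foo")


def first_foo(s):
    """Return the first 'foo' found in s, or None if there is none."""
    m = re.search(r"foo", s)
    return m.group() if m is not None else None


print(at_start)         # None
print(anywhere.span())  # (4, 7)
print(whole_bad)        # None
print(whole_ok.group()) # foo
print(first_foo("bar")) # None
```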

import re

obj = re.compile(r'foo')

iter_obj = obj.finditer('foo,food on the table')

for i in iter_obj:
    print(i.group())
    # print(dir(i))
# match only matches at the start of the string
print("*********************")
try:
    m_obj = obj.match('Foo,food on the table')
    print(m_obj.group())
except AttributeError:
    print("match none")

# search matches anywhere in the string
print("*********************")
try:
    m_obj = obj.search('Foo,food on the table')
    print(m_obj.group())
except AttributeError:
    print("match none")
#######################
# python3 regex2.py
foo
foo
*********************
match none
*********************
foo


fullmatch()

The whole target string must be matched by the regular expression

>>> obj = re.fullmatch('\w+','abcd1')
>>> obj.group()
'abcd1'
>>>

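Match object attributes and methods

Attributes:

're' # the regular expression object that produced the match

'pos' # the position in the target string where the search started

'endpos' # the position in the target string where the search ended

'lastgroup' # the name of the last matched group (None if it has no name)
'lastindex' # the number of the last matched group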

[root@shenzhen re]# cat regex3.py
#!/usr/local/bin/python3

import re

re_obj = re.compile('(ab)cd(?P<dog>ef)')

match_obj = re_obj.search('hi,abcdefghigk')

print('re:', match_obj.re)
print('pos:', match_obj.pos)
print('endpos:', match_obj.endpos)

print('lastgroup:', match_obj.lastgroup)
print('lastindex:', match_obj.lastindex)
print('*'*50)


print('search : ', match_obj.group())
[root@shenzhen re]#
[root@shenzhen re]# python3 regex3.py
re: re.compile('(ab)cd(?P<dog>ef)')
pos: 0
endpos: 14
lastgroup: dog
lastindex: 2
**************************************************
search :  abcdef


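Methods:
'end' # the end position of the match in the string
'start' # the start position of the match in the string

'span' # the (start, end) positions of the match in the string

'group' # the content matched by the match object

Parameter: defaults to 0, meaning the whole match; a value >= 1 (or a group name) selects one subgroup

Return value: the corresponding string

'groups' # the contents of all subgroups, as a tuple
'groupdict' # a dictionary of all named groups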

# cat regex3.py
#!/usr/local/bin/python3

import re

re_obj = re.compile('(ab)cd(?P<dog>ef)')

match_obj = re_obj.search('hi,abcdefghigk')

################
print('start():',match_obj.start())
print('end():',match_obj.end())
print('span():',match_obj.span())
print('group():',match_obj.group())
print('group(1):',match_obj.group(1))
print('group(2):',match_obj.group(2))
print('groups()',match_obj.groups())
print('groupdict():',match_obj.groupdict())


#print('search : ', match_obj.group())
#######################
start(): 3
end(): 9
span(): (3, 9)
group(): abcdef
group(1): ab
group(2): ef
groups() ('ab', 'ef')
groupdict(): {'dog': 'ef'}
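The position methods also accept a group number or name, which the script above does not show. A short sketch on the same pattern and target string (variable names are illustrative; the m["dog"] form needs Python 3.6 or later):

```python
import re

m = re.search(r"(ab)cd(?P<dog>ef)", "hi,abcdefghigk")

first_span = m.span(1)       # position of group 1 only
dog_start = m.start("dog")   # start of the named group
dog_text = m["dog"]          # shorthand for m.group("dog")
both = m.group(1, "dog")     # several groups at once, as a tuple

print(first_span)  # (3, 5)
print(dog_start)   # 7
print(dog_text)    # ef
print(both)        # ('ab', 'ef')
```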


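flags: most matching functions called directly on re take a flags parameter. It holds flag bits that adjust how the regular expression matches.

Reading the output of dir(re):
Names with __ on both sides: magic (special) methods
All uppercase: module-level constants
Capitalized first letter, lowercase after: classes

All lowercase: functions, attributes and methods

I, IGNORECASE # ignore case when matching

S, DOTALL # lets the . metacharacter match newlines too

M, MULTILINE # the start and end of every line count; affects the ^ and $ metacharacters

X, VERBOSE # allows whitespace and comments inside the regex

To pass several flags at once:

re.I | re.S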

[root@shenzhen re]# python3 regex4.py
['abcd', 'ABcd', 'ABCD']
['hello world', '', 'nihao china', '', '']
['hello world', 'nihao china']
['hello world\nnihao china\n', '']
hello
abcdef
[root@shenzhen re]# cat regex4.py
#!/usr/local/bin/python3

import re

re_obj = re.compile('abcd',re.I)

l = re_obj.findall('hi,abcd,ABcd, ABCD')
print(l)

s = '''hello world
nihao china
'''

l1 = re.findall('.*',s)
print(l1)

l2 = re.findall('.+',s)
print(l2)

l3 = re.findall('.*',s,re.S)
print(l3)

obj = re.search('^hello',s)
print(obj.group())

#objj = re.search('^\snihao',s,re.M).group()
#print(objj)

re_obj = re.compile('''(ab)#This is group1
                    cd
                    (?P<dog>ef)#This is group dog
                    ''',re.X)
print(re_obj.search('abcdefghi').group())

[root@shenzhen re]#
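The MULTILINE flag is listed above but only appears in a commented-out line of the script. The following sketch, on the same two-line string, shows its effect on ^ and $ and combines two flags with |:

```python
import re

s = "hello world\nnihao china\n"

# Without re.M, ^ only matches at the start of the whole string.
first_word_default = re.findall(r"^\w+", s)

# With re.M, ^ and $ also match at the start and end of every line.
first_words = re.findall(r"^\w+", s, re.M)
last_words = re.findall(r"\w+$", s, re.M)

# Flags are combined with |: ignore case and let . cross the newline.
combined = re.findall(r"HELLO.+china", s, re.I | re.S)

print(first_word_default)  # ['hello']
print(first_words)         # ['hello', 'nihao']
print(last_words)          # ['world', 'china']
print(combined)            # ['hello world\nnihao china']
```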


Exercise 1:

import re
import time 
import sys

# match the specific content
def reg(data,port):
    pattern = r'^\S+'
    re_obj = re.compile(pattern)
    try:
        head_word = re_obj.match(data).group()
    except Exception:
        return None
    if port == head_word:
        pattern = r'address is (\w{4}\.\w{4}\.\w{4})'
        try:
            match_obj = re.search(pattern,data)
            return match_obj.group(1)
        except Exception:
            return None
    else:
        return None



def main(port):
    # open the file with a context manager so it is always closed
    with open('1.txt','r') as fd:
        # skip the first two lines
        fd.readline()
        fd.readline()

        while True:
            data = ''
            while True:
                s = fd.readline()
                if s == '\n':
                    break
                if s == '':
                    print("search over")
                    return
                data += s
            # pass each block of data to reg() for matching
            result = reg(data,port)
            if result:
                print("address is :",result)
                return
 
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("argv error")
        sys.exit(1)
    main(sys.argv[1])


Posted on 2018-07-24 21:34 by 微子天明
