Jul_31 PYTHON REGULAR EXPRESSIONS
1.Special Symbols and Characters
1.1 single regex 1
. ,Match any character(except \n)
^ ,Match start of string
$ ,Match end of string
* ,Match 0 or more occurrences preceding regex
+ ,Match 1 or more occurrences preceding regex
? ,Match 0 or 1 occurrence preceding regex
{N} ,Match N occurrences preceding regex
{M,N} ,Match from M to N occurrences preceding regex
[...] ,Match any single character from character class
[..x-y..] ,Match any single character in the range from x to y;e.g. ["-a] matches,in an ASCII system,every character that falls between '"' and 'a',that is,between ordinals 34 and 97.
[^...] ,Do not match any character from character class ,including any ranges ,if present
(*|+|?|{})? ,Apply "non-greedy" versions of the above occurrence/repetition symbols;by default *,+,? and {} are greedy,and appending '?' to any of them makes it non-greedy.
(...) ,Match enclosed regex and save as subgroup .
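A few of the symbols above in action. This is only a small sketch; the sample strings are made up for illustration and are not from the notes.

```python
import re

# '.' matches any character except a newline
dot = re.findall(r'b.t', 'bat bit b\nt')
# '^' and '$' anchor the pattern to both ends of the string
anchored = re.search(r'^ab+c$', 'abbbc') is not None
# '?' makes the preceding item optional
optional = re.findall(r'colou?r', 'color colour')
# {M,N} bounds the number of repetitions
bounded = re.findall(r'a{2,3}', 'a aa aaa aaaa')
# a range inside a class: from '"' (ordinal 34) to 'a' (ordinal 97)
in_range = re.findall(r'["-a]', 'a b!~"Z')
# [^...] negates the class: runs of characters that are not vowels or spaces
negated = re.findall(r'[^aeiou ]+', 'regular expression')

print(dot)        # ['bat', 'bit']
print(anchored)   # True
print(optional)   # ['color', 'colour']
print(bounded)    # ['aa', 'aaa', 'aaa']
print(in_range)   # ['a', '"', 'Z']
print(negated)    # ['r', 'g', 'l', 'r', 'xpr', 'ss', 'n']
```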
1.2 single regex 2
\d ,Match any decimal digit ,same as [0-9](\D is inverse of \d:do not match any numeric digit)
\w ,Match any alphanumeric character or underscore,same as [A-Za-z0-9_] for ASCII text(\W is inverse of \w)
\s ,Match any whitespace character,same as [ \n\t\r\v\f](\S is inverse of \s)
\b ,Match any word boundary(\B is inverse of \b)
\N ,Match saved subgroup N(see (...) above);e.g. \1,\3,\16
\c ,Match the special character c literally,i.e. without its special meaning(escape);e.g. \.,\\,\*
\A(\Z) ,Match start (end) of string (also see ^ and $ above)
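The special sequences above can be tried out as follows. A minimal sketch; the sample text is invented for illustration.

```python
import re

text = 'order_id 42, total 3.50 USD'

digits = re.findall(r'\d+', text)        # runs of decimal digits
words = re.findall(r'\w+', text)         # \w also matches the underscore
pieces = re.split(r'\s+', 'a \t b\nc')   # split on any run of whitespace
# \b: match 'the' only as a whole word
whole = re.findall(r'\bthe\b', 'the other theme, the end')
# \1 refers back to whatever group 1 matched: find a doubled word
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test').group(1)
# '\.' is a literal dot, a bare '.' would also accept 'x'
escaped = re.findall(r'\d\.\d', '1.5 and 1x5')

print(digits)    # ['42', '3', '50']
print(words)     # ['order_id', '42', 'total', '3', '50', 'USD']
print(pieces)    # ['a', 'b', 'c']
print(whole)     # ['the', 'the']
print(doubled)   # is
print(escaped)   # ['1.5']
```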
1.3 complex regex
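(?=...) ,Positive lookahead assertion. It succeeds if the enclosed regex (shown here as ...) matches at the current position,and fails otherwise. The assertion consumes no characters:once the enclosed regex has been tried,the rest of the pattern continues from the position where the assertion started. Example:love(?=FishC) matches the string love only when it is immediately followed by FishC.
(?!...) ,Negative lookahead assertion. The opposite of the positive lookahead (it succeeds when the enclosed regex does not match,and fails when it does). Example:FishC(?!\.com) matches the string FishC only when it is not followed by .com.
(?<=...) ,Positive lookbehind assertion. Same as the positive lookahead,but in the opposite direction. Example:(?<=love)FishC matches the string FishC only when it is immediately preceded by love.
(?<!...) ,Negative lookbehind assertion. Same as the negative lookahead,but in the opposite direction. Example:(?<!FishC)\.com matches the string .com only when it is not preceded by FishC.
(?:...) ,Non-capturing group:the string matched by this subgroup cannot be retrieved afterwards.
(?(id/name)yes-pattern|no-pattern) ,1. If the subgroup with the given number or name took part in the match,try yes-pattern;otherwise try no-pattern;
2. no-pattern is optional
Example:(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a regex for e-mail addresses. Anchored at the start of the string (as with re.match),it matches '<user@fishc.com>' and 'user@fishc.com',but not '<user@fishc.com' or 'user@fishc.com>'.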
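The four assertions and the non-capturing group can be checked with a short sketch like the one below. The patterns are the ones from the list above; the test strings are made up.

```python
import re

# (?=...): 'love' matches only when 'FishC' follows, and the lookahead
# itself consumes nothing, so the match ends right after 'love'
m = re.search(r'love(?=FishC)', 'loveCats loveFishC')
ahead = (m.group(), m.span())

# (?!...): start offsets of 'FishC' that is NOT followed by '.com'
not_com = [x.start() for x in re.finditer(r'FishC(?!\.com)', 'FishC.com FishC.org')]

# (?<=...): start offsets of 'FishC' that IS preceded by 'love'
behind = [x.start() for x in re.finditer(r'(?<=love)FishC', 'loveFishC hateFishC')]

# (?<!...): start offsets of '.com' that is NOT preceded by 'FishC'
not_behind = [x.start() for x in re.finditer(r'(?<!FishC)\.com', 'FishC.com python.com')]

# (?:...): groups the repetition without creating a numbered group,
# so the digit is still group 1
non_capturing = re.search(r'(?:ab)+(\d)', 'abab1').groups()

print(ahead)          # ('love', (9, 13))
print(not_com)        # [10]
print(behind)         # [4]
print(not_behind)     # [16]
print(non_capturing)  # ('1',)
```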
1.4 Example: matching e-mail addresses
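import re
data = 'z843248880@163.com'
data1 = '<z843248880@163.com>'
data2 = '<z843248880@163.com'
data3 = 'z843248880@163.com>'
# raw strings keep the backslashes intact
p1 = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)'
p2 = r'\w+@\w+\.\w+'
p3 = r'(<)?\w+@\w+\.\w+(?(1)>|$)'
m1 = re.match(p3, data3)
# data3 has '>' without '<',so p3 fails and m1 is None;guard before calling group()
print(m1.group() if m1 else None)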
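PS:in p1,(?:\.\w+) means that the string matched by "\.\w+" is not captured for later retrieval;"(?(1)>|$)" in p1 means that if a "<" was matched earlier,a ">" must match here,and if there was no "<",the end-of-string anchor "$" must match here;"(1)" refers to the first parenthesized group,that is,"(<)". p1 and p3 give the same result on the four test strings above (p1 additionally accepts domains with more than one dot). p2 does not check the brackets at all,so it cannot tell the cases with only "<" or only ">" from valid input.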
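To compare the three patterns, one can run each of them against all four test strings. The sketch below does this; the helper names accepted_by_match and accepted_by_search and the extra address user@mail.fishc.com are introduced here for illustration only.

```python
import re

samples = ['z843248880@163.com', '<z843248880@163.com>',
           '<z843248880@163.com', 'z843248880@163.com>']
p1 = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)'
p2 = r'\w+@\w+\.\w+'
p3 = r'(<)?\w+@\w+\.\w+(?(1)>|$)'

def accepted_by_match(pattern):
    # re.match anchors the pattern at the start of each string
    return [re.match(pattern, s) is not None for s in samples]

def accepted_by_search(pattern):
    # re.search may start the match anywhere in the string
    return [re.search(pattern, s) is not None for s in samples]

r1 = accepted_by_match(p1)
r3 = accepted_by_match(p3)
r2_match = accepted_by_match(p2)
r2_search = accepted_by_search(p2)

# p1 and p3 differ once the domain has more than one dot
multi_dot = 'user@mail.fishc.com'
p1_multi = re.match(p1, multi_dot) is not None
p3_multi = re.match(p3, multi_dot) is not None

print(r1)         # [True, True, False, False]
print(r3)         # [True, True, False, False]
print(r2_match)   # [True, False, False, True]
print(r2_search)  # [True, True, True, True]
print(p1_multi, p3_multi)  # True False
```

With re.match, p2 rejects every string that starts with "<", including the valid data1, and still accepts data3; with re.search it accepts all four. In neither case does it verify that the brackets are balanced.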
1.5 The re Module:Core Functions and Methods
match(pattern,string,flags=0) ,Attempt to match pattern to string with optional flags;return match object on success,None on failure;the match must begin at the start of the string.
search(pattern,string,flags=0) ,Search for first occurrence of pattern within string with optional flags;return match object on success,None on failure;the match may begin anywhere in the string.
findall(pattern,string[,flags=0]) ,Look for all occurrences of pattern in string;return a list of matches.
finditer(pattern,string[,flags=0]) ,Same as findall(),except returns an iterator instead of a list;for each match,the iterator returns a match object.
split(pattern,string,maxsplit=0) ,Split string into a list,using the regex pattern as the delimiter,and return the list of pieces,splitting at most maxsplit times(the default 0 splits at every occurrence)
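The difference between these functions is easiest to see side by side. A small sketch with a made-up input string:

```python
import re

text = 'foo 12 bar 345'

m_match = re.match(r'\d+', text)      # None: the string does not start with a digit
m_search = re.search(r'\d+', text)    # first occurrence, anywhere in the string
all_found = re.findall(r'\d+', text)  # every occurrence, as a list of strings
# finditer yields match objects, so positions are available too
spans = [m.span() for m in re.finditer(r'\d+', text)]
# maxsplit=1 stops after the first split
parts = re.split(r'\s+', text, maxsplit=1)

print(m_match)           # None
print(m_search.group())  # 12
print(all_found)         # ['12', '345']
print(spans)             # [(4, 6), (11, 14)]
print(parts)             # ['foo', '12 bar 345']
```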
1.6 the usage of "?i" and "?m"
>>> import re
>>> re.findall(r'(?i)yes','yes Yes YES')
['yes', 'Yes', 'YES']
>>> re.findall(r'(?i)th\w+','The quickest way is through to this tunnel.')
['The', 'through', 'this']
>>> re.findall(r'(?im)(^th[\w ]+)','''
... this line is the first,
... another line,
... that line,it's the best.
... ''')
['this line is the first', 'that line']
>>> re.findall(r'(?i)(^th[\w ]+)','''
... this line is the first,
... another line,
... that line ,it's the best.
... ''')
[]
>>> re.findall(r'(?i)(^th[\w \n,]+)','''
... this line is th,
... anonjkl line,
... that line,it the best.
... ''')
[]
By using "multiline" we can perform the search across multiple lines of the target string rather than treating the entire string as a single entity.
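The effect of the multiline flag on ^ and $ can be sketched as follows. The text is a variation of the one used above, but without the leading newline, so that the single-line case also finds something; passing re.I | re.M as the flags argument is equivalent to the inline (?im) form.

```python
import re

text = 'this line is the first,\nanother line,\nthat line, it is the best.\n'

# without (?m), '^' matches only at the very start of the string
single = re.findall(r'(?i)^th[\w ]+', text)
# with (?m), '^' also matches right after every newline
multi = re.findall(r'(?im)^th[\w ]+', text)
# the same flags passed as an argument instead of inline
flags_arg = re.findall(r'^th[\w ]+', text, re.I | re.M)
# with (?m), '$' matches just before every newline
line_ends = re.findall(r'(?m)\w+[,.]$', text)

print(single)     # ['this line is the first']
print(multi)      # ['this line is the first', 'that line']
print(flags_arg)  # ['this line is the first', 'that line']
print(line_ends)  # ['first,', 'line,', 'best.']
```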
1.7 the usage of split
re.split(r'\s\s+',eachline) ,split on at least two whitespace characters.
re.split(r'\s\s+|\t',eachline.rstrip()) ,split on at least two whitespace characters or a single tab;rstrip() removes the trailing '\n'.
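A sketch of the two calls on one line of column-style output. The content of eachline is invented here (columns separated by runs of spaces or by a single tab); the notes do not show the real input.

```python
import re

# hypothetical input line: 4 spaces, then a tab, then 2 spaces between the columns
eachline = 'alice    pts/0\t2015-07-31 10:02  (10.0.0.5)\n'

# only runs of two or more whitespace characters split the line;
# the single tab and the trailing newline are left in place
by_spaces = re.split(r'\s\s+', eachline)
# a single tab now splits as well, and rstrip() drops the trailing newline
by_spaces_or_tab = re.split(r'\s\s+|\t', eachline.rstrip())

print(by_spaces)         # ['alice', 'pts/0\t2015-07-31 10:02', '(10.0.0.5)\n']
print(by_spaces_or_tab)  # ['alice', 'pts/0', '2015-07-31 10:02', '(10.0.0.5)']
```

The single space inside '2015-07-31 10:02' never splits, because both patterns require at least two whitespace characters or a tab.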
1.8 one example
from random import randrange,choice
from string import ascii_lowercase as lc
from time import ctime
tlds = ('com','org','net','gov','edu')
# generate 5 to 10 lines of test data
for i in range(randrange(5,11)):
    dtint = randrange(1469880872)    # random timestamp,in seconds since the epoch
    dstr = ctime(dtint)    # the same timestamp as a date string
    llen = randrange(4,8)    # the login is 4 to 7 characters long
    login = ''.join(choice(lc) for j in range(llen))
    dlen = randrange(llen,13)    # the domain is at least as long as the login,at most 12
    dom = ''.join(choice(lc) for j in range(dlen))
    print('%s::%s@%s.%s::%d-%d-%d' % (dstr,login,dom,choice(tlds),dtint,llen,dlen))
result:
Sat Nov 7 01:09:06 1998::hbtua@yzhnjyjanwuq.gov::910372146-5-12
Sat Oct 17 09:27:56 2015::djbljsf@uidicjppd.gov::1445045276-7-9
Sun Nov 18 06:10:07 1979::fkobvlf@zlnlyjej.org::311724607-7-8
Wed Jul 23 17:23:03 1986::hovwgi@wiidgvnng.net::522490983-6-9
Tue Feb 24 02:15:27 1998::xnuab@sgahgahv.gov::888257727-5-8
Thu Jun 1 14:20:55 1989::rdwqhu@xzazufffut.net::612681655-6-10
Mon Mar 6 14:36:59 1978::qabkezi@sehnxqcuxexf.net::258014219-7-12
Sun Apr 11 15:01:56 1982::agzp@sygikhagdasq.gov::387356516-4-12
1.9 Matching a string
import re
data = 'Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28'
#pat_old = '^Mon|^Tue|^Wed|^Thu|^Fri|^Sat|^Sun'
pat = '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
m = re.match(pat, data)
print(type(m))
print(m.group(0))
# any three word characters at the start
pat2 = r'^(\w{3})'
m2 = re.match(pat2, data)
print(type(m2))
print(m2.group(1))
# greedy '.+' leaves only one digit for the first \d+
pa3 = r'.+(\d+-\d+-\d+)'
m3 = re.search(pa3, data)
print(type(m3))
print(m3.group())
m4 = re.match(pa3, data)
print(m4.group(1))
# non-greedy '.+?' stops as early as possible
pa4 = r'.+?(\d+-\d+-\d+)'
m5 = re.match(pa4, data)
print(m5.group(1))
# the literal '::' pins down where the digits start
pa5 = r'.+::(\d+-\d+-\d+)'
m6 = re.match(pa5, data)
print(m6.group(1))
result:
<class '_sre.SRE_Match' at 0x89df00>
Wed
<class '_sre.SRE_Match' at 0x89df00>
Wed
<class '_sre.SRE_Match' at 0x89df00>
Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28
6-347-28 //greedy:'.+' takes as much as it can and leaves only one digit for the first \d+
1437525736-347-28 //non-greedy:the '?' after '.+' makes it match as little as possible(see 1.1 above)
1437525736-347-28
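The same line can also be taken apart in one pass with several groups. This is only a sketch of one possible pattern, not something from the notes; the variable names are made up.

```python
import re

data = 'Wed Jul 22 08:42:15 2015::qaolc@ombddhysxuv.com::1437525736-347-28'

# one group per field: weekday, login, domain, TLD and the three integers
pat = r'^(\w{3}) .+::(\w+)@(\w+)\.(\w+)::(\d+)-(\d+)-(\d+)$'
m = re.match(pat, data)
day, login, dom, tld, stamp, llen, dlen = m.groups()

print(day, login, dom, tld)  # Wed qaolc ombddhysxuv com
print(stamp, llen, dlen)     # 1437525736 347 28
```

The greedy '.+' first runs to the end of the line and then backtracks to the first '::', because only there can '(\w+)@' match.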
1.10 greedy and non-greedy
'.+' is greedy; '.+?' is not greedy.
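A short sketch of the difference, using a made-up string with tags:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# greedy: '.+' runs to the last '>' in the string
greedy = re.findall(r'<.+>', html)
# non-greedy: '.+?' stops at the first '>' it can
lazy = re.findall(r'<.+?>', html)
# the same applies to {M,N}: greedy takes up to N, non-greedy settles for M
greedy_count = re.search(r'\d{2,4}', '12345').group()
lazy_count = re.search(r'\d{2,4}?', '12345').group()

print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(greedy_count)  # 1234
print(lazy_count)    # 12
```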
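2.Use a regex to extract the letters and the digit,and print them separately
import re
s = 'python1班'    # avoid naming the variable str,which would shadow the built-in
m = re.search(r'(\w+)(\d)', s)
print(m.group(0))    # the whole match
print(m.group(1))    # what the first parenthesized group matched
print(m.group(2))    # what the second parenthesized group matched
result:
python1
python
1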