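Python regular expressions

Regular expressions

A regular expression is a string made up of ordinary text and special characters; it describes a pattern that can match a variety of strings.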

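Special symbols and characters

Notation     Description                                                               Example
literal      Matches the literal text of the string                                    foo
re1|re2      Matches re1 or re2                                                        foo|bar
.            Matches any character except \n                                           b.b
^            Matches at the start of the string                                        ^a (starts with a)
$            Matches at the end of the string                                          ^/bin/*sh$
*            Matches 0 or more occurrences of the preceding expression                 [0-9]*
+            Matches 1 or more occurrences of the preceding expression                 [0-9]+
?            Matches 0 or 1 occurrence of the preceding expression                     [0-9]?
{N}          Matches exactly N occurrences of the preceding expression                 [0-9]{3}
{M,N}        Matches M to N occurrences of the preceding expression                    [0-9]{3,7}
[...]        Matches any one character inside the brackets                             [abc]
[..x-y..]    Matches any one character in the range x to y                             [0-9a-zA-Z]
[^...]       Matches any one character not inside the brackets                         [^0-9a-zA-Z]
(*|+|?|{})?  Non-greedy form of the repetition symbols above (*, +, ?, {})             .*?[a-z]
(...)        Matches the enclosed expression and saves it as a subgroup                ([0-9]{3})?, f(oo|u)bar
             (grouping: extracts part of the already matched data)
\d           Matches a decimal digit, same as [0-9]; \D is the inverse                 data\d+.txt
\w           Matches a word character ([A-Za-z0-9_] in ASCII); \W is the inverse       [A-Za-z_]\w+
\s           Matches a whitespace character, same as [ \n\t\r\v\f]; \S is the inverse  of\sthe
\b           Matches a word boundary; \B is the inverse                                \bThe\b
\N           Matches saved group N (see (...) above)                                   Price:\16
\c           Matches the special character c literally                                 \. \\ \*
\A (\Z)      Matches the start (end) of the string (see ^ and $)                       \ADear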
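
A few of the notations above can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original post.

```python
import re

# Greedy vs. non-greedy: .* takes as much as possible, .*? as little as possible.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.search(r'<.*>', html).group()
lazy = re.search(r'<.*?>', html).group()
print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# Grouping with alternation: f(oo|u)bar matches "foobar" or "fubar".
words = [w for w in ['foobar', 'fubar', 'fbar'] if re.fullmatch(r'f(oo|u)bar', w)]
print(words)  # ['foobar', 'fubar']

# Backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test').group(1)
print(doubled)  # is
```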

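Flags:

Flag                 Description
re.I, re.IGNORECASE  Case-insensitive matching
re.L, re.LOCALE      Makes \w, \W, \b, \B, \s and \S depend on the current locale
re.M, re.MULTILINE   ^ and $ match at the start and end of each line of the target string, not only at the start and end of the whole string
re.S, re.DOTALL      "." normally matches any single character except \n; with this flag "." matches every character, including \n
re.X, re.VERBOSE     Whitespace in the pattern, and # together with the rest of its line, is ignored unless it is escaped with a backslash or sits inside a character class; this allows comments and improves readability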
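
The effect of each flag is easiest to see side by side. This is a small sketch with a made-up two-line string; variable names such as `line_starts` are only illustrative.

```python
import re

text = 'first line\nSecond LINE'

# re.I: ignore case.
ignore_case = re.findall(r'line', text, flags=re.I)
print(ignore_case)  # ['line', 'LINE']

# re.M: ^ matches at the start of every line, not only at the start of the string.
first_words = re.findall(r'^\w+', text)
line_starts = re.findall(r'^\w+', text, flags=re.M)
print(first_words)  # ['first']
print(line_starts)  # ['first', 'Second']

# re.S: '.' also matches the newline, so the match can span both lines.
no_dotall = re.search(r'first.*LINE', text)
with_dotall = re.search(r'first.*LINE', text, flags=re.S)
print(no_dotall)                    # None
print(with_dotall.group() == text)  # True

# re.X: whitespace and # comments inside the pattern are ignored.
date_pat = re.compile(r'''
    (\d{1,2})  # month
    /
    (\d{1,2})  # day
    /
    (\d{4})    # year
''', flags=re.X)
parts = date_pat.match('11/27/2012').groups()
print(parts)  # ('11', '27', '2012')

# Flags are combined with bitwise OR.
combined = re.findall(r'^second', text, flags=re.I | re.M)
print(combined)  # ['Second']
```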


1.1 re.compile

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
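If you plan to run many match and search operations, compile the regular expression first so that it can be reused. The module-level functions cache recently compiled patterns, so calling them does not cost much, but a precompiled pattern avoids the cache lookup and some extra processing.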
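
As a rough illustration of the compile-once, reuse-many-times idea, the sketch below checks several strings against one compiled pattern. The list `lines` is invented sample data.

```python
import re

# Compile once, then reuse the pattern object.
date_pat = re.compile(r'(\d+)/(\d+)/(\d+)')

lines = ['11/27/2012', 'not a date', '3/13/2013']
matched = [line for line in lines if date_pat.match(line)]
print(matched)  # ['11/27/2012', '3/13/2013']

# The compiled object offers the same operations as the module-level functions.
found = date_pat.search('due 3/13/2013').group()
print(date_pat.pattern)  # (\d+)/(\d+)/(\d+)
print(found)             # 3/13/2013
```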

1.2 re.search

re.search(pattern, string, flags=0)
Scans the whole string and returns the first match; returns None if nothing matches.
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
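
A short sketch of how re.search differs from re.match on the same input; the sample sentence is reused from the examples later in this post.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013'

# search finds the first match anywhere in the string.
m = re.search(r'\d+/\d+/\d+', text)
first = m.group()
span = m.span()
print(first)  # 11/27/2012
print(span)   # (9, 19)

# match only tries at the beginning, so it fails here.
at_start = re.match(r'\d+/\d+/\d+', text)
print(at_start)  # None

# No run of five digits exists in the text, so search returns None.
miss = re.search(r'\d{5}', text)
print(miss)  # None
```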

1.3 re.match

re.match(pattern, string, flags=0)
Matches from the start of the string; returns a match object on success, otherwise None.
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
For example:
import re
# match a date string format
text = '11/27/2012'
if re.match(r'\d+/\d+/\d+', text):
    print('yes')
else:
    print('no')
m = re.match(r'\d+/\d+/\d+', text)
print(m.group())  # 11/27/2012

1.4 re.fullmatch

re.fullmatch(pattern, string, flags=0)
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
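
The difference between re.match and re.fullmatch shows up when the string has trailing text. A minimal sketch with a made-up input:

```python
import re

pattern = r'\d+/\d+/\d+'
text = '11/27/2012abc'

# match accepts a matching prefix and ignores the trailing 'abc'.
prefix = re.match(pattern, text)
print(prefix.group())  # 11/27/2012

# fullmatch requires the whole string to match.
whole = re.fullmatch(pattern, text)
print(whole)  # None

exact = re.fullmatch(pattern, '11/27/2012')
print(exact.group())  # 11/27/2012
```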

1.5 re.split

re.split(pattern, string, maxsplit=0, flags=0)
Splits the string into a list, using matches of the pattern as delimiters, and returns that list; at most maxsplit splits are performed.
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
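
The sketch below shows the three behaviours described above: a plain split, a split with a capturing group, and a limited split. The input `line` is invented for illustration.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# Split on a semicolon, comma or whitespace, followed by optional whitespace.
fields = re.split(r'[;,\s]\s*', line)
print(fields)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

# With a capturing group, the delimiters are returned as well.
with_delims = re.split(r'(;|,|\s)\s*', line)
print(with_delims)
# ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

# maxsplit=2: two splits, the rest of the string is the last element.
limited = re.split(r'[;,\s]\s*', line, maxsplit=2)
print(limited)  # ['asdf', 'fjdk', 'afed, fjek,asdf, foo']
```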

1.6 re.findall

re.findall(pattern, string, flags=0)
Finds every occurrence of the pattern in the string and returns the matches as a list.
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
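
How the return value changes with the number of groups can be sketched as follows, reusing the sample sentence from the re.sub examples:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013'

# No groups: a list of the full matches.
dates = re.findall(r'\d+/\d+/\d+', text)
print(dates)  # ['11/27/2012', '3/13/2013']

# One group: a list of that group's text only.
years = re.findall(r'\d+/\d+/(\d+)', text)
print(years)  # ['2012', '2013']

# Several groups: a list of tuples, one tuple per match.
parts = re.findall(r'(\d+)/(\d+)/(\d+)', text)
print(parts)  # [('11', '27', '2012'), ('3', '13', '2013')]
```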

1.7 re.finditer

re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
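
Unlike findall, finditer yields match objects, so positions and groups are both available. A minimal sketch; the `results` list is just a convenient way to collect the output.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013'

results = []
for m in re.finditer(r'(\d+)/(\d+)/(\d+)', text):
    month, day, year = m.groups()
    # Record where the match starts and a reordered form of the date.
    results.append((m.start(), f'{year}-{month}-{day}'))

print(results)  # [(9, '2012-11-27'), (34, '2013-3-13')]
```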

1.8 re.sub

re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
Example 1: # replace each date with 'today'
import re
text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
datePat = re.compile(r'\d+/\d+/\d+')
m = datePat.sub('today', text1)
print(m)  # Today is today. PyCon starts today
Example 2: # change the date format, turning 11/27/2012 into 2012-11-27
text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
datePat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datePat.sub(r'\3-\1-\2', text1)
print(m)  # Today is 2012-11-27. PyCon starts 2013-3-13

Example 3: # for more complex substitutions, pass a callback function as repl
import re
def dashrepl(matchobj):
    print(matchobj.group(0))  # prints --, -- and - on successive calls
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'
m = re.sub('-{1,2}', dashrepl, 'pro----gram-files')
print(m)  # pro--gram files
Example 4: # a callback that picks the replacement according to the case of the matched text
import re
text = 'UPPER PYTHON, lower python, Mixed Python'
def matchcase(word):
    print(word)
    # <re.Match object; span=(6, 12), match='PYTHON'>
    # <re.Match object; span=(20, 26), match='python'>
    # <re.Match object; span=(34, 40), match='Python'>
    # (Python versions before 3.7 show _sre.SRE_Match instead of re.Match)
    if word.group() == 'PYTHON':
        return 'SNAKE'
    elif word.group() == 'python':
        return 'snake'
    elif word.group() == 'Python':
        return 'Snake'
m = re.sub('python', matchcase, text, flags=re.IGNORECASE)
print(m)  # UPPER SNAKE, lower snake, Mixed Snake
Example 5: # a general version
import re
text = 'UPPER PYTHON, lower python, Mixed Python'
def matchcase(word):
    # word is 'snake'
    def replace(m):
        # m is the match object for 'PYTHON', 'python' and 'Python' in turn
        text = m.group()  # PYTHON python Python
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
m = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(m)  # UPPER SNAKE, lower snake, Mixed Snake

1.9 re.subn

re.subn(pattern, repl, string, count=0, flags=0)
Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
Example 1: # change the date format, turning 11/27/2012 into 2012-11-27, and count the substitutions
import re
text1 = 'Today is 11/27/2012. PyCon starts 3/13/2013'
datePat = re.compile(r'(\d+)/(\d+)/(\d+)')
m, n = datePat.subn(r'\3-\1-\2', text1)
print(m)  # Today is 2012-11-27. PyCon starts 2013-3-13
print(n)  # 2

1.10 re.escape

re.escape(string)
Escape all the characters in pattern except ASCII letters, numbers and '_'. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
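
A small sketch of why escaping matters when the text to find contains metacharacters. The sample strings are invented, and the exact set of characters that re.escape escapes varies between Python versions, so the example only relies on the escaped pattern matching the literal text.

```python
import re

literal = 'price: $3.50 (approx.)'

# Unescaped, '$', '.', '(' and ')' would be treated as metacharacters.
pattern = re.escape(literal)
found = re.search(pattern, 'The price: $3.50 (approx.) today')
print(found.group() == literal)  # True

# An unescaped '.' matches any character; an escaped one matches only a dot.
loose = bool(re.search('3.50', '3x50'))
strict = bool(re.search(re.escape('3.50'), '3x50'))
print(loose)   # True
print(strict)  # False
```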

1.11 re.purge

re.purge()
Clear the regular expression cache.

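1.12 Commonly used regular expressions

IP address:

^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$

Mobile phone number (11 digits starting with 13, 14, 15 or 18):

^1[3458][0-9]\d{8}$

Email address:

[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+

Replace consecutive whitespace with a single space: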
re.sub(r"[\x00-\x20]+", " ", value).strip()
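
The patterns above can be wrapped in small helper functions. The sketch below does this for the IP pattern and the whitespace replacement; the names `is_ipv4` and `squeeze_spaces` are hypothetical and chosen only for this illustration.

```python
import re

# The IP pattern from above, compiled once for reuse.
IP_PAT = re.compile(
    r'^(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}$'
)

def is_ipv4(s):
    """Return True if s looks like a dotted IPv4 address with octets 0-255."""
    return IP_PAT.match(s) is not None

def squeeze_spaces(value):
    """Collapse runs of spaces and control characters into one space."""
    return re.sub(r'[\x00-\x20]+', ' ', value).strip()

print(is_ipv4('192.168.1.1'))            # True
print(is_ipv4('256.1.1.1'))              # False
print(squeeze_spaces('  a \t b\n\nc '))  # a b c
```

Note that `$` also matches just before a trailing newline, so for strict validation re.fullmatch may be a better fit than `^...$` with re.match.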

 

posted @ 2017-02-07 10:04  hexm
Contact me: xiaoming.unix@gmail.com