【362】Python Regular Expressions

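A regular expression defines a rule for strings; a string that satisfies the rule is said to "match".
Regular expressions are themselves written as strings, so you need to know how to describe characters using characters.

re.match tries to match a pattern at the start of the string; if the pattern does not match there, match() returns None.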

https://docs.python.org/3/library/re.html#re.match

https://docs.python.org/3/library/re.html#match-objects

re.match(r"\d+\.\d+,\d+\.\d+", "103.2465406,26.405230西南门")
# <re.Match object; span=(0, 21), match='103.2465406,26.405230'>

re.match(r"\d+\.\d+,\d+\.\d+", "103.2465406,26.405230西南门").span()
# (0, 21)

re.match(r"\d+\.\d+,\d+\.\d+", "103.2465406,26.405230西南门").group()
# '103.2465406,26.405230'

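Besides simply testing whether a string matches, regular expressions can also extract substrings. Each pair of parentheses () marks a group to extract. For example:

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be pulled straight out of the matched string.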
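As a minimal sketch of this kind of group extraction, the snippet below applies the pattern to the sample number '010-12345'. The names PHONE, area and local are illustrative and not part of the original post.

```python
import re

# Two capturing groups: a 3-digit area code and a 3-8 digit local number.
PHONE = re.compile(r"^(\d{3})-(\d{3,8})$")

m = PHONE.match("010-12345")
whole = m.group(0)    # the entire match
area = m.group(1)     # text captured by the first pair of parentheses
local = m.group(2)    # text captured by the second pair of parentheses
print(whole, area, local)   # 010-12345 010 12345

# A string that does not fit the pattern yields None instead of a match object.
no_match = PHONE.match("010 12345")
print(no_match)             # None
```

Because a failed match returns None, it is usually safest to test the result before calling group() on it.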

re.match(pattern, string, flags=0)

test = 'string entered by the user'
if re.match(r'regular expression', test):
    print('ok')
else:
    print('failed')

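re.search scans the whole string and returns the first successful match.

re.search(pattern, string, flags=0)
span(): returns the (start, end) index range of the match
group(): returns the matched text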

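re.compile compiles a pattern into a regular expression object, whose match(), search() and other methods can then be called.

The two snippets below have the same effect:

prog = re.compile(pattern)
result = prog.match(string)

# is equivalent to

result = re.match(pattern, string)

The main benefit is that the compiled object can be reused, which is more efficient when the same pattern is applied many times, much like factoring code into a function.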
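A small sketch of that reuse, assuming a list of made-up sample strings: the pattern is compiled once and the resulting object is applied to each input in turn.

```python
import re

# Compile once, then reuse the pattern object for several inputs.
coord = re.compile(r"\d+\.\d+,\d+\.\d+")

# Hypothetical inputs, chosen only for illustration.
samples = ["103.2465406,26.405230西南门", "no numbers here", "1.5,2.5"]
results = [bool(coord.match(s)) for s in samples]
print(results)   # [True, False, True]
```

The re module also keeps an internal cache of recently compiled patterns, so in small scripts the gain from re.compile is often more about readability than speed.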

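re.sub replaces the matches of a pattern in a string.

re.sub(pattern, repl, string, count=0, flags=0)
re.match only matches at the beginning of the string: if the start of the string does not fit the pattern, the match fails and the function returns None. re.search, by contrast, scans through the whole string until it finds a match.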
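The difference between the two functions can be seen on a string whose digits are not at the start. This is only an illustrative sketch; the sample string is made up.

```python
import re

text = "abc123"

at_start = re.match(r"\d+", text)    # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)   # finds the digits further along

print(at_start)            # None
print(anywhere.group())    # 123
print(anywhere.span())     # (3, 6)
```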

ref: Python re.sub Examples

☀☀☀<< Example >>☀☀☀

import re
name = "alex@bingnan#is a good boy!!!! Hahahaha-?-=-=-=+_+_+_+_+$%$%#@#@#$!#@)(!$&)*#(@)*$#(@467749237492365)"
# Replace every non-letter character with a space
name_alpha = re.sub("[^a-zA-Z]", " ", name)
print(name_alpha)
# Eliminate duplicate whitespaces
print(re.sub(r"\s+", " ", name_alpha))

# output (trailing whitespace not shown)
# alex bingnan is a good boy     Hahahaha
# alex bingnan is a good boy Hahahaha

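Python's re module provides re.sub for replacing the matches of a pattern in a string.

The compile function compiles a regular expression into a Pattern object, whose methods include match() and search().

findall finds every substring that the regular expression matches and returns them as a list; if nothing matches, it returns an empty list.

Note: match and search return at most one match, while findall returns all of them.

finditer is similar to findall: it finds every substring that the regular expression matches, but returns the matches as an iterator.

split splits the string at each substring that matches the pattern and returns the pieces as a list.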
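A short sketch comparing the three functions on one made-up string may help; the variable names are illustrative only.

```python
import re

text = "a1b22c333"

all_numbers = re.findall(r"\d+", text)                               # list of matched strings
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]   # match objects, one by one
pieces = re.split(r"\d+", text)                                      # text between the matches

print(all_numbers)   # ['1', '22', '333']
print(spans)         # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
print(pieces)        # ['a', 'b', 'c', '']
```

finditer is the one to reach for when the positions of the matches are needed, since findall returns plain strings only. The trailing empty string from split appears because the text ends with a match.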

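\d matches one digit;

\d matches any digit, while \D matches any nondigit.

\w matches one letter, digit or underscore;

\w matches any character that can be part of a word (Python identifier), that is, a letter, the underscore or a digit, while \W matches any other character.

\W matches any character that is not a letter, digit or underscore;

\s matches one whitespace character (including Tab, newline and other whitespace);

\s matches any space, while \S matches any nonspace character.

. matches any single character (except a newline, by default);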
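The character classes above can be tried out with re.findall. The sample strings here are invented for illustration.

```python
import re

sample = "Tel: 010_12345 ok\t!"

digits = re.findall(r"\d", sample)       # every single digit
non_word = re.findall(r"\W", sample)     # anything that is not a letter, digit or underscore
whitespace = re.findall(r"\s", sample)   # the two spaces and the tab
with_dot = re.findall(r"a.c", "abc a-c a\nc")   # the dot does not match the newline

print(digits)       # ['0', '1', '0', '1', '2', '3', '4', '5']
print(non_word)     # [':', ' ', ' ', '\t', '!']
print(whitespace)   # [' ', ' ', '\t']
print(with_dot)     # ['abc', 'a-c']
```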

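* matches zero or more repetitions (>=0) of the preceding element, which is either the single character before it or a group enclosed in parentheses.

+ matches one or more repetitions (>=1) of the preceding element. For example, \s matches one whitespace character (including Tab and other whitespace), so \s+ means at least one whitespace character, e.g. ' ', '  ' and so on.

# The leftmost match wins, and here it is zero characters long
>>> re.search(r'\d*', 'a123456b')
<re.Match object; span=(0, 0), match=''>

# The match is as long as possible
>>> re.search(r'\d\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>

>>> re.search(r'\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>

# After the first digit, digits are consumed two at a time
>>> re.search(r'\d(\d\d)*', 'a123456b')
<re.Match object; span=(1, 6), match='12345'>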

+ matches one or more repetitions (>=1) of the preceding element (the single character before it, or a group enclosed in parentheses).

>>> re.search(r'.\d+', 'a123456b')
<re.Match object; span=(0, 7), match='a123456'>

>>> re.search(r'(.\d)+', 'a123456b')
<re.Match object; span=(0, 6), match='a12345'>

? matches zero or one occurrence of the preceding element (the single character before it, or a group enclosed in parentheses).

>>> re.search(r'\s(\d\d)?\s', 'a 12 b')
<re.Match object; span=(1, 5), match=' 12 '>

>>> re.search(r'\s(\d\d)?\s', 'a  b')
<re.Match object; span=(1, 3), match='  '>

>>> re.search(r'\s(\d\d)?\s', 'a 1 b')
# Nothing is printed: the match failed and None was returned

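[ ] matches any one of the characters listed inside it. Characters that normally need escaping do not need it inside the brackets; for example, [.] matches a literal dot.

>>> re.search('[.]', 'abcabc.123456.defdef')

>>> # Each step matches any one of the characters inside the brackets
>>> re.search('[cba]+', 'abcabc.123456.defdef')

>>> re.search(r'.[\d]*', 'abcabc.123456.defdef')

>>> re.search(r'\.[\d]*', 'abcabc.123456.defdef')

>>> re.search(r'[.\d]+', 'abcabc.123456.defdef')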

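{n} matches exactly n repetitions of the preceding element; for example, \d{3} matches three digits, such as '010'.

{n,m} matches between n and m repetitions of the preceding element; for example, \d{3,8} matches 3 to 8 digits, such as '1234567'.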
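The counted quantifiers can be checked with a few made-up strings, as in this sketch:

```python
import re

three = re.search(r"\d{3}", "a12345b").group()                 # exactly three digits
three_to_eight = re.search(r"\d{3,8}", "a1234567b").group()    # as many as possible, up to eight
too_few = re.search(r"\d{3}", "a12b")                          # only two digits available
capped = re.search(r"\d{3,8}", "1234567890").group()           # stops after eight digits

print(three)            # 123
print(three_to_eight)   # 1234567
print(too_few)          # None
print(capped)           # 12345678
```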

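[0-9a-zA-Z_] matches one digit, letter or underscore;

[0-9a-zA-Z_]+ matches a string made up of at least one digit, letter or underscore, such as 'a100', '0_Z', 'Py3000' and so on;

[a-zA-Z_][0-9a-zA-Z_]* matches a letter or underscore followed by any number of digits, letters or underscores, i.e. a valid (ASCII) Python variable name;

[a-zA-Z_][0-9a-zA-Z_]{0,19} restricts the length of the variable name more precisely to 1-20 characters (1 leading character plus at most 19 more). The quantifier must be written without a space after the comma.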
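As a rough sketch, the length-limited variable-name pattern can be tested with re.fullmatch, which requires the whole string to match. The candidate names are invented, and the pattern covers only ASCII names, whereas real Python identifiers may also contain Unicode letters and have no length limit.

```python
import re

# One letter or underscore, then at most 19 more letters, digits or underscores.
IDENT = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")

# Hypothetical candidates: the last two should be rejected.
names = ["a100", "_tmp", "Py3000", "0_Z", "x" * 21]
valid = [n for n in names if IDENT.fullmatch(n)]
print(valid)   # ['a100', '_tmp', 'Py3000']
```

'0_Z' fails because it starts with a digit, and the 21-character name fails because it exceeds the 20-character limit.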

- denotes a range inside [ ]; a hyphen that sits right next to one of the brackets is treated as a literal hyphen.
Example: how would you match a number such as '010-12345'? Outside square brackets '-' is not a special character, so \d{3}-\d{3,8} is enough; escaping it as \d{3}\-\d{3,8} also works but is not required.
Ranges of letters or digits can be provided within square brackets, letting a hyphen separate the first and last characters in the range. A hyphen placed after the opening square bracket or before the closing square bracket is interpreted as a literal character:

>>> re.search('[e-h]+', 'ahgfea')

>>> re.search('[B-D]+', 'ABCBDA')

>>> re.search('[4-7]+', '154465571')

>>> re.search('[-e-gb]+', 'a--bg--fbe--z')

>>> re.search('[73-5-]+', '14-34-576')

^ inside [ ], placed right after the opening bracket, excludes the characters that follow it: the class matches any character other than those.

Within square brackets, a caret placed after the opening square bracket excludes the characters that follow within the brackets:

>>> re.search('[^4-60]+', '0172853')

>>> re.search('[^-u-w]+', '-stv')

A|B matches either A or B, so (P|p)ython matches 'Python' or 'python'.

Whereas square brackets surround alternative characters, a vertical bar separates alternative patterns:

>>> re.search('two|three|four', 'one three two')

>>> re.search('|two|three|four', 'one three two')

>>> re.search('[1-3]+|[4-6]+', '01234567')

>>> re.search('([1-3]|[4-6])+', '01234567')

>>> re.search(r'_\d+|[a-z]+_', '_abc_def_234_')

>>> re.search(r'_(\d+|[a-z]+)_', '_abc_def_234_')

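^ matches the start of the string (and the start of each line when re.MULTILINE is set); ^\d means the string must start with a digit.

$ matches the end of the string (and the end of each line when re.MULTILINE is set); \d$ means the string must end with a digit.

A caret at the beginning of the pattern string matches the beginning of the data string; a dollar at the end of the pattern string matches the end of the data string:

>>> re.search(r'\d*', 'abc')

>>> re.search(r'^\d*', 'abc')

>>> re.search(r'\d*$', 'abc')

>>> re.search(r'^\d*$', 'abc')

>>> re.search(r'^\s*\d*\s*$', ' 345 ')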

To match ^ or $ as an ordinary character, put a backslash in front of it. Inside a character class, ^ is already an ordinary character unless it comes right after the opening bracket.

Escaping a dollar at the end of the pattern string, or escaping a caret at the beginning of the pattern string or after the opening square bracket of a character class, makes the dollar and the caret lose the special meaning they have in those contexts and lets them be treated as literal characters:

>>> re.search(r'\$', '$*')

>>> re.search(r'\^', '*^')

>>> re.search(r'[\^]', '^*')

>>> re.search('[^^]', '^*')

^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

group(0): always the whole matched substring;
group(1): the substring matched by the 1st group;
group(2): the substring matched by the 2nd group, and so on.

Group numbering: groups are numbered in the order of their opening parentheses.

Parentheses allow matched parts to be saved. The object returned by re.search() has a group() method that, without arguments, returns the whole match and, with arguments, returns partial matches; it also has a groups() method that returns all partial matches:

>>> R = re.search(r'((\d+) ((\d+) \d+)) (\d+ (\d+))',
...               '  1 23 456 78 9 0 '
...              )

>>> R

>>> R.group()
'1 23 456 78 9'

>>> R.groups()
('1 23 456', '1', '23 456', '23', '78 9', '9')

>>> [R.group(i) for i in range(len(R.groups()) + 1)]
['1 23 456 78 9', '1 23 456', '1', '23 456', '23', '78 9', '9']

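?: placed right after an opening parenthesis makes the group non-capturing: the parentheses still group the alternatives, but they are not counted as a group.

>>> R = re.search(r'([+-]?(?:0|[1-9]\d*)).*([+-]?(?:0|[1-9]\d*))',
...               ' a = -3014, b = 0 '
...              )

>>> R

>>> R.groups()
('-3014', '0')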

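.* matches any run of characters, including an empty one, that contains no newline (\n). In Python's re module the dot excludes only \n by default, so \r is matched like any other character.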
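A quick sketch of how .* stops at a newline by default and how the re.DOTALL flag changes that. The sample text is made up.

```python
import re

text = "first line\nsecond line"

default = re.match(r".*", text).group()                    # stops before the newline
dotall = re.match(r".*", text, flags=re.DOTALL).group()    # the dot now matches the newline too
empty = re.match(r".*", "").group()                        # .* may also match nothing at all

print(repr(default))   # 'first line'
print(repr(dotall))    # 'first line\nsecond line'
print(repr(empty))     # ''
```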

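Pattern reference

^ matches the start of the string.

$ matches the end of the string, or just before a newline at the end of the string.

. matches any character except a newline; when the re.DOTALL flag is given, it matches any character including a newline.

[...] denotes a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'.

[^...] matches characters not in the set: [^abc] matches any character other than a, b and c.

re* matches zero or more occurrences of the preceding expression.

re+ matches one or more occurrences of the preceding expression.

re? matches zero or one occurrence of the preceding expression. It is greedy; appending another ? to a quantifier (re??, re*?, re+?) makes it non-greedy.

re{n} matches exactly n occurrences of the preceding expression. For example, "o{2}" cannot match the "o" in "Bob", but it matches the two o's in "food".

re{n,} matches at least n occurrences of the preceding expression. For example, "o{2,}" cannot match the "o" in "Bob", but it matches all the o's in "foooood". "o{1,}" is equivalent to "o+" and "o{0,}" is equivalent to "o*".

re{n,m} matches between n and m occurrences of the preceding expression, greedily.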
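Greedy and non-greedy matching are easiest to see side by side. This is a minimal sketch with an invented sample string; it also replays the "Bob" / "food" case for a counted quantifier.

```python
import re

# Greedy quantifiers take as much as they can; a trailing ? makes them lazy.
text = "a[1]b[2]c"
greedy = re.search(r"\[.+\]", text).group()
lazy = re.search(r"\[.+?\]", text).group()
print(greedy)   # [1]b[2]
print(lazy)     # [1]

# Counted repetition: two consecutive o's are required.
in_bob = re.search(r"o{2}", "Bob")
in_food = re.search(r"o{2}", "food").group()
print(in_bob, in_food)   # None oo
```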

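a|b matches a or b.

(re) matches the expression inside the parentheses and also defines a group.

(?imx) turns on the optional flags i, m or x. In Python these inline flags apply to the whole pattern and should be placed at its start.

(?-imx) turns off the optional flags i, m or x. Python's re module does not accept this bare form; use the scoped form (?-imx:re) instead.

(?:re) like (...), but does not define a group.

(?imx:re) turns on the optional flags i, m or x inside the parentheses only.

(?-imx:re) turns off the optional flags i, m or x inside the parentheses only.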
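The following sketch contrasts an inline flag for the whole pattern with a flag scoped to one group, assuming Python 3.6 or later (where the scoped form is available). The test strings are made up.

```python
import re

# Inline flag at the start: the whole pattern ignores case.
whole = re.fullmatch(r"(?i)python", "PyThOn")

# Scoped flag: only the part inside (?i:...) ignores case.
scoped_ok = re.fullmatch(r"(?i:py)thon", "PYthon")
scoped_fail = re.fullmatch(r"(?i:py)thon", "PYTHON")

print(bool(whole), bool(scoped_ok), bool(scoped_fail))   # True True False
```

Passing flags=re.IGNORECASE to the function has the same effect as a leading (?i) and is often easier to read.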

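(?#...) a comment.

(?=re) positive lookahead assertion. It succeeds if the contained expression matches at the current position and fails otherwise. The assertion consumes no characters: once it has been tried, the matching engine has not advanced, and the rest of the pattern is tried from the same position.

(?!re) negative lookahead assertion. The opposite of the positive assertion: it succeeds when the contained expression does not match at the current position.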
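A small sketch of both lookahead forms, using a sample string chosen for illustration. The spans show that the text inside the assertion is checked but not consumed.

```python
import re

names = "Isaac Asimov, Isaac Newton"

# "Isaac " only when it is followed by "Asimov".
followed = [m.span() for m in re.finditer(r"Isaac (?=Asimov)", names)]
# "Isaac " only when it is NOT followed by "Asimov".
not_followed = [m.span() for m in re.finditer(r"Isaac (?!Asimov)", names)]

print(followed)       # [(0, 6)]
print(not_followed)   # [(14, 20)]
```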

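(?>re) atomic group: matches re as an independent pattern, without backtracking into it. Python's re module supports it from version 3.11.

\w matches a letter, digit or underscore.

\W matches any character that is not a letter, digit or underscore.

\s matches any whitespace character; for ASCII text this is equivalent to [ \t\n\r\f\v].

\S matches any non-whitespace character.

\d matches any decimal digit; for ASCII text this is equivalent to [0-9].

\D matches any non-digit.

\A matches the start of the string.

\Z matches the end of the string. In Python's re module it matches only at the very end, not before a trailing newline as in some other regex flavours.

\z matches the end of the string. Older versions of Python's re module do not accept it; use \Z there.

\G matches the position where the last match finished. Python's re module does not support it.

\b matches a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B matches a position that is not a word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".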
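The "never" / "verb" case can be reproduced directly, as in this sketch. Raw strings matter here: in an ordinary Python string literal "\b" is the backspace character.

```python
import re

end_of_word = re.search(r"er\b", "never")    # 'er' followed by the end of the word
inside_word = re.search(r"er\b", "verb")     # 'er' is followed by 'b', so no boundary
not_boundary = re.search(r"er\B", "verb")    # \B succeeds exactly where \b fails

print(end_of_word.span())    # (3, 5)
print(inside_word)           # None
print(not_boundary.span())   # (1, 3)

# Without the r prefix the pattern contains a backspace, not a word boundary.
no_raw = re.search("er\b", "never")
print(no_raw)                # None
```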

\n, \t, etc. match a newline, a tab, and so on.

\1...\9 matches the contents of the n-th group.
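A backreference can be sketched with the classic doubled-word check. The sentence is made up, and \1 stands for whatever the first group captured.

```python
import re

text = "Paris in the the spring"

# (\w+) captures a word; \1 requires the same word again after one space.
doubled = re.search(r"\b(\w+) \1\b", text)
print(doubled.group())    # the the
print(doubled.group(1))   # the

# In the replacement string, \1 inserts the captured word once.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", text)
print(fixed)              # Paris in the spring
```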

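\10 matches the contents of the 10th group if the pattern has one. Some regex flavours otherwise read it as an octal character code; Python's re module treats a numeric escape as octal only when it starts with 0 or is three octal digits long.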


Examples:

\d{3}: matches 3 digits

\s+: at least one whitespace character

\d{3,8}: 3-8 digits

>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'

>>> mySent.split(' ')
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']

>>> import re

>>> # \W+ rather than \W*: since Python 3.7, re.split also splits on empty matches
>>> listOfTokens = re.split(r'\W+', mySent)

>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']

>>> [tok for tok in listOfTokens if len(tok) > 0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']

>>> [tok.lower() for tok in listOfTokens if len(tok) > 2]
['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']

Reference: python爬虫（5）--正则表达式 - 小学森也要学编程 - 博客园

Deleting the text inside quotation marks; note that 【.*】 is used to match arbitrary text

a = 'Sir Nina said: \"I am a Knight,\" but I am not sure'
b = "Sir Nina said: \"I am a Knight,\" but I am not sure"
print(re.sub(r'"(.*)"', '', a),
      re.sub(r'"(.*)"', '', b), sep='\n')

Output:
Sir Nina said:  but I am not sure
Sir Nina said:  but I am not sure
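One caveat worth sketching: .* is greedy, so when a string contains more than one quoted passage it removes everything from the first quotation mark to the last. The sentence below is made up to show the difference from the non-greedy form .*?.

```python
import re

text = 'He said "yes" and then "no" loudly'

greedy = re.sub(r'".*"', "", text)    # removes from the first quote to the last quote
lazy = re.sub(r'".*?"', "", text)     # removes each quoted passage separately

print(greedy)   # He said  loudly
print(lazy)     # He said  and then  loudly
```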

Example from Eric Martin's learning materials of COMP9021

The following function checks that its argument is a string:

that from the beginning: ^
consists of possibly some spaces: ␣*
followed by an opening parenthesis: \(
possibly followed by spaces: ␣*
possibly followed by either + or -: [+-]?
followed by either 0, or a nonzero digit followed by any sequence of digits: 0|[1-9]\d*
possibly followed by spaces: ␣*
followed by a comma: ,
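followed by characters matching the same pattern as the one described before the comma: possibly some spaces, an optionally signed number, and possibly some spaces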
followed by a closing parenthesis: \)
possibly followed by some spaces: ␣*
all the way to the end: $

Pairs of parentheses surround both numbers in order to capture them. For the alternation 0|[1-9]\d*, a surrounding pair of parentheses is needed; ?: makes it non-capturing:

>>> def validate_and_extract_payoffs(provided_input):
    pattern = r'^ *\( *([+-]?(?:0|[1-9]\d*)) *,' \
              r' *([+-]?(?:0|[1-9]\d*)) *\) *$'
    match = re.search(pattern, provided_input)
    if match:
        return match.groups()


>>> validate_and_extract_payoffs('(+0, -7 )')
('+0', '-7')

>>> validate_and_extract_payoffs('  (-3014,0)  ')
('-3014', '0')

posted on 2019-01-27 08:24 McDelfino
