Python Regular Expressions
1. Testing tools
https://regex101.com/ is free.
https://www.regexbuddy.com/download.html requires a paid license.
It really is nice to use.
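2. Matching a single character
.     matches any single character except \n
[]    matches any one of the characters listed inside the brackets
\d    matches a digit, 0-9
\D    matches a non-digit, i.e. anything that is not a digit
\s    matches whitespace, e.g. a space, a tab (\t), a newline (\n)
\S    matches non-whitespace
\w    matches a word character, i.e. A-Z, a-z, 0-9, _
\W    matches a non-word character, i.e. not a letter, digit, or underscore
[hr] matches a single h or a single r. [a-h] matches any single character from a through h. [A-Z0-9] matches any single uppercase letter or digit.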
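The character classes above can be tried out quickly with re.findall, which returns every match as a list. This is only a small sketch: the sample strings are made up for illustration, and the patterns use the r prefix (raw strings) that section 10.6 explains.

```python
import re

# Made-up sample text; "\t" here is a real tab character.
text = "Room_42 costs $9.5\tper night"

digits = re.findall(r"\d", text)           # every single digit
word_chars = re.findall(r"\w", "a_1 -!")   # letters, digits, underscore
spaces = re.findall(r"\s", text)           # spaces and the tab
a_to_h = re.findall(r"[a-h]", "hacker")    # only characters a through h

print(digits)      # ['4', '2', '9', '5']
print(word_chars)  # ['a', '_', '1']
print(spaces)      # [' ', ' ', '\t', ' ']
print(a_to_h)      # ['h', 'a', 'c', 'e']
```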
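3. Matching multiple characters
*       matches the preceding character zero or more times, so it may be absent
+       matches the preceding character one or more times, i.e. at least once
?       matches the preceding character zero or one time, i.e. it is either there once or not at all
{m}     matches the preceding character exactly m times
{m,n}   matches the preceding character from m to n times, m <= n
For example
^[a-zA-Z_]+\w*    matches a variable name
[0-9]?[0-9]       matches 0-99
\d{3}             matches three consecutive digits
\d{8,20}          matches 8 to 20 consecutive digits
\.                matches only a literal dot. \ is the escape character; a bare . matches any character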
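To check the quantifier examples above against whole strings, one option is re.fullmatch, which succeeds only if the entire string matches. The helper names is_identifier and is_0_to_99 are invented for this sketch and are not part of the original notes.

```python
import re

def is_identifier(s):
    """Return True if the whole string looks like a variable name."""
    return re.fullmatch(r"[a-zA-Z_]+\w*", s) is not None

def is_0_to_99(s):
    """Return True if the whole string is one or two digits."""
    return re.fullmatch(r"[0-9]?[0-9]", s) is not None

print(is_identifier("_tmp1"))           # True
print(is_identifier("1abc"))            # False, starts with a digit
print(is_0_to_99("42"))                 # True
print(is_0_to_99("100"))                # False, three digits
print(re.fullmatch(r"\d{3}", "1234"))   # None, {3} means exactly three
```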
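4. Matching the start and the end
^           matches at the start of the string; inside square brackets it negates, so [^a] matches any character that is not a
$           matches at the end of the string
^[a-z]\d$   matches a string made of one lowercase letter followed by one digit
[^he]       matches a single character that is neither h nor e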
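A short sketch of the two meanings of ^ and of the $ anchor. The test strings are made up; re.search is used here so that the anchors, not the function, decide where a match may start.

```python
import re

pattern = r"^[a-z]\d$"

full_ok = re.search(pattern, "a1")       # one lowercase letter + one digit
too_long = re.search(pattern, "ab1")     # None: two letters before the digit
wrong_start = re.search(pattern, "A1")   # None: uppercase first character

# Inside [], ^ negates the class instead of anchoring.
not_h_or_e = re.findall(r"[^he]", "hello")

print(full_ok is not None)   # True
print(too_long)              # None
print(wrong_start)           # None
print(not_h_or_e)            # ['l', 'l', 'o']
```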
5. The re module
To match strings against a regular expression in Python, use the re module. It provides match(pattern, string, flags), which tries to match at the start of the string. If the match succeeds it returns a Match object; if not, it returns None.
For example
import re

if __name__ == '__main__':
    # the r prefix (raw string) is explained in section 10.6
    str1 = 'ddddffffff%'
    patobj = re.match(r'\w+', str1)
    # group() returns the matched text, here ddddffffff
    print(patobj.group())
    str2 = 'hekkko@163.com'
    patobj2 = re.match(r'^\w{4,20}@163\.com$', str2)
    print(patobj2.group())

Output:
ddddffffff
hekkko@163.com
6. Grouping: alternation with |
| matches either the expression on its left or the one on its right
# Match 0-100: ^[0-9]?[0-9]$|^100$
import re

if __name__ == '__main__':
    str3 = '100'
    patobj3 = re.match('^[0-9]?[0-9]$|^100$', str3)
    print(patobj3.group())
    str4 = '99'
    patobj4 = re.match('^[0-9]?[0-9]$|^100$', str4)
    print(patobj4.group())

Output:
100
99
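One thing worth noticing about the 0-100 pattern is that it also accepts strings with a leading zero such as 05. If that is not wanted, a stricter variant can be sketched as below. The name strict and the sample list are assumptions made for this illustration only.

```python
import re

loose = r"^[0-9]?[0-9]$|^100$"        # pattern from the text above
strict = r"^(100|[1-9]?[0-9])$"       # hypothetical variant: no leading zero

samples = ["0", "7", "05", "99", "100", "101"]

loose_hits = [s for s in samples if re.match(loose, s)]
strict_hits = [s for s in samples if re.match(strict, s)]

print(loose_hits)    # ['0', '7', '05', '99', '100']
print(strict_hits)   # ['0', '7', '99', '100']
```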
7. Grouping with ()
() treats the enclosed expression as a single unit and matches it as a whole; (ab) makes the characters ab one group.
import re

if __name__ == '__main__':
    str5 = 'helll@163.com'
    patobj5 = re.match(r'^\w{4,20}@(163|126|qq)\.(com)$', str5)
    print(patobj5.group())
    print(patobj5.group(1))  # text matched by the first pair of parentheses
    print(patobj5.group(2))  # text matched by the second pair of parentheses

Output:
helll@163.com
163
com
import re

if __name__ == '__main__':
    str6 = '010-11111111'
    str7 = '1234-2222222'
    patobj6 = re.match(r'(\d{3,4})-(\d{7,8})', str6)
    patobj7 = re.match(r'(\d{3,4})-(\d{7,8})', str7)
    print('area code', patobj6.group(1))
    print('phone number', patobj6.group(2))
    print('area code', patobj7.group(1))
    print('phone number', patobj7.group(2))

Output:
area code 010
phone number 11111111
area code 1234
phone number 2222222
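8. Grouping: backreferences with \
\num refers back to the text matched by the num-th () group. It must match that same text again, not just the same pattern.
\1 refers back to the text matched by the first () group.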
import re

if __name__ == '__main__':
    str8 = "<html>testPattern</html>"
    # \1 refers back to what the group ([a-zA-Z0-9]+) matched.
    # In a normal string it has to be escaped as \\1 so that the regex engine receives \1.
    patobj8 = re.match("<([a-zA-Z0-9]+)>.*</\\1>", str8)
    print(patobj8.group())

Output:
<html>testPattern</html>
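import re

if __name__ == '__main__':
    str9 = "<html><h1>testPattern</h1></html>"
    # patobj9 = re.match("<([a-zA-Z0-9]+)><\\1>.*</\\1></\\1>", str9)
    # The line above does not match. A group can be referenced more than once,
    # but every \1 must match the exact text that group 1 captured ("html"),
    # and the second tag here is "h1".
    # Capturing the two tags in two separate groups does match:
    patobj9 = re.match("<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\\2></\\1>", str9)
    print(patobj9.group())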
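Why the pattern with a single group fails on this input, yet is not wrong in general, can be checked with a small experiment. The second input string is invented for this sketch, and the patterns are written as raw strings (see section 10.6) so that \1 needs no doubling.

```python
import re

html = "<html><h1>testPattern</h1></html>"
one_group = r"<([a-zA-Z0-9]+)><\1>.*</\1></\1>"
two_groups = r"<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\2></\1>"

# \1 must repeat the captured text "html", but the next tag is "h1".
single_ref = re.match(one_group, html)

# The same pattern does match when all four tags are identical.
same_tags = re.match(one_group, "<b><b>x</b></b>")

# With two groups, each backreference repeats its own tag name.
nested = re.match(two_groups, html)

print(single_ref)            # None
print(same_tags.group())     # <b><b>x</b></b>
print(nested.groups())       # ('html', 'h1')
```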
9. Grouping: named groups
import re

if __name__ == '__main__':
    str10 = "<html><h1>testPattern</h1></html>"
    # ?P<name1> gives the group [a-zA-Z0-9]+ the name name1; (?P=name1) refers back to name1
    patobj10 = re.match("<(?P<name1>[a-zA-Z0-9]+)><(?P<name2>[a-zA-Z0-9]+)>.*</(?P=name2)></(?P=name1)>", str10)
    print(patobj10.group())
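Named groups can also be read back by name once the match succeeds. A minimal sketch using the same pattern as above; the variable names outer and inner are chosen just for this example.

```python
import re

pattern = ("<(?P<name1>[a-zA-Z0-9]+)><(?P<name2>[a-zA-Z0-9]+)>"
           ".*</(?P=name2)></(?P=name1)>")
m = re.match(pattern, "<html><h1>testPattern</h1></html>")

outer = m.group("name1")    # look a group up by its name
inner = m.group("name2")
by_name = m.groupdict()     # all named groups as a dict

print(outer, inner)         # html h1
print(by_name)              # {'name1': 'html', 'name2': 'h1'}
print(m.group(1))           # html (named groups keep their numbers too)
```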
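10. Other uses of the re module
10.1 search
re.search('hello','helloxxxx') matches, and re.search('hello','xhelloxxxx') matches as well.
match only succeeds if the pattern matches at the very start of the string: re.match('hello','helloxxxx') matches, but re.match('hello','xhelloxxxx') does not.
search covers more ground than match: it scans through the string and returns the first place where the pattern matches.
import re

if __name__ == '__main__':
    str11 = "dfgsdgad9999dddd"
    patobj11 = re.search(r'\d+', str11)
    print(patobj11.group())  # 9999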
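The difference between match and search can be seen side by side on one string. This is just a sketch; the last line shows that search behaves like match once the pattern itself is anchored with ^.

```python
import re

text = "xhelloxxxx"

at_start = re.match("hello", text)     # only tries position 0
anywhere = re.search("hello", text)    # scans the whole string
anchored = re.search("^hello", text)   # ^ forces the start again

print(at_start)          # None
print(anywhere.span())   # (1, 6)
print(anchored)          # None
```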
10.2 findall() returns a list
import re

if __name__ == '__main__':
    str12 = "dfgsdgad9999dddd, 5555,444,fff666,777"
    patobj12 = re.findall(r'\d+', str12)
    print(patobj12)

Output:
['9999', '5555', '444', '666', '777']
10.3 sub: replacement
sub(pattern, replacement, string)
The return value is the string after replacement.
import re

if __name__ == '__main__':
    str13 = "dfgsdgad9999dddd, 5555,444,fff666,777"
    patobj13 = re.sub(r'\d+', '10000', str13)
    print(patobj13)

Output:
dfgsdgad10000dddd, 10000,10000,fff10000,10000

Three double quotes define a multi-line string:
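Two more things sub can do, sketched on the same sample string: the optional count argument limits how many replacements are made, and the replacement text may refer to a group with \1. Neither appears in the original notes, so treat this as an illustration rather than part of the author's walkthrough.

```python
import re

text = "dfgsdgad9999dddd, 5555,444,fff666,777"

all_replaced = re.sub(r"\d+", "10000", text)          # replace every run of digits
first_only = re.sub(r"\d+", "10000", text, count=1)   # replace only the first one
bracketed = re.sub(r"(\d+)", r"[\1]", text)           # \1 = text matched by group 1

print(all_replaced)  # dfgsdgad10000dddd, 10000,10000,fff10000,10000
print(first_only)    # dfgsdgad10000dddd, 5555,444,fff666,777
print(bracketed)     # dfgsdgad[9999]dddd, [5555],[444],fff[666],[777]
```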
strhtml = """
<p>1 src="vender.e349f038.js"</p>
<script type="text/javascript" src="runtime~app.e349f038.js"></script>
<div class="reminders Football close" style="top: 0px; left: 0px;"></div>
<script type="text/javascript" src="runtime~app.e349f038.js"></script>
<p>5 type="text/javascript" src="app.e349f038.js"</p>
"""
10.4 split
import re

if __name__ == '__main__':
    str14 = "hi:hello icsics open,world"
    patobj14 = re.split(':|,| ', str14)
    print(patobj14)

Output:
['hi', 'hello', 'icsics', 'open', 'world']
10.5 Greedy and non-greedy
By default a quantifier is greedy and matches as much as it can. Adding ? after *, ?, + or {} switches it to non-greedy mode, where it matches as little as it can.
import re

if __name__ == '__main__':
    str14 = "aaa123456"
    patobj14 = re.match(r'aaa\d+?', str14)
    print(patobj14)

Output:
<re.Match object; span=(0, 4), match='aaa1'>

import re

if __name__ == '__main__':
    str14 = "aaa123456"
    patobj14 = re.match(r'aaa(\d+?)', str14)
    print(patobj14)

Output:
<re.Match object; span=(0, 4), match='aaa1'>

import re

if __name__ == '__main__':
    str14 = "aaa123456"
    # Inside [] the + and ? are literal characters, not quantifiers:
    # [\d+?] matches exactly one character that is a digit, a + or a ?.
    patobj14 = re.match(r'aaa[\d+?]', str14)
    print(patobj14)

Output:
<re.Match object; span=(0, 4), match='aaa1'>
import re

if __name__ == '__main__':
    str15 = '< span id = "tournament" class ="tab-title-label" > Tournament < / span >'
    # The ? after .* makes the match non-greedy, so it stops at the first closing quote
    # and only the value of id is taken.
    patobj15 = re.search('id = ".*?"', str15)
    # Writing it as below does not do the same thing: there the ? applies to the closing
    # quote (making the quote optional) while .* stays greedy and runs to the end of the string.
    # patobj15 = re.search('id = ".*"?', str15)
    print(patobj15)

Output:
<re.Match object; span=(7, 24), match='id = "tournament"'>
import re

if __name__ == '__main__':
    str15 = '< span id = "tournament" class ="tab-title-label" > Tournament < / span >'
    patobj15 = re.search('id = "(.*?)"', str15)
    print(patobj15.group(1))

Output:
tournament
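Where the ? is placed decides what it modifies. The sketch below compares three variants on a shortened, made-up version of the tag: a greedy .*, a non-greedy .*?, and a ? placed after the closing quote.

```python
import re

text = 'id = "tournament" class ="tab-title-label" >'

# Greedy: .* runs as far as it can, then backs up to the LAST quote.
greedy = re.search(r'id = ".*"', text).group()

# Non-greedy: .*? stops at the FIRST closing quote.
lazy = re.search(r'id = ".*?"', text).group()

# Here ? makes the closing quote optional; .* is still greedy,
# so it swallows the rest of the string.
optional_quote = re.search(r'id = ".*"?', text).group()

print(greedy)          # id = "tournament" class ="tab-title-label"
print(lazy)            # id = "tournament"
print(optional_quote)  # id = "tournament" class ="tab-title-label" >
```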
10.6 What the r prefix does
The r prefix (raw string) makes \ an ordinary backslash in the string literal, so Python itself does not treat it as an escape character.
import re

if __name__ == '__main__':
    str16 = "<html><h1>testPattern</h1></html>"
    # without r, write \\2 and \\1
    patobj16 = re.match("<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\\2></\\1>", str16)
    # with r, write \2 and \1; only Python's own escaping of \ goes away,
    # regex escapes such as \. are written the same as before
    patobj16r = re.match(r"<([a-zA-Z0-9]+)><([a-zA-Z0-9]+)>.*</\2></\1>", str16)
    print(patobj16.group())
    print(patobj16r.group())

Output:
<html><h1>testPattern</h1></html>
<html><h1>testPattern</h1></html>
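What the r prefix changes can be checked directly on the strings, before any regex is involved. A small sketch with made-up variable names: without r and without doubling, Python reads "\1" as a single control character, which is why the backreference silently stops working.

```python
import re

doubled = "\\1"   # normal string, backslash escaped by hand
raw = r"\1"       # raw string, backslash kept as typed
octal = "\1"      # normal string, NOT doubled: an octal escape

print(doubled == raw)      # True, both are backslash + "1"
print(len(raw))            # 2
print(octal == chr(1))     # True, a single control character

# The regex engine only sees a backreference in the first two forms.
with_raw = re.match(r"(a)\1", "aa")
with_octal = re.match("(a)\1", "aa")

print(with_raw.group())    # aa
print(with_octal)          # None
```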
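I wrote this in my spare time over six days. It is finished now and should be fairly complete; if anything is missing, please leave a comment.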