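8. Regular Expressions

Regular Expressions

Regular expressions are a technique independent of any single programming language. The core syntax is largely the same everywhere, although individual engines differ in some details.

Steps

Start from a large body of text
Identify the pattern in it
Write a regular expression that follows the pattern

Syntax

A plain string is itself a regular expression: it matches its own literal text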

import re

s1 = '博主讲的太好了！已经三连xiaohu加关注，求课件！我的邮箱是 1234214@qq.com, 或者xiaohu是 3255@163.com 或者是xiaohu微信手机号 18356781451'
res1 = re.findall('xiaohu', s1)
print(res1) #['xiaohu', 'xiaohu', 'xiaohu']

[] defines a character class: it matches any one of the characters listed inside the brackets

s1 = '博主讲的太好了！已经三连xiaohuq加关注，求课件！我的邮箱是 1234214@qq.com, 或者xiaohuw是 3255@163.com 或者是xiaohup微信手机号 18356781451'
res1 = re.findall(r'xiaohu[qwp]',s1)
print(res1) # ['xiaohuq', 'xiaohuw', 'xiaohup']

Ranges

[a-z] matches any one lowercase letter from a to z

s1 = '博主讲的太好了！已经三连xiaohuq加关注，求课件！我的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuw微信手机号 18356781451'
res1 = re.findall(r'xiaohu[a-z]',s1)
print(res1) #['xiaohuq', 'xiaohuw']

[A-Za-z]

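[A-Za-z] matches any one ASCII letter. It is tempting to shorten this to [A-z], but the ASCII codes of A-Z (65-90) and a-z (97-122) are not contiguous: the six characters [ \ ] ^ _ ` (91-96) lie between them, so [A-z] matches those as well. The two patterns give the same result below only because none of those characters follows xiaohu in the sample.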

s1 = '博主讲的太好了！已经三连xiaohuq加关注，求课件！我xiaohuA的邮箱是 1234214@qq.com, 或者xiaohu53255@163.com 或xiaohu8者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu[A-Za-z]', s1)
print(res1) # ['xiaohuq', 'xiaohuA', 'xiaohuU']
res2 = re.findall(r'xiaohu[A-z]', s1)
print(res2)# ['xiaohuq', 'xiaohuA', 'xiaohuU']
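To check how [A-z] and [A-Za-z] differ, here is a minimal sketch. The sample string is made up for illustration and is not from the example above.

```python
import re

# Characters with ASCII codes 91-96 sit between 'Z' (90) and 'a' (97).
between = ''.join(chr(code) for code in range(91, 97))
print(between)  # [\]^_`

# Hypothetical sample: two names end in punctuation, two in letters.
sample = 'xiaohu_ xiaohu^ xiaohuA xiaohuq'
loose = re.findall(r'xiaohu[A-z]', sample)
strict = re.findall(r'xiaohu[A-Za-z]', sample)
print(loose)   # ['xiaohu_', 'xiaohu^', 'xiaohuA', 'xiaohuq']
print(strict)  # ['xiaohuA', 'xiaohuq']
```

When only letters are wanted, [A-Za-z] is the safer spelling.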

[0-9]

s1 = '博主讲的太好了！已经三连xiaohuq加关注，求课件！我xiaohuA的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu[0-9]', s1)
print(res1) # ['xiaohu5', 'xiaohu8']

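[0-z] covers every code point from 0 (48) to z (122), so besides digits and letters it also matches the punctuation in between, such as : ; < = > ? @ and [ \ ] ^ _ `. In the example below, xiaohu= is matched for that reason.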

s1 = '博主讲的太好了！已经三连xiaohu=加关注，求课件！我xiaohuA的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu8者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu[0-z]', s1)
print(res1) # ['xiaohu=', 'xiaohuA', 'xiaohu5', 'xiaohu8', 'xiaohuU']
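The extra characters that [0-z] lets through can be listed directly. The sketch below is only an illustration; the short sample string is an assumption, not part of the original example.

```python
import re

# Every character accepted by [0-z] that is neither a digit nor an ASCII letter.
extras = ''.join(
    chr(code)
    for code in range(ord('0'), ord('z') + 1)
    if not re.fullmatch(r'[0-9A-Za-z]', chr(code))
)
print(extras)  # :;<=>?@[\]^_`

# Spelling out the three ranges keeps the punctuation out.
sample = 'xiaohu= xiaohu5 xiaohuA'
alnum_only = re.findall(r'xiaohu[0-9A-Za-z]', sample)
print(alnum_only)  # ['xiaohu5', 'xiaohuA']
```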

\d matches a single digit

s1 = '博主讲的太好了！已经三连xiaohuq加关注，求课件！我xiaohuA的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu\d\d', s1)
print(res1) # ['xiaohu89']
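One detail worth knowing: for str patterns in Python 3, \d matches any Unicode decimal digit, not only 0-9. The sketch below uses a made-up string containing a full-width digit (U+FF17) and an Arabic-Indic digit (U+0663) to show the difference between \d, \d with re.ASCII, and [0-9].

```python
import re

# Hypothetical sample with three kinds of digit characters.
mixed = 'ascii 7, full-width \uff17, Arabic-Indic \u0663'

unicode_digits = re.findall(r'\d', mixed)            # all three digits
ascii_digits = re.findall(r'\d', mixed, re.ASCII)    # only '7'
bracket_digits = re.findall(r'[0-9]', mixed)         # only '7'

print(len(unicode_digits), ascii_digits, bracket_digits)  # 3 ['7'] ['7']
```

If the input may contain such characters and only 0-9 should count, [0-9] or the re.ASCII flag is the stricter choice.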

? means the preceding element occurs 0 or 1 times

s1 = '博主讲的太好xiaohu了！已经三连xiaohuq加关注，求课件！我xiaohu124453的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu\d?', s1)
print(res1) #['xiaohu', 'xiaohu', 'xiaohu1', 'xiaohu5', 'xiaohu8', 'xiaohu']

+ means the preceding element occurs 1 or more times

s1 = '博主讲的太好xiaohu了！已经三连xiaohuq加关注，求课件！我xiaohu124453的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu\d+', s1)
print(res1) # ['xiaohu124453', 'xiaohu5', 'xiaohu89']

* means the preceding element occurs 0 or more times

s1 = '博主讲的太好xiaohu了！已经三连xiaohuq加关注，求课件！我xiaohu124453的邮箱是 1234214@qq.com, 或者xiaohu5是 3255@163.com 或xiaohu89者是xiaohuU微信手机号 18356781451'
res1 = re.findall(r'xiaohu\d*', s1)
print(res1) # ['xiaohu', 'xiaohu', 'xiaohu124453', 'xiaohu5', 'xiaohu89', 'xiaohu']

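Repetition counts

{m,n} gives a range for the number of repetitions: at least m and at most n. Matching is greedy, so sj0000001 below yields sj000000 (the first six digits).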

s1 = '有一个同学的学号为sj2, 另一个同学的学号为sj0000001, 还有一个同学的学号为sj3101, 还有一个学生：sj322010'
res1 = re.findall(r'sj\d{2,6}', s1) 
print(res1) # ['sj000000', 'sj3101', 'sj322010']

{m,} means at least m repetitions, with no upper limit

s1 = '有一个同学sj9的学号为sj11, 另一个同学的学号为sj001, 还有一个同学的学号为sj3101, 还有一个学生：sj322010, 新来的学生学号为：sj34567809'
res1 = re.findall(r'sj\d{2,}', s1)
print(res1) # ['sj11', 'sj001', 'sj3101', 'sj322010', 'sj34567809']

{m} means exactly m repetitions

s1 = '有一个同学sj9的学号为sj331001, 另一个同学的学号为sj32100, 还有一个同学的学号为sj3101, 还有一个学生：sj322010, 新来的学生学号为：sj34567809'
res1 = re.findall(r'sj\d{6}', s1)
print(res1) # ['sj331001', 'sj322010', 'sj345678']
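The shorthand quantifiers ?, + and * can each be written with braces as {0,1}, {1,} and {0,}. A small sketch, using a made-up test string, that checks the two spellings return the same matches:

```python
import re

# Hypothetical sample: the name followed by 0, 1, 2 and 3 digits.
s = 'xiaohu xiaohu1 xiaohu22 xiaohu333'

# Each shorthand quantifier paired with its explicit {m,n} spelling.
pairs = [
    (r'xiaohu\d?', r'xiaohu\d{0,1}'),
    (r'xiaohu\d+', r'xiaohu\d{1,}'),
    (r'xiaohu\d*', r'xiaohu\d{0,}'),
]

results = {}
for shorthand, explicit in pairs:
    results[shorthand] = re.findall(shorthand, s)
    same = results[shorthand] == re.findall(explicit, s)
    print(shorthand, results[shorthand], same)
```

Each line printed ends with True, meaning both spellings matched the same substrings.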

Matching specific phone numbers: 1. they start with 183, 153, or 173; 2. they are exactly 11 digits long

s1 = '我有一个手机号是18347821932，另一个手机号是17386429189，还有一个手机号是15356878621，以前用过一个手机号13987648345'
res1 = re.findall(r'1[857]3\d{8}', s1)
print(res1) # ['18347821932', '17386429189', '15356878621']
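The pattern above consumes exactly 11 digits, but it will also match the first 11 digits of a longer run of digits. One possible way to reject those, not used in the original example, is to add the lookarounds (?<!\d) and (?!\d), which require that no digit sits directly before or after the match. The sample string here is an assumption.

```python
import re

# Hypothetical sample: one valid number and one 15-digit run.
s = 'valid: 18347821932, too long: 183478219321234'

# Plain pattern: also picks up the first 11 digits of the long run.
loose = re.findall(r'1[857]3\d{8}', s)

# Lookbehind/lookahead: the 11 digits must not touch another digit.
strict = re.findall(r'(?<!\d)1[857]3\d{8}(?!\d)', s)

print(loose)   # ['18347821932', '18347821932']
print(strict)  # ['18347821932']
```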

\w matches a letter, a digit, an underscore, or (by default, for str patterns) a word character from any other script

s1 = '我有一个邮箱是183478@qq.com，另一个邮箱是17386@163.com，还有一个邮箱是78621@_mail.com，以前用过一个邮箱139876@*q.com,183478@王q.com，78621@のmail.com'
res1 = re.findall(r'\d+@\w+\.com', s1)
print(res1) #['183478@qq.com', '17386@163.com', '78621@_mail.com', '183478@王q.com', '78621@のmail.com']

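\W matches any character that \w does not match. The re.ASCII flag used below restricts \w to [a-zA-Z0-9_], so the Chinese characters in front of the address are not swallowed by \w+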

s1 = '我有一个邮箱是hys183478@###.com，另一个邮箱是17386zcy@163.com，还有一个邮箱是786zrx21@gmail.com，以前用过一个邮箱139876@qq.com'
res1 = re.findall(r'\w+@\W+\.com', s1, re.ASCII)
print(res1) # ['hys183478@###.com']
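To see what re.ASCII changes, the following sketch runs the same \w+ pattern with and without the flag on a short made-up string.

```python
import re

# Hypothetical sample mixing ASCII, Chinese and Japanese characters.
s = 'abc_123 王q のmail'

unicode_words = re.findall(r'\w+', s)            # default: Unicode word characters
ascii_words = re.findall(r'\w+', s, re.ASCII)    # only [a-zA-Z0-9_]

print(unicode_words)  # ['abc_123', '王q', 'のmail']
print(ascii_words)    # ['abc_123', 'q', 'mail']
```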

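^ anchors the match at the start of the string

# Pattern: the string must start with a digit
pattern = r'^\d'

# Test strings
test_strings = ["123abc", "abc123", "456def", "789ghi"]

# Check each test string; re.search is used so that ^ (not the function) does the anchoring
for string in test_strings:
    if re.search(pattern, string):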
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")

$ anchors the match at the end of the string

# Pattern: the string must end with a digit
pattern = r'\d$'

# Test strings
test_strings = ["abc", "def2", "ghi", "jkl4"]

# Check each test string against the pattern
for string in test_strings:
    if re.search(pattern, string):
        print(f"'{string}' matches the pattern")
    else:
        print(f"'{string}' does not match the pattern")
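By default ^ and $ refer to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of every line. A short sketch on a made-up three-line string:

```python
import re

# Hypothetical three-line sample; each line starts and ends with a number.
text = '10 apples 11\n20 pears 21\n30 plums 31'

start_default = re.findall(r'^\d+', text)                 # start of string only
start_multi = re.findall(r'^\d+', text, re.MULTILINE)     # start of every line
end_default = re.findall(r'\d+$', text)                   # end of string only
end_multi = re.findall(r'\d+$', text, re.MULTILINE)       # end of every line

print(start_default, start_multi)  # ['10'] ['10', '20', '30']
print(end_default, end_multi)      # ['31'] ['11', '21', '31']
```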

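() defines a capture group. When the pattern contains groups, re.findall returns the groups rather than the whole match; with several groups it returns a list of tuples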

s1 = '有一个学生的身份证号为340123200312075687，另一个学生的身份证号是340122199705035414'
res1 = re.findall(r'340\d{3}(\d{4})\d{8}', s1)
print(res1) # ['2003', '1997']

s1 = '有一个学生的身份证号为340123200312075687，另一个学生的身份证号是340122199705035414'
res1 = re.findall(r'(340\d{3}(\d{4})\d{8})', s1)
print(res1) # [('340123200312075687', '2003'), ('340122199705035414', '1997')]

s1 = '有一个学生的身份证号为340123200312075687，另一个学生的身份证号是340122199705035414'
res1 = re.findall(r'(340\d{3}(\d{4})(\d{2})(\d{2}))', s1)
print(res1) # [('34012320031207', '2003', '12', '07'), ('34012219970503', '1997', '05', '03')]
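When a pattern has several groups, numbered positions become hard to read. As an alternative that the original examples do not use, Python supports named groups written (?P<name>...). The sketch below assumes the same ID layout as above (six-digit region code starting with 340, then year, month, day, then four more digits); the group names are my own choice.

```python
import re

# The two ID numbers from the example above, separated by a space.
ids = '340123200312075687 340122199705035414'

# Named groups for the birth date; the trailing \d{4} covers the remaining digits.
pattern = r'340\d{3}(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})\d{4}'

births = [m.groupdict() for m in re.finditer(pattern, ids)]
for birth in births:
    print(birth['year'], birth['month'], birth['day'])
# 2003 12 07
# 1997 05 03
```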

| means alternation between sub-patterns; wrap the alternatives in parentheses to limit its scope

s1 = '有一个学生的身份证号为340123200312075687，另一个学生的身份证号是340122199705035414，另一个学生的身份证号是340110199705035414'
res1 = re.findall(r'(340(123|110)(\d{4})(\d{2})(\d{2}))', s1)
print(res1) # [('34012320031207', '123', '2003', '12', '07'), ('34011019970503', '110', '1997', '05', '03')]
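Two points about | are easy to miss: without parentheses it splits the entire pattern, and a capturing group changes what re.findall returns. The sketch below illustrates both, using (?:...), a non-capturing group, which is an addition not present in the original example.

```python
import re

# The three ID numbers from the example above, separated by spaces.
s = '340123200312075687 340122199705035414 340110199705035414'

# Without parentheses the alternatives are "340123" and "110".
unscoped = re.findall(r'340123|110', s)

# (?:...) limits the scope of | without capturing, so findall returns whole matches.
scoped = re.findall(r'340(?:123|110)\d{12}', s)

print(unscoped)  # ['340123', '110']
print(scoped)    # ['340123200312075687', '340110199705035414']
```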

. matches any single character except a newline

s1 = '我有一个键盘，键盘的售卖序列号为JP2134WFWFasd##&13000, 上一个键盘的序列号为JPIUYT4WFqw34sd##&000'
res1 = re.findall(r'JP.{16}', s1)
print(res1) # ['JP2134WFWFasd##&13', 'JPIUYT4WFqw34sd##&']
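The newline exception can be lifted with the re.DOTALL flag. A minimal sketch on a made-up string:

```python
import re

# Hypothetical sample: 'a' and 'c' separated once by '-' and once by a newline.
s = 'a-c a\nc'

default = re.findall(r'a.c', s)              # . skips the newline
dotall = re.findall(r'a.c', s, re.DOTALL)    # . matches the newline too

print(default)  # ['a-c']
print(dotall)   # ['a-c', 'a\nc']
```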

Use \ to escape . so that it matches a literal dot

s1 = '我有shujia#888一个键盘，键盘的售shujia.666卖序列号为JP2134WFW.asd##&13, 上一个键盘的序列号为JPIUYT4WFqw34sd##&'
res1 = re.findall(r'shujia\.\d{3}', s1)
print(res1) # ['shujia.666']
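When the text to search for comes from a variable, escaping by hand is error-prone. The standard library function re.escape adds the backslashes for you. The sketch below is an illustration with a made-up keyword and sample string.

```python
import re

keyword = 'shujia.666'  # hypothetical literal text to search for
s = 'shujia#666 shujia.666 shujia0666'

# Used as-is, the dot in the keyword matches any character.
unescaped = re.findall(keyword, s)

# re.escape turns the keyword into a pattern that matches it literally.
escaped = re.findall(re.escape(keyword), s)

print(unescaped)  # ['shujia#666', 'shujia.666', 'shujia0666']
print(escaped)    # ['shujia.666']
```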

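Common functions

re.findall() returns a list of all non-overlapping matches of the pattern in the string
re.match() checks whether the pattern matches at the beginning of the string (it does not require the whole string to match; use re.fullmatch() for that)

re.search() scans the string from left to right and returns only the first match as a Match object, or None if there is no match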

text = '博主讲的实在是太1165872335@数加.com好了，通俗易懂，已三连，求课件，我的邮箱是1165872335@qq.com或' \
       '者是xiaohu2023666@pronton.com谢谢博主 手xiaohu2机微信号也可以17354074069'

res1 = re.search(r'1\d+@\w+\.com',text)
print(res1) # <re.Match object; span=(8, 25), match='1165872335@数加.com'>
print(res1.group()) # 1165872335@数加.com
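Because re.search returns None when nothing matches, calling .group() on the result without a check raises AttributeError. A small sketch of a guarded lookup; the helper name first_email and the sample strings are hypothetical.

```python
import re

def first_email(text):
    """Return the first digits@domain.com address in text, or None if there is none."""
    m = re.search(r'\d+@\w+\.com', text)
    return m.group() if m else None

found = first_email('mail me at 1234214@qq.com please')
missing = first_email('no address here')

print(found)    # 1234214@qq.com
print(missing)  # None
```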

re.split() splits the string at every match of the pattern

text = '1001,xiaohu#18$踢足球'

res1 = re.split(r'[,#$]',text)
print(res1) # ['1001', 'xiaohu', '18', '踢足球']
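Two variations of re.split are often useful: wrapping the pattern in a capture group keeps the separators in the result, and maxsplit limits the number of splits. A short sketch on the same text:

```python
import re

text = '1001,xiaohu#18$踢足球'

# A capture group around the separators keeps them in the output.
with_delims = re.split(r'([,#$])', text)

# maxsplit=1 stops after the first separator.
first_only = re.split(r'[,#$]', text, maxsplit=1)

print(with_delims)  # ['1001', ',', 'xiaohu', '#', '18', '$', '踢足球']
print(first_only)   # ['1001', 'xiaohu#18$踢足球']
```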

re.finditer() finds all matches like re.findall, but returns an iterator of Match objects

text = '博主讲的实在是太1165872335@数加$.com好了，通俗易懂，已三连，求课件，我的邮箱是 1165872335@qq.com 或' \
       '者是xiaohu2023666@pronton.com谢谢博主 手xiaohu2机微信号也可以17354074069'


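# Raw string, so that \w, \$ and \. reach the regex engine unchanged
res1 = re.finditer(r'(\w+@(数加\$|qq|pronton)\.com)', text, re.ASCII)
for res in res1:
    print(res.group(1))  # the whole address
    print(res.group(2))  # the domain name only
# Output:
# 1165872335@数加$.com
# 数加$
# 1165872335@qq.com
# qq
# xiaohu2023666@pronton.com
# pronton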

re.fullmatch() matches the entire string against the pattern

text = '安徽省-合肥市-蜀山区-浮山路'

res1 = re.fullmatch(r'(\w+)-(\w+)-(\w+)-(\w+)', text)
print(f"Province: {res1.group(1)}")  # 安徽省
print(f"City: {res1.group(2)}")      # 合肥市
print(f"District: {res1.group(3)}")  # 蜀山区
print(f"Street: {res1.group(4)}")    # 浮山路
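The three functions re.match, re.search and re.fullmatch differ only in where the match must sit. The sketch below compares them side by side; the helper name describe and the test strings are made up for illustration.

```python
import re

def describe(pattern, s):
    """Report which of the three functions find a match for pattern in s."""
    return {
        'match': bool(re.match(pattern, s)),          # must match at the start
        'search': bool(re.search(pattern, s)),        # may match anywhere
        'fullmatch': bool(re.fullmatch(pattern, s)),  # must cover the whole string
    }

table = {s: describe(r'\d+', s) for s in ['123', '123abc', 'abc123']}
for s, row in table.items():
    print(s, row)
# 123 {'match': True, 'search': True, 'fullmatch': True}
# 123abc {'match': True, 'search': True, 'fullmatch': False}
# abc123 {'match': False, 'search': True, 'fullmatch': False}
```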

Posted 2024-12-06 21:36 by WangYao_BigData
