re模块函数模式详解

re模块

python爬虫过程中,实现页面元素解析的方法很多,正则解析只是其中之一,常见的还有BeautifulSoup和lxml,它们都支持网页HTML元素解析,re模块提供了强大的正则表达式功能

re模块常用方法

compile(pattern,flags=0) :用于编译一个正则表达式字符串,生成一个re.pattern对象
- 参数: pattern:正则表达式字符串,flags:可选标志,用于修改匹配行为(下面会详细讲解)
```
pattern = re.compile(r'a.*e')
```
search(pattern,string,flag=0):搜索字符串中第一个匹配,返回一个匹配对象,否则返回None
- 参数:pattern:正则表达式字符串或编译后的re.pattern对象,string:要搜索的字符串,flags:用于修改匹配行为（如 re.IGNORECASE、re.MULTILINE 等)
- 返回值:返回一个re.Match对象或None
```
result = re.search(r'\bcat\b', 'The cat sat on the mat')
#输出<re.Match object; span=(4, 7), match='cat'> 接下来会解释
```
match(pattern,string,flags=0):从字符串的开头匹配,如果匹配成功返回一个匹配对象,否则返回None
- 参数pattern:正则表达式字符串或编译后的re.Pattern 对象,string:要匹配的字符串,flags:可选的标志,用于修改匹配行为
- 返回值:返回一个re.Match对象或None
```
result = re.search(r'\bcat\b', 'The cat sat on the mat')
```
findall(pattern,string,flags=0):返回所有匹配的子串列表
- 返回值:一个包含所有匹配子串的列表
```
result = re.findall(r'\bcat\b', 'The cat sat on the cat mat')
#输出['cat', 'cat']
```

finditer(pattern,string,flags=0) :返回一个迭代器,产生所有匹配的 Match对象

返回值:一个迭代器，产生 re.Match 对象

result = re.finditer(r'\bcat\b', 'The cat sat on the cat mat')
for match in result:
    print(match)
#<re.Match object; span=(4, 7), match='cat'>
#<re.Match object; span=(19, 22), match='cat'>

sub(pattern,repl,string,count=0,flags=0):替换字符串中所有匹配的子串
- 参数:repl:替换字符串或替换函数,count:可选参数,指定最大替换次数,默认为0(替换所有匹配)
- 返回值:替换后的字符串
```
result = re.sub(r'\bcat\b', 'dog', 'The cat sat on the cat mat')
print(result)
#The dog sat on the dog mat
```
split(pattern,string,maxsplit=0,flags=0) :根据匹配的字串分割字符串
- 参数:maxsplit:可选参数,指定最大分割次数,默认为0(不限分割次数)
- 返回值:一个包含分割结果的列表
```
text = "The cat sat on the cat mat"
result = re.split(r'\b', text)
print(result)
#['', 'The', ' ', 'cat', ' ', 'sat', ' ', 'on', ' ', 'the', ' ', 'cat', ' ', 'mat', '']
```

flags模式标志位

模式匹配支持多种模式标志,用于修改匹配模式行为

`re.IGNORECASE或re.I`

忽略大小写匹配模式

eg:

text = "The cat sat on the cat mat"
regex = re.compile("CAT", re.IGNORECASE)
result = regex.search(text)
if result:
    print(result.group())
else:
    print("not found")
#output-> cat

`re.MULTILINE或re.M`

多行模式,使 ^ 和 $ 匹配每行的开始和结束

eg:

text = """first Line
second Line2
third Line3"""

regex = re.compile(r'^\w+', re.MULTILINE)
result = regex.findall(text)
print(result)
#output->['first', 'second', 'third']

`re.DOTALL 或 re.S`

可以使得. 匹配包括换行符在内的所有字符

eg:

text = "abc\ndef\nghi"

regex = re.compile(r'.*', re.DOTALL)
result = regex.findall(text)
print(result)
#output->['abc\ndef\nghi', '']

`re.VERBOSE 或 re.X`

允许在正则表达式中使用注释和空白

eg:

	text = "The cat sat on the mat"
	
	regex = re.compile(r"""
	                \b #单词边界
	                cat#匹配"cat"
	                \b#单词边界
	""", re.VERBOSE)
	result = regex.findall(text)
	print(result)
#output->['cat']

内嵌模式标志

内嵌模式的功能与flags标志位的功能一致,与之不同的是使用内嵌模式标志和注释可以增强正则表达式的可读性和灵活性

常用的内嵌模式标志

(?i):忽略大小写

eg:

text = "hello world"

regex = re.compile(r"(?#忽略大小写)(?i)HELLO WORLD")
# (?#...) 是一个注释语法
result = regex.match(text).group()
print(result)

(?m):多行模式,使 ^ 和 $ 匹配每行的开始和结束，而不仅仅是整个字符串的开始和结束

eg:

text = """first Line
second Line
third Line"""

regex = re.compile(r"(?#多行模式)(?m)^\w+")
result = regex.findall(text)
print(result)
#output->['first', 'second', 'third']

(?s):点匹配所有字符模式

eg:

text = "abc\ndef\nghi"

regex = re.compile(r"(?#.号匹配换行符在内的所有字符)(?s).*")
result = regex.findall(text)
print(result)
#output->['abc\ndef\nghi', '']

(?x):允许注释和空白

eg:

text = "The cat sat on the mat"

regex = re.compile(r"(?#允许注释)(?ix)\bCAT\b")
# ix->忽略大小写的注释模式
result = regex.findall(text)
print(result)
#output->['cat']

re.Match 对象

re.Match对象是re模块中的一个类,用于表示正则表达式匹配的结果,当使用re.search,re,match或re.finditer方法时,若找到了匹配项,这些方法就会返回一个re.Match对象

re.Match返回字段解释

当使用print打印re.Match对象通常会返回<re.Match object; span=(4, 7), match='cat'>这种类似的字段

<re.Match object;:该部分表示的这是一个re.Match对象
span=(4, 7):表示一个元组,表示的是匹配字串在原始字符串中起始索引和结束索引
match='cat’:match是一个字符串,表示实际匹配的字串

访问 re.Match 对象的属性

通过访问re.Match对象的属性,我们可以获取更多信息

group():返回匹配的子字符串,如match.group()返回’cat’
start():返回匹配字串在原始字符串的起始索引
end():返回匹配字串在原始字符串的结束索引
span():返回元组表示字串的起始与结束索引

posted @ 2024-11-26 21:00 ihav2carryon 阅读(195) 评论(0) 编辑收藏举报

刷新页面返回顶部

登录后才能查看或发表评论，立即登录或者逛逛博客园首页

相关博文：

· python爬虫正则表达式详解

· BeautifulSoup(bs4)细致讲解

· Python | 正则表达式(re模块)

· 【python】re模块

· python基础教程：re模块用法详解

阅读排行：
· DeepSeek “源神”启动！「GitHub 热点速览」
· 我与微信审核的“相爱相杀”看个人小程序副业
· 上周热点回顾（2.17-2.23）
· 如何使用 Uni-app 实现视频聊天（源码，支持安卓、iOS）
· 微软正式发布.NET 10 Preview 1：开启下一代开发框架新篇章

公告

昵称： ihav2carryon
园龄： 10个月
粉丝： 14
关注： 6

+加关注

2025年2月

日

一

二

三

四

五

六

ihave2carryon

re模块函数模式详解

re模块

re模块常用方法

flags模式标志位

`re.IGNORECASE或re.I`

`re.MULTILINE或re.M`

`re.DOTALL 或 re.S`

`re.VERBOSE 或 re.X`

内嵌模式标志

常用的内嵌模式标志

re.Match 对象

re.Match返回字段解释

访问 re.Match 对象的属性

公告

搜索

常用链接

我的标签

随笔分类

随笔档案

阅读排行榜

推荐排行榜

ihave2carryon

re模块 函数模式详解

re模块

re模块常用方法

flags模式标志位

re.IGNORECASE或re.I

re.MULTILINE或re.M

re.DOTALL 或 re.S

re.VERBOSE 或 re.X

内嵌模式标志

常用的内嵌模式标志

re.Match 对象

re.Match返回字段解释

访问 re.Match 对象的属性

公告

搜索

常用链接

我的标签

随笔分类

随笔档案

阅读排行榜

推荐排行榜

re模块函数模式详解

`re.IGNORECASE或re.I`

`re.MULTILINE或re.M`

`re.DOTALL 或 re.S`

`re.VERBOSE 或 re.X`