What do 'lazy' and 'greedy' mean in the context of regular expressions?
Answer 1
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non-newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across is where we want to stop the matching.
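The behavior described above can be checked with a minimal sketch. Python and its standard re module are an assumption here (the answer does not name a language), and the variable names are purely illustrative:

```python
import re

html = "<em>Hello World</em>"

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' that lets the whole pattern succeed.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<em>Hello World</em>']
print(lazy)    # ['<em>', '</em>']
```

The only difference between the two patterns is the ? after the +.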
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
https://regexr.com/
According to this site, the question mark "Makes the preceding quantifier lazy, causing it to match as few characters as possible. By default, quantifiers are greedy, and will match as many characters as possible."
Answer 2
'Greedy' means match the longest possible string.
'Lazy' means match the shortest possible string, starting from the leftmost position where a match can begin.
For example, the greedy h.+l matches 'hell' in 'hello', but the lazy h.+?l matches 'hel'.
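A quick check of this example, sketched in Python with the standard re module (the language choice and variable names are assumptions for illustration):

```python
import re

# Greedy: .+ first takes "ello", then backtracks to the last 'l'.
greedy = re.search(r"h.+l", "hello").group()

# Lazy: .+? takes one character, then stops at the first 'l' it can reach.
lazy = re.search(r"h.+?l", "hello").group()

print(greedy)  # hell
print(lazy)    # hel
```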
Answer 3
| Greedy quantifier | Lazy quantifier | Description |
|---|---|---|
| * | *? | Star Quantifier: 0 or more |
| + | +? | Plus Quantifier: 1 or more |
| ? | ?? | Optional Quantifier: 0 or 1 |
| {n} | {n}? | Quantifier: exactly n |
| {n,} | {n,}? | Quantifier: n or more |
| {n,m} | {n,m}? | Quantifier: between n and m |
Add a ? to a quantifier to make it ungreedy, i.e. lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o
output: stackoverflo (everything up to the last o)
lazy reg expression : s.*?o
output: stacko (everything up to the first o)
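As a rough illustration of the table and the example, the sketch below compares each greedy quantifier with its lazy form. Python's re module, the test string "aaaa", and the concrete values chosen for n and m are all assumptions made for the demonstration:

```python
import re

text = "stackoverflow"
greedy = re.search(r"s.*o", text).group()   # runs to the last 'o'
lazy = re.search(r"s.*?o", text).group()    # stops at the first 'o'

# One (greedy, lazy) pair per table row, matched at the start of "aaaa".
pairs = [
    ("a*", "a*?"),
    ("a+", "a+?"),
    ("a?", "a??"),
    ("a{2}", "a{2}?"),
    ("a{2,}", "a{2,}?"),
    ("a{1,3}", "a{1,3}?"),
]
results = {
    g: (re.match(g, "aaaa").group(), re.match(l, "aaaa").group())
    for g, l in pairs
}

print(greedy, lazy)
for g, (greedy_match, lazy_match) in results.items():
    print(f"{g!r}: greedy={greedy_match!r} lazy={lazy_match!r}")
```

Note that {n} and {n}? match the same text, since an exact count leaves the quantifier no choice.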
What is the difference between .*? and .* regular expressions?
Answer 1
On greedy vs non-greedy
Repetition in regex is greedy by default: a quantifier tries to match as many reps as possible, and when this doesn't work and the engine has to backtrack, it tries to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition will have matched as many reps as possible.
A ? placed after a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and the engine has to backtrack, it matches one more rep at a time. As a result, when a match finally happens, a reluctant repetition will have matched as few reps as possible.
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
- A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
- A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
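The two match lists can be reproduced with a short sketch. The original links point to rubular.com, so using Python's re module here is an assumption, as are the variable names:

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

# Greedy: one long match from the first A to the last Z.
greedy_matches = re.findall(r"A.*Z", text)

# Reluctant: each match ends at the nearest Z, so there are two of them.
reluctant_matches = re.findall(r"A.*?Z", text)

print(greedy_matches)     # ['AiiZuuuuAoooZ']
print(reluctant_matches)  # ['AiiZ', 'AoooZ']
```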
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer . :
eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.
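The walkthrough above can be mimicked with a toy model. This is only a sketch of the idea, not how any real regex engine is implemented: the function name match_a_dot_star_z is hypothetical, it handles only the patterns A.*Z and A.*?Z, and it assumes the input contains no newlines:

```python
def match_a_dot_star_z(text, start, greedy=True):
    """Toy model of matching A.*Z (greedy) or A.*?Z (reluctant) at `start`.

    Returns (matched_text, attempts); matched_text is None on failure.
    """
    if not text.startswith("A", start):
        return None, 0
    available = len(text) - (start + 1)  # characters .* could consume
    # Greedy tries the longest repetition first and shrinks by one;
    # reluctant tries the shortest first and grows by one.
    if greedy:
        lengths = range(available, -1, -1)
    else:
        lengths = range(0, available + 1)
    attempts = 0
    for n in lengths:
        attempts += 1
        z_pos = start + 1 + n
        if text.startswith("Z", z_pos):
            return text[start:z_pos + 1], attempts
    return None, attempts


text = "eeeAiiZuuuuAoooZeeee"
greedy_result = match_a_dot_star_z(text, 3, greedy=True)
reluctant_result = match_a_dot_star_z(text, 3, greedy=False)
print(greedy_result)
print(reluctant_result)
```

Starting at the first A, the greedy variant shrinks from the end of the input until a Z fits, while the reluctant variant grows from zero characters until the first Z fits.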
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
Example: An alternative
In many applications, the two matches in the above input are what is desired, so a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative: a negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use a greedy or reluctant quantifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
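A small sketch comparing the reluctant pattern with the negated character class on the same input. It uses Python's re module as an assumption (the original demo was on ideone.com) and checks only the matches, not the performance claim:

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

reluctant = re.findall(r"A.*?Z", text)

# [^Z] cannot run past a Z, so greedy and reluctant forms agree here.
negated_greedy = re.findall(r"A[^Z]*Z", text)
negated_reluctant = re.findall(r"A[^Z]*?Z", text)

print(reluctant)          # ['AiiZ', 'AoooZ']
print(negated_greedy)     # ['AiiZ', 'AoooZ']
print(negated_reluctant)  # ['AiiZ', 'AoooZ']
```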
References
- regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
- A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
- A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
- A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
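The three results can be reproduced with the sketch below, again assuming Python's re module in place of the ideone.com demos; the dictionary name is illustrative:

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

matches = {
    pattern: re.findall(pattern, text)
    for pattern in (r"A[^Z]*ZZ", r"A.*?ZZ", r"A.*ZZ")
}

for pattern, found in matches.items():
    print(pattern, found)
```

The reluctant pattern still starts at the leftmost A, which is why its match is longer than the one found by the negated character class.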
Related topics
These are links to questions and answers on Stack Overflow that cover some topics that may be of interest.
- One greedy repetition can outgreed another
Answer 2
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
In 1.*?1, the .*? is non-greedy. * will match nothing at first, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
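A minimal sketch of this answer's examples in Python's re module. The language is an assumption, and the tag string used for the last pattern is borrowed from the earlier HTML example rather than from this answer:

```python
import re

text = "101000000000100"

greedy = re.search(r"1.*1", text).group()    # backtracks from the end to the last 1
lazy = re.search(r"1.*?1", text).group()     # expands until the next 1

# The negated class captures what sits between '<' and the nearest '>'.
tags = re.findall(r"<([^>]*)>", "<em>Hello World</em>")

print(greedy)  # 1010000000001
print(lazy)    # 101
print(tags)    # ['em', '/em']
```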
Author: Chuck Lu