Zero-Width Assertions -- Lookahead/Lookbehind, Positive/Negative
http://www.vaikan.com/regular-expression-to-match-string-not-containing-a-word/
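We often need to find text that does not contain a given string. The first thing many programmers try in a regular expression is
^(hede)
to filter out the string "hede", but this is wrong: it matches strings that start with "hede".
Another attempt is
[^hede]
but that regular expression means something else entirely: it matches a single character that is not 'h', 'e', or 'd'.
So which regular expression selects text that does not contain the complete string "hede"?
In fact, the claim that regular expressions do not support inverse matching is not 100% correct.
For this problem we can use negative lookahead to simulate inverse matching:
^((?!hede).)*$
This expression matches a line only if it does not contain the string "hede". It is not what regular expressions are "good at", but it does work.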
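The difference between the negative-lookahead pattern and the character class can be checked with a minimal Python sketch using the standard re module. The sample lines and variable names are made up for illustration.

```python
import re

# Sample lines (hypothetical data for illustration).
lines = ["hello world", "ABhedeCD", "hed e", "hede"]

# Negative lookahead before every character: no position may start "hede".
no_hede = re.compile(r"^((?!hede).)*$")

kept = [s for s in lines if no_hede.search(s)]
print(kept)  # ['hello world', 'hed e']

# [^hede] is only a character class: one character that is not h, e, or d.
single_chars = re.findall(r"[^hede]", "ABhedeCD")
print(single_chars)  # ['A', 'B', 'C', 'D']
```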
Explanation
A string is made up of n characters. Before and after each character there is an empty string,
so a string of n characters has n+1 empty strings. Consider the string "ABhedeCD":
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
S = |e1| A |e2| B |e3| h |e4| e |e5| d |e6| e |e7| C |e8| D |e9|
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
index    0      1      2      3      4      5      6      7
behind (left) <------------- e3 -------------> ahead (right)
Every numbered e position is an empty string.
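The expression (?!hede). looks ahead to check that the text starting at the current position is not "hede".
If it is not, the . (dot) matches the next character.
This kind of "lookup" is called a zero-width assertion,
because it consumes no characters; it only tests a condition.
In the example above, at every empty string the engine checks that the text ahead is not "hede",
and if so, the . (dot) matches one character. The expression (?!hede). does this only once,
so we wrap it in a group and apply * (star), which repeats it zero or more times: ((?!hede).)*
Anchored with ^ and $, the regular expression ^((?!hede).)*$ fails on the string "ABhedeCD",
because at position e3 the check (?!hede) fails: "hede" lies directly ahead,
so the string contains the forbidden text.
In regular expressions, (?!...) is negative lookahead; it solves the "does not contain" matching problem.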
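Zero-width assertion | Matches | Side of the current position that is checked |
---|---|---|
(?=exp) | the position before exp | after (right) |
(?<=exp) | the position after exp | before (left) |
(?!exp) | a position not followed by exp | after (right) |
(?<!exp) | a position not preceded by exp | before (left) |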
[before / behind] <<< current position >>> [after / ahead]
before the expression <--- expression ---> after the expression
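(?=subexpression) Zero-width positive lookahead. Asserts that the text [after] the current position matches exp. Lookahead Positive
(?<=subexpression) Zero-width positive lookbehind. Asserts that the text [before] the current position matches exp. Lookbehind Positive
! means "not", i.e. must not match; it is likewise zero-width and captures nothing.
(?!subexpression) Zero-width negative lookahead. Asserts that the text [after] this position does not match exp. Lookahead Negative !
(?<!subexpression) Zero-width negative lookbehind. Asserts that the text [before] this position does not match exp. Lookbehind Negative !
< -- lookbehind
no < -- lookahead
! -- Negative
= -- Positive
^((?!regex).)*$ : matches a line that does not contain the string "regex".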
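Zero-width assertions
The next four constructs find things before or after certain content without including that content. Like \b, ^, and $, they specify a position,
and that position must satisfy a condition (the assertion), which is why they are called zero-width assertions. Examples explain this best:
An assertion states a fact that should be true. The regex engine continues matching only when the assertion holds.
(?=exp), zero-width positive lookahead, asserts that the text [after] the current position matches exp.
For example, \b\w+(?=ing\b) matches the leading part of words ending in ing (everything except the ing);
searching I'm singing while you're dancing. it matches sing and danc.
(?<=exp), zero-width positive lookbehind, asserts that the text [before] the current position matches exp.
For example, (?<=\bre)\w+\b matches the rest of words starting with re (everything except the re);
searching reading a book it matches ading.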
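Both examples can be reproduced with a short Python sketch using the standard re module. The variable names are illustrative only.

```python
import re

text = "I'm singing while you're dancing."

# Lookahead: the word part that is followed by "ing" at a word end.
stems = re.findall(r"\b\w+(?=ing\b)", text)
print(stems)  # ['sing', 'danc']

# Lookbehind: the word part that is preceded by "re" at a word start.
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")
print(tails)  # ['ading']
```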
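Suppose you want to put a comma between every three digits of a long number (counting from the right, of course).
You can find the part that needs commas in front of it and inside it like this:
((?<=\d)\d{3})+\b. Searching 1234567890 with it gives 234567890.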
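A small Python sketch of this search. The second half is an assumed follow-up that is not part of the original notes: one possible way to actually insert the commas, combining a lookbehind with a lookahead.

```python
import re

number = "1234567890"

# The pattern from the text: groups of three digits, each preceded by a digit.
found = re.search(r"((?<=\d)\d{3})+\b", number).group()
print(found)  # 234567890

# Assumed follow-up: insert a comma at every position that has a digit
# before it and a multiple of three digits after it, up to a word boundary.
with_commas = re.sub(r"(?<=\d)(?=(\d{3})+\b)", ",", number)
print(with_commas)  # 1,234,567,890
```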
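The following example uses both assertions:
(?<=\s)\d+(?=\s) matches digits delimited by whitespace (again, the whitespace itself is not included),
for example the 123 in abc 123 def.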
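Negative zero-width assertions
Earlier we covered how to find a character that is not a given character or not in a given character class (negation).
But what if we only want to ensure that a character does not appear, without matching it?
For example, to find words that contain the letter q not followed by the letter u, we might try:
\b\w*q[^u]\w*\b matches words containing a q followed by something other than u.
With more testing (or a sharp enough eye) you will find that
this expression goes wrong when q is the last letter of a word, as in Iraq or Benq.
The reason is that [^u] always has to match one character. If q is the last character of the word,
[^u] matches the word separator after q (a space, a period, or whatever it is),
and the following \w*\b matches the next word, so \b\w*q[^u]\w*\b matches all of Iraq fighting.
A negative zero-width assertion solves this, because it only matches a position and consumes no characters.
Now we can write: \b\w*q(?!u)\w*\b.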
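The two patterns can be compared side by side in a short Python sketch. The extra sample words in the last call are made up for illustration.

```python
import re

text = "Iraq fighting"

# The character class consumes the space and drags in the next word.
with_class = re.search(r"\b\w*q[^u]\w*\b", text).group()
print(with_class)  # Iraq fighting

# The negative lookahead only checks the position after q.
with_lookahead = re.search(r"\b\w*q(?!u)\w*\b", text).group()
print(with_lookahead)  # Iraq

# Words with a q that is not followed by u.
words = re.findall(r"\b\w*q(?!u)\w*\b", "Iraq Benq quit queen")
print(words)  # ['Iraq', 'Benq']
```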
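Zero-width negative lookahead (?!exp) asserts that the text [after] this position does not match exp.
For example, \d{3}(?!\d) matches three digits that are not followed by another digit;
\b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.
Likewise, zero-width negative lookbehind (?<!exp)
asserts that the text [before] this position does not match exp:
(?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter.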
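A quick Python check of the three patterns above. The input strings are invented test data, and finditer is used for the second pattern because findall would return the capture group rather than the whole match.

```python
import re

# Three digits not followed by another digit.
three = re.search(r"\d{3}(?!\d)", "12345").group()
print(three)  # 345

# Words that never contain "abc".
no_abc = [m.group() for m in re.finditer(r"\b((?!abc)\w)+\b", "xabcx hello cab")]
print(no_abc)  # ['hello', 'cab']

# Seven digits not preceded by a lowercase letter.
seven = re.findall(r"(?<![a-z])\d{7}", "a1234567 b 7654321")
print(seven)  # ['7654321']
```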
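A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside simple HTML tags that have no attributes.
(?<=<(\w+)>) specifies the prefix:
a word enclosed in angle brackets (such as <b>),
followed by .* (any string), and finally the suffix (?=<\/\1>).
Note the \/ in the suffix, which uses character escaping;
\1 is a backreference to the first captured group, the text matched by the earlier (\w+),
so if the prefix is actually <b>, the suffix is </b>.
The whole expression matches the content between <b> and </b> (again, excluding the prefix and suffix themselves).
Because \w+ has no fixed length, this pattern only works in engines that allow variable-length lookbehind, such as .NET.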
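Python's standard re module requires fixed-width lookbehind, so it rejects this pattern. The sketch below first shows the rejection and then an assumed workaround (not from the original notes) that captures the tag content in a group instead of using lookaround. The sample HTML is made up.

```python
import re

html = "<b>bold text</b> and <i>italic</i>"

# \w+ makes the lookbehind variable-width, which Python's re does not accept.
try:
    re.compile(r"(?<=<(\w+)>).*(?=<\/\1>)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False

# Assumed workaround: capture the tag name (group 1) and the content (group 2).
contents = [m.group(2) for m in re.finditer(r"<(\w+)>(.*?)</\1>", html)]
print(contents)  # ['bold text', 'italic']
```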
Lookahead and Lookbehind Zero-Length Assertions
Lookahead and lookbehind, collectively called lookaround, are zero-length assertions
just like the start ^ and end of line $ and start and end of word anchors.
The difference is that lookaround actually matches characters, but then gives up the match,
returning only the result: match or no match. That is why they are called assertions.
They do not consume characters in the string, but only assert whether a match is possible or not.
Lookaround allows you to create regular expressions that are impossible to create without them,
or that would get very longwinded without them.
Positive and Negative Lookahead
Negative lookahead is indispensable if you want to match something not followed by something else.
When explaining character classes, this tutorial explained why you cannot use a negated character class
to match a q not followed by a u.
Negative lookahead provides the solution: q(?!u)
The negative lookahead construct is the pair of parentheses,
with the opening parenthesis[(] followed by a question mark [?] and an exclamation point[!]
Inside the lookahead, we have the trivial regex u.
Positive lookahead works just the same. q(?=u)
matches a q that is followed by a u, without making the u part of the match.
The positive lookahead construct is a pair of parentheses,
with the opening parenthesis [(] followed by a question mark [?] and an equals sign [=]
You can use any valid regular expression inside the lookahead (but not lookbehind, as explained below).
If it contains capturing groups then those groups will capture as normal and backreferences to them will work normally,
even outside the lookahead.
(The only exception is Tcl, which treats all groups inside lookahead as non-capturing.)
The lookahead itself is not a capturing group.
It is not included in the count towards numbering the backreferences.
If you want to store the match of the regex inside a lookahead,
you have to put capturing parentheses around the regex inside the lookahead, like this:
(?=(regex))
The other way around will not work,
because the lookahead will already have discarded the regex match by the time the capturing group is to store its match.
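A minimal Python sketch of capturing inside a lookahead, using the q/u example from this section. The exact patterns are illustrative.

```python
import re

# Capturing group inside the lookahead: the overall match is still just "q".
inside = re.search(r"q(?=(u\w))", "quit")
print(inside.group(0))  # q
print(inside.group(1))  # ui

# The other way around: the group wraps the zero-width lookahead,
# so it captures an empty string.
outside = re.search(r"q((?=u\w))", "quit")
print(repr(outside.group(1)))  # ''
```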
Regex Engine Internals
First, let's see how the engine applies q(?!u) to the string Iraq.
The first token in the regex is the literal q.
As we already know, this causes the engine to traverse the string until the q in the string is matched.
The position in the string is now the void after the string.
The next token is the lookahead.
The engine takes note that it is inside a lookahead construct now,
and begins matching the regex inside the lookahead.
So the next token is u.
This does not match the void after the string.
The engine notes that the regex inside the lookahead failed.
Because the lookahead is negative, this means that the lookahead has successfully matched at the current position.
At this point, the entire regex has matched, and q is returned as the match.
Let's try applying the same regex to quit.
q matches q.
The next token is the u inside the lookahead.
The next character is the u.
These match.
The engine advances to the next character: i.
However, it is done with the regex inside the lookahead.
The engine notes success, and discards the regex match.
This causes the engine to step back in the string to u.
Because the lookahead is negative, the successful match inside it causes the lookahead to fail.
Since there are no other permutations of this regex, the engine has to start again at the beginning.
Since q cannot match anywhere else, the engine reports failure.
Let's take one more look inside, to make sure you understand the implications of the lookahead.
Let's apply q(?=u)i to quit.
The lookahead is now positive and is followed by another token.
Again, q matches q and u matches u.
Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u.
The lookahead was successful, so the engine continues with i.
But i cannot match u.
So this match attempt fails.
All remaining attempts fail as well, because there are no more q's in the string.
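The three walkthroughs above can be confirmed with Python's re module. The last pattern, q(?=u)u, is an added illustration that the position does not advance past the u after a successful lookahead.

```python
import re

iraq = re.search(r"q(?!u)", "Iraq")
print(iraq.group(), iraq.start())  # q 3

quit_negative = re.search(r"q(?!u)", "quit")
print(quit_negative)  # None

quit_positive_i = re.search(r"q(?=u)i", "quit")
print(quit_positive_i)  # None

# Added illustration: after the lookahead succeeds, the next token
# is tried at the u, so q(?=u)u matches "qu".
quit_positive_u = re.search(r"q(?=u)u", "quit")
print(quit_positive_u.group())  # qu
```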
Positive and Negative Lookbehind
Lookbehind has the same effect, but works backwards.
It tells the regex engine to temporarily step backwards in the string,
to check if the text inside the lookbehind can be matched there.
(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind.
It doesn't match cab, but matches the b (and only the b) in bed or debt.
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.
The construct for positive lookbehind is (?<=text):
a pair of parentheses, with the opening parenthesis followed by a question mark,
"less than" symbol, and an equals sign.
Negative lookbehind is written as (?<!text),
using an exclamation point instead of an equals sign.
More Regex Engine Internals
Let's apply (?<=a)b to thingamabob.
The engine starts with the lookbehind and the first character in the string.
In this case, the lookbehind tells the engine to step back one character, and see if a can be matched there.
The engine cannot step back one character because there are no characters before the t.
So the lookbehind fails, and the engine starts again at the next character, the h.
(Note that a negative lookbehind would have succeeded here.)
Again, the engine temporarily steps back one character to check if an "a" can be found there.
It finds a t, so the positive lookbehind fails again.
The lookbehind continues to fail until the regex reaches the m in the string.
The engine again steps back one character, and notices that the a can be matched there.
The positive lookbehind matches.
Because it is zero-length, the current position in the string remains at the m.
The next token is b, which cannot match here.
The next character is the second a in the string.
The engine steps back, and finds out that the m does not match a.
The next character is the first b in the string.
The engine steps back and finds out that a satisfies the lookbehind.
b matches b, and the entire regex has been matched successfully.
It matches one character: the first b in the string.
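A short Python sketch that checks the lookbehind examples from this section. The indexes are zero-based offsets into the subject string.

```python
import re

subject = "thingamabob"

# Positive lookbehind: the first b that has an a directly before it.
positive = re.search(r"(?<=a)b", subject)
print(positive.start())  # 8

# Negative lookbehind: the first b that does not have an a before it.
negative = re.search(r"(?<!a)b", subject)
print(negative.start())  # 10

# (?<!a)b on the three words from the text.
results = {w: bool(re.search(r"(?<!a)b", w)) for w in ["cab", "bed", "debt"]}
print(results)  # {'cab': False, 'bed': True, 'debt': True}
```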
Important Notes About Lookbehind
The good news is that you can use lookbehind anywhere in the regex, not only at the start.
If you want to find a word not ending with an "s", you could use \b\w+(?<!s)\b.
This is definitely not the same as \b\w+[^s]\b.
When applied to John's, the former matches John and the latter matches John' (including the apostrophe).
I will leave it up to you to figure out why.
(Hint: \b matches between the apostrophe and the s).
The latter also doesn't match single-letter words like "a" or "I".
The correct regex without using lookbehind is \b\w*[^s\W]\b
(star instead of plus, and \W in the character class).
Personally, I find the lookbehind easier to understand.
The last regex, which works correctly, has a double negation (the \W in the negated character class).
Double negations tend to be confusing to humans. Not to regex engines, though.
(Except perhaps for Tcl, which treats negated shorthands in negated character classes as an error.)
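The behaviour of the three patterns can be compared in Python. The subject string "John's a" for the last pattern is an invented example that also covers a single-letter word.

```python
import re

# Lookbehind version: a word that does not end in s.
lookbehind = re.findall(r"\b\w+(?<!s)\b", "John's")
print(lookbehind)  # ['John']

# Negated class version: [^s] consumes the apostrophe.
negated = re.findall(r"\b\w+[^s]\b", "John's")
print(negated)  # ["John'"]

# It also misses single-letter words.
single = re.findall(r"\b\w+[^s]\b", "a")
print(single)  # []

# Corrected version without lookbehind.
corrected = re.findall(r"\b\w*[^s\W]\b", "John's a")
print(corrected)  # ['John', 'a']
```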
The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind,
because they cannot apply a regular expression backwards.
The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind.
When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind,
steps back that many characters in the subject string, and then applies the regex inside the lookbehind
from left to right just as it would with a normal regex.
Many regex flavors, including those used by Perl and Python, only allow fixed-length strings.
You can use literal text, character escapes, Unicode escapes other than \X, and character classes.
You cannot use quantifiers or backreferences.
You can use alternation, but only if all alternatives have the same length.
These flavors evaluate lookbehind by first stepping back through the subject string for as many characters as the lookbehind needs,
and then attempting the regex inside the lookbehind from left to right.
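As a rough check of the fixed-length rule in Python's standard re module, the sketch below tries to compile a few lookbehinds. The helper name compiles is hypothetical. Note that Python does accept a quantifier with a fixed count such as a{3}, since the total width is still fixed.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=ab|cd)x"))  # True: both alternatives are 2 characters
print(compiles(r"(?<=a|bc)x"))   # False: alternatives differ in length
print(compiles(r"(?<=a+)x"))     # False: unbounded quantifier
print(compiles(r"(?<=a{3})x"))   # True: fixed repeat count, width is 3
```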
PCRE is not fully Perl-compatible when it comes to lookbehind.
While Perl requires alternatives inside lookbehind to have the same length,
PCRE allows alternatives of variable length.
PHP, Delphi, R, and Ruby also allow this.
Each alternative still has to be fixed-length.
Each alternative is treated as a separate fixed-length lookbehind.
Java takes things a step further by allowing finite repetition.
You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified.
Java determines the minimum and maximum possible lengths of the lookbehind.
The lookbehind in the regex (?<!ab{2,4}c{3,5}d)test has 5 possible lengths.
It can be between 7 and 11 characters long.
When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of characters (7 in this example)
in the string and then evaluates the regex inside the lookbehind as usual, from left to right.
If it fails, Java steps back one more character and tries again.
If the lookbehind continues to fail, Java continues to step back until the lookbehind either matches
or it has stepped back the maximum number of characters (11 in this example).
This repeated stepping back through the subject string kills performance
when the number of possible lengths of the lookbehind grows.
Keep this in mind. Don't choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers inside lookbehind.
Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail
when it should succeed in some situations. These bugs were fixed in Java 6.
The only regex engines that allow you to use a full regular expression inside lookbehind,
including infinite repetition and backreferences, are the JGsoft engine and the .NET framework RegEx classes.
These regex engines really apply the regex inside the lookbehind backwards,
going through the regex inside the lookbehind and through the subject string from right to left.
They only need to evaluate the lookbehind once, regardless of how many different possible lengths it has.
Finally, flavors like Tcl do not support lookbehind at all,
even though they do support lookahead. JavaScript was in the same position until ECMAScript 2018 added lookbehind.
Lookaround Is Atomic
The fact that lookaround is zero-length automatically makes it atomic.
As soon as the lookaround condition is satisfied, the regex engine forgets about everything inside the lookaround.
It will not backtrack inside the lookaround to try different permutations.
The only situation in which this makes any difference is when you use capturing groups inside the lookaround.
Since the regex engine does not backtrack into the lookaround, it will not try different permutations of the capturing groups.
For this reason, the regex (?=(\d+))\w+\1 never matches 123x12.
First the lookaround captures 123 into \1.
\w+ then matches the whole string and backtracks until it matches only 1.
Finally, \w+ fails since \1 cannot be matched at any position.
Now, the regex engine has nothing to backtrack to, and the overall regex fails.
The backtracking steps created by \d+ have been discarded.
It never gets to the point where the lookahead captures only 12.
Obviously, the regex engine does try further positions in the string.
If we change the subject string, the regex (?=(\d+))\w+\1 does match 56x56 in 456x56.
If you don't use capturing groups inside lookaround, then all this doesn't matter.
Either the lookaround condition can be satisfied or it cannot be.
In how many ways it can be satisfied is irrelevant.
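The atomic behaviour described above can be observed in Python, whose re module also does not backtrack into a completed lookahead. A minimal sketch:

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")

# The lookahead captures 123 (then 23, 3, 12, 2 at later positions),
# and it is never retried with a shorter capture at the same position.
first = pattern.search("123x12")
print(first)  # None

# Starting at the 5, the lookahead captures 56 and \1 can match the final 56.
second = pattern.search("456x56")
print(second.group(), second.group(1))  # 56x56 56
```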
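Zero-width assertions find things before or after certain content without including that content.
An assertion states a fact that should be true. The regex engine continues matching only when the assertion holds.
There are four kinds of zero-width assertion.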
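1. Lookahead
For example, [a-z]*(?=ing) matches cook and sing in cooking singing.
Note: with a greedy quantifier in front of the lookahead, the match extends to the last ing, because .* first runs to the end of the string and then backtracks from the right until the lookahead succeeds.
For example, .*(?=ing) matches cooking sing in cooking singing, not cook.
2. Lookbehind
For example, (?<=abc).* matches defg in abcdefg.
Note: lookbehind works the other way around. The engine scans from the left, so the match starts right after the first abc:
for example, (?<=abc).* matches defgabc in abcdefgabc; the match starts after the first abc and does not include it.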
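The effect of greediness on these matches can be seen in a short Python sketch. The lazy variant .*? is an added comparison, not part of the original notes.

```python
import re

text = "cooking singing"

# Restricted to letters, the first match stops before the first "ing".
letters = re.search(r"[a-z]*(?=ing)", text).group()
print(letters)  # cook

# Greedy .* runs to the end and backtracks to the last "ing".
greedy = re.search(r".*(?=ing)", text).group()
print(greedy)  # cooking sing

# Added comparison: lazy .*? stops at the first "ing".
lazy = re.search(r".*?(?=ing)", text).group()
print(lazy)  # cook

# Lookbehind: the match starts right after the first "abc".
after_abc = re.search(r"(?<=abc).*", "abcdefgabc").group()
print(after_abc)  # defgabc
```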
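3. Negative zero-width assertions
A negative zero-width assertion also matches a zero-width position, but the assertion holds only when the expression does not match there.
Negative lookbehind: (?<!expression)
Negative lookahead: (?!expression)
The caveats for negative assertions are the same as for the positive ones.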
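Today I had a requirement to validate password strength: the password must contain a digit, a lowercase letter, and an uppercase letter, and be 8 to 20 characters long.
The length is easy to check directly, but the digit + lowercase + uppercase rule calls for regular expressions.
My first idea was three separate checks with \d+, [a-z]+, and [A-Z]+; the password is strong enough when all three match.
Then I wondered whether a single regular expression could do it. It can,
but it is more complex and involves assertions and groups, which are advanced regex topics.
Here is the final regular expression:
^(?=.*[0-9].*)(?=.*[A-Z].*)(?=.*[a-z].*).{8,20}$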
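A minimal Python sketch of this check, using the expression exactly as given. The function name is_strong and the sample passwords are hypothetical. Each lookahead starts at the beginning of the string and tests one requirement without consuming anything; .{8,20} then enforces the length.

```python
import re

# The expression from the text, unchanged.
STRONG = re.compile(r"^(?=.*[0-9].*)(?=.*[A-Z].*)(?=.*[a-z].*).{8,20}$")

def is_strong(password: str) -> bool:
    """Return True if the password has a digit, an uppercase letter,
    a lowercase letter, and is 8 to 20 characters long."""
    return STRONG.search(password) is not None

print(is_strong("Abcdef12"))         # True
print(is_strong("abcdef12"))         # False: no uppercase letter
print(is_strong("Ab1"))              # False: too short
print(is_strong("Aa1" + "x" * 18))   # False: 21 characters
```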
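Assertions count as an advanced regex topic not because they are hard,
but because the concept is abstract and not easy to grasp. Let me explain it in plain terms.
Without assertions, the expressions we have used so far
can only extract text that follows a pattern, not text that has no pattern.
For example, suppose an HTML source contains a
<title>xxx</title>
tag. With what we know so far, the only fixed parts of the source are
<title> and </title>.
So to get the page title (xxx), the best we can write is an expression like
<title>.*</title>, which matches the complete <title>xxx</title> tag,
not just the page title xxx.
Solving this requires assertions.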
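Before assertions, the reader should understand groups, which helps in understanding assertions.
A group is written with () in a regex. As I understand it, groups serve two purposes:
First, a sub-pattern can be treated as one unit and repeated as a whole.
Second, once grouped, the matched text can be reused through a backreference to simplify the expression.
Take the first purpose. A simple match for an IP address can be written as follows (note that the dot must be escaped as \. to match a literal dot):
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Looking closely, there is a regularity: we can treat
\.\d{1,3}
as a unit, that is, as a group, and repeat that group 3 times. The expression becomes:
\d{1,3}(\.\d{1,3}){3}
which is more concise.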
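Now the second purpose. To match the <title>xxx</title> tag,
a simple regex is:
<title>.*</title>
The expression contains title twice, exactly the same, so it can be shortened with a group:
<(title)>.*</\1>
This is a practical use of a backreference.
With groups, the whole expression always counts as group 0; here group 0 is
<(title)>.*</\1>. Groups are then numbered from left to right, so (title) is group 1.
The \1 syntax refers to the text matched by a group, here group 1.
This lets us write title only once, inside a group, and refer to it later.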
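With that in mind, can we simplify the IP address regex from before?
The original expression is \d{1,3}(\.\d{1,3}){3}, in which \d{1,3} appears twice. Using a backreference we would get:
(\d{1,3})(\.\1){3}
Briefly: \d{1,3} is placed in a group, (\d{1,3}), which is group 1;
(\.\1) is group 2, and inside it \1 refers back to the text of group 1.
Testing shows that this is wrong. Why?
As I keep stressing, a backreference refers only to the matched text, not to the regular expression!
Once the group has matched, the backreference
refers to what was matched: the result, not the expression.
So (\d{1,3})(\.\1){3} actually matches IP addresses whose four numbers are all the same,
for example 123.123.123.123.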
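The difference between repeating a group and repeating a backreference can be tested in Python. This sketch uses re.fullmatch so the whole string must match, and writes the dots escaped as \. so they only match a literal dot; the sample addresses are made up.

```python
import re

plain = re.compile(r"\d{1,3}(\.\d{1,3}){3}")
backref = re.compile(r"(\d{1,3})(\.\1){3}")

# The group version repeats the pattern, so the four numbers may differ.
print(bool(plain.fullmatch("192.168.0.1")))        # True

# The backreference repeats the matched text, so all four must be equal.
print(bool(backref.fullmatch("192.168.0.1")))      # False
print(bool(backref.fullmatch("123.123.123.123")))  # True

# An unescaped dot matches any character, which is why \. is used above.
loose = re.fullmatch(r"\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}", "1a2b3c4")
print(bool(loose))  # True
```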
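The reader has now mastered the legendary backreference; it is that simple.
Next, what is an assertion?
An assertion states that text matching some pattern appears before or after a given string.
Take the opening example: we want xxx, which has no pattern,
but it is always preceded by <title> and followed by </title>, and that is enough.
To require <title> before xxx, use positive lookbehind: (?<=<title>).*
To require </title> after xxx, use positive lookahead: .*(?=</title>)
Put together, (?<=<title>).*(?=</title>) matches xxx.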
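A short Python sketch of the title example. The HTML snippet is invented; the lookbehind here is a fixed-length literal, so Python's re accepts it.

```python
import re

html = "<html><head><title>xxx</title></head></html>"

# Without assertions the tags are part of the match.
whole = re.search(r"<title>.*</title>", html).group()
print(whole)  # <title>xxx</title>

# With lookbehind and lookahead only the text between the tags is matched.
title = re.search(r"(?<=<title>).*(?=</title>)", html).group()
print(title)  # xxx
```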
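The reader may be confused by now; no hurry, I will go through it slowly.
Once you know the rule,
it is simple: both lookahead and lookbehind are relative to xxx, that is, relative to the target string.
If the condition is on the text after the target, the target comes first: use lookahead, placed after the target.
If the condition is on the text before the target, the target comes after: use lookbehind, placed before the target.
If the condition must hold, the assertion is positive. If it must not hold, it is negative.
An assertion is only a condition that helps you find the string you really want; it matches no text itself!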
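Construct | Meaning |
---|---|
(?=X) | Zero-width positive lookahead. Matching continues only if subexpression X matches to the right of this position. |
(?!X) | Zero-width negative lookahead. Matching continues only if subexpression X does not match to the right of this position. |
(?<=X) | Zero-width positive lookbehind. Matching continues only if subexpression X matches to the left of this position. |
(?<!X) | Zero-width negative lookbehind. Matching continues only if subexpression X does not match to the left of this position. |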
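As the syntax shows, assertions use the group parentheses with a leading question mark. The question mark marks a non-capturing construct: the group has no number, cannot be the target of a backreference, and serves only as an assertion.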