[2019.04.22]正则表达式的失控——回溯循环
今天做练习的时候觉得题干给出的正则表达式不能正确检测一些非法的电邮地址,所以打算自己动手写一个。在做测试的时候,写出了全新的BUG,并且由此开启了对正则表达式及其匹配引擎的一些了解。
1.什么是回溯循环
引用regular-expression.info的说法,我稍作翻译来解释这个现象。
REF: https://www.regular-expressions.info/catastrophic.html
Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.
思考一下下面的例子:(x+ x+ )+ y。在你大吼着惊叫出“这个诡异的表达式应该改为(x x+ y)来舍弃掉这些可怕的嵌套的表达式并完成相同的功能”之前,让我们假设“x”代表更复杂的表达式,而且某些字符串被这两个“x”反复匹配。下面是一个例子:
Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.
当你将上面的表达式应用于字符串 “ xxxxxxxxxxy ”的时候:
第一次匹配尝试:第一个的“x+”匹配了全部的10个x,所以第二个x+匹配不到任何东西(+匹配≥1次所以匹配失败)。本次匹配失败,引擎回溯到字符串开头;
第二次匹配尝试:第一个的“x+”匹配了9个x,第二个x+匹配到1个x,第一次匹配组尝试成功,开始尝试匹配第二组,并且失败。(由于 + 要求匹配一次或者多次)一次组匹配算作匹配成功,所以第一部分的匹配完成,剩下的y也被y匹配,整个表达式匹配成功并且交付给了你的客户,然后他电脑炸了。
大概就是这样。
The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.
上面的表达式在最后面的 y 缺失的时候,会便显得非常糟糕。当y没能被匹配的时候,表达式引擎回溯了:
注:(a,b)表示第一个x+匹配了a个x,第二个x+匹配了b个x。x[1] 表示第一个x+,而x[2]表示第二个x+
第一次匹配尝试:(9,1),匹配不到y,引擎回溯,此时x[2]只匹配了1个x不能提供回溯,x[1]可以提供一个x来进行回溯;
第二次匹配尝试:(8,2),匹配不到y,引擎回溯,此时x[2]匹配了2个x,可以提供回溯,于是x[2]提供一个x来进行回溯;
第三次匹配尝试:(9,1),匹配不到y,引擎回溯,x[1]和x[2]都提供过1个x,这次x[2]提供2个x来回溯;
第四次匹配尝试:(7,1),发现可以匹配第二组(1,1),结果匹配不到y,引擎回溯
后续:(6,4); (6,2)(1,1); (6,1)(2,1); (6,1)(1,2)……现在我想你明白发生什么了1吧。
If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.
如果你用10个x和1个y的字符串来RegexBuddy's debugger网站做测试,你会发现一共用了2558次匹配引擎才能发现y缺失了。如果是11个x,它需要5118步;12个则是10238步。很显然这是一个O(2n)复杂度的情形。当x数量达到21个的时候,引擎在了280万步之后发现了灾难性的回溯问题,死掉了2。
RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.
RegexBuddy 会自己发现死循环并放弃匹配尝试,可是有些表达式引擎(比如.NET)会一直试下去,或者直接因为栈溢出崩溃掉(比如5.10版本以前的Perl)。在Win上发生堆栈溢出是很烦人的,因为你的应用程序会突然“无缘无故没有理由地”闪退。所以在运营使用自定义正则表达式的Web服务的时候一定要小心,新手有惊人的写出指数级别的复杂表达式的能力(译者按:说的是我)。
1get the drift -- Understand the general meaning or purport
2Bow out -- To leave a job or stop doing an activity, usually after a long time
根据上述内容,我们可以得出一些结论:
1.正则表达式是回溯式判定的:当包含“+”这种“to unlimitied times”的符号时,一次判定到字符串结尾,表达式却没有判定完,引擎会回溯并减少先前的判定次数,重试判定。
2.没有长时间循环限制的引擎会一直尝试下去直到所有的判定组合都被尝试(暴力测试啊!)
所以为了让自己写的表达式不要被奇怪的输入玩坏,童鞋们一定要:
1.消除掉表达式中的嵌套量词,可选项不要用+这种东西:比如([\w]+[\w.-]+) -->([\w][\w.-]+),这尤其重要!
使用一个只匹配一次的量词可以避免灾难性的回溯
2.[我还没研究明白dalao想表达什么]
一个实际的案例(其实就是我写的Bug):匹配电邮地址
Regex: ([A-Za-z0-9]+([\w.-]?[A-Za-z0-9]+)*)+@([\w-]+\.)+([A-Za-z0-9]+)
Testcase:ADU5dX-532.dx_aerx@gcore..com
灾难性回溯发生在匹配完第一个“.”之后,由于 ([\w.-]?[A-Za-z0-9]+)*) 和 ([\w-]+\.) 都含有第二次匹配需要的“.”,所以引擎反复在这两组匹配中回溯,在977945次尝试后,系统中断了尝试。
更改方法:少用几个“*”、“+”和“()”就能极大的减少匹配尝试的次数。正则表达式里面的括号是有自己的含义的,不能随意添加或删减,会影响到匹配结果或是尝试次数!
更正结果:([A-Za-z0-9]([\w.-]?[A-Za-z0-9])+)*)@([\w-]+\.)+([A-Za-z0-9]+)