Regular expressions do not really live up to their name, because they look anything but regular. Take this "regular" expression:
(?<=\n)\s*(?<city>[^\n]+)\s*,\s*(?<country>\w+)\s+(?<pcode>.{3}\s*.{3}).*$
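Can you quickly tell what it does? The ugliness of regular expressions may look like a minor issue, but it is a big one for programmers who care about beauty, because we believe beauty is part of code quality: be beautiful, or be bugful. Code whose structure you can see at a glance, without running a manual parse in your head, is less likely to hide errors. Regular expressions often carry the burden of input validation, so a flaw in one can turn into a system vulnerability. The second problem is that complex regular expressions are basically write-only: some time after writing one, even its author no longer knows which part does what, and changing it is laborious. In extreme cases even writing it the first time is hard: how many closing parentheses does this spot need? I have to scan the earlier part again.
For these reasons I tried an easier way to write regular expressions, which I call REally REgular REgex, or Re{3}gex for short. The notation mainly imitates S-expressions and the indentation rules of Haskell, Python, and the like. Anyone who has played with those should find the example easy to read: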
namespace TestRe3gex {
    using Re3gex;
    using System;
    class TestRe3gex {
        public static void Main(string[] args)
        {
            string s = @"
                # regular expression written in this way is so crystal clear
                # that there is no need to comment on it
                alter
                    one of
                        literal abc_d
                                   u007f
                        any except range 2 5
                    sans
                    one of
                        literal z
                        any except range _     _
                                         u0000 u007f
                    white-space char
                    *
                        dec digit
                    group
                        literal xkdkdk
                                _kdkd
                                n
                                ldll ldldl

                        any except
                            range 0 9
                ";
            Console.WriteLine(Re3gex.Parse(s));
        }
    }
}
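The program prints [abc\u007fd[^2-5]-[z[^\u0000-\u007f]]]|\s|\d*|(?:xkdkdk\nkdkdldllldldl[^0-9]), which is the standard .NET regular expression that the variable s represents.
A few notes:
1. Spaces and line breaks can be added freely. A comment takes a line of its own and starts with "#". To match a space or a line break, use an escape sequence.
2. Re{3}gex escape sequences are a bit unusual: instead of starting with \ and continuing on the same line, you write an underscore _ where the escaped character belongs and then continue on the next line, directly below that underscore. For example: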
literal ___abc
        rnu002f
is equivalent to
\r\n\u002fabc.
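3. Parentheses are not needed anywhere. In fact ( and ) are not meta-symbols in Re{3}gex at all; the relationship between subexpressions and operators (such as quantifiers) is expressed entirely through indentation.
4. Many elements that regular expressions write as symbols are written as words or abbreviations instead, to prevent typing mistakes and to make them easier to understand. For example, \w is written "word char", \s is "white-space char", \G is "follow prev", and (?<name> subexpression) is "capture as name subexpression"... For brevity, a few common symbols are kept: *, +, ?, {n,m}, ^ and $.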
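To make note 2 concrete, here is a minimal C# sketch of how a two-line escape could be decoded. It is not the actual Re3gex code: TwoLineEscapeSketch and Decode are hypothetical names, and the sketch assumes that u takes four hex digits, x takes two, and every other supported escape is a single character. Escapes whose translation is not a plain backslash prefix (s, ' and the = backreferences) are deliberately left out.

```csharp
using System;
using System.Text;

static class TwoLineEscapeSketch
{
    // Hypothetical helper: combines a literal line with the line written
    // beneath it and returns the corresponding regex fragment.
    public static string Decode(string top, string below)
    {
        var result = new StringBuilder();
        for (int col = 0; col < top.Length; col++)
        {
            char c = top[col];
            if (c == ' ') continue;                        // white space in literals is not significant
            if (c != '_') { result.Append(c); continue; }  // ordinary character, copied as is

            // An underscore marks an escape; its body starts in the same column one line down.
            if (col >= below.Length || below[col] == ' ')
                throw new FormatException($"No escape below column {col}.");

            char kind = below[col];
            if (kind == 's' || kind == '\'' || kind == '=')
                throw new NotSupportedException($"Escape '{kind}' is not covered by this sketch.");

            int length = kind == 'u' ? 5 : kind == 'x' ? 3 : 1;
            if (col + length > below.Length)
                throw new FormatException($"Truncated escape below column {col}.");

            string body = below.Substring(col, length);
            result.Append(body == "_" ? "_" : "\\" + body); // "_ over _" is a plain underscore
        }
        return result.ToString();
    }

    static void Main()
    {
        // The literal from the first example: "abc_d" with "u007f" under the underscore.
        Console.WriteLine(Decode("abc_d", "   u007f"));
    }
}
```

The column scan is the whole trick: because the escape body lives on its own line, the top line keeps the visual length of the text it matches, apart from the single underscore placeholder.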
Here are two more examples:
^
* any except literal <>
* capture + capture capture as Open literal <
                    * any except literal <>
          + capture balance Close - Open literal >
                    * any except literal <>
if capture Open
then lookahead not null
$
translates to
^[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!)|)$
This regular expression checks whether < and > are properly paired.
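A quick way to try the translated pattern is to feed it to the standard .NET Regex class directly. The sketch below does only that; the class name BalancedBrackets and the sample strings are my own, made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedBrackets
{
    // The translated pattern from the example above, copied verbatim.
    const string Pattern =
        @"^[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!)|)$";

    public static bool IsBalanced(string input) => Regex.IsMatch(input, Pattern);

    static void Main()
    {
        // Made-up inputs: the first two are balanced, the last two are not.
        foreach (string s in new[] { "<a<b>c>", "a<b>c<d>", "<a<b>c", "a>b<" })
            Console.WriteLine($"{s,-10} {IsBalanced(s)}");
    }
}
```

Every < pushes a capture onto the Open group, every > pops one through the balancing group, and the final conditional rejects the input if anything is still left on Open.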
lookbehind is literal _
                      n
* white-space char
capture as city
    + any except literal _
                         n
* white-space char
literal ,
* white-space char
capture as country
    + word char
+ white-space char
capture as pcode
    {3} any
    * white-space char
    {3} any
* any
$
translates to
(?<=\n)\s*(?<city>[^\n]+)\s*,\s*(?<country>\w+)\s+(?<pcode>.{3}\s*.{3}).*$
This expression extracts the city, country, and postal code from an address.
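For readers who want to see the named groups at work, here is a small sketch that applies the translated pattern with the standard .NET Regex class. The AddressExtractor class, its Extract method, and the sample address are assumptions made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class AddressExtractor
{
    // The translated pattern from above, with \ written out as backslashes.
    const string Pattern =
        @"(?<=\n)\s*(?<city>[^\n]+)\s*,\s*(?<country>\w+)\s+(?<pcode>.{3}\s*.{3}).*$";

    // Returns null when the address has no line of the form "city, country postcode".
    public static (string City, string Country, string PostalCode)? Extract(string address)
    {
        Match m = Regex.Match(address, Pattern);
        if (!m.Success) return null;
        return (m.Groups["city"].Value, m.Groups["country"].Value, m.Groups["pcode"].Value);
    }

    static void Main()
    {
        // Made-up sample; the lookbehind requires a line break before the city line.
        var parts = Extract("123 Example St\nToronto, Canada M5V 2T6");
        Console.WriteLine(parts.HasValue
            ? $"city={parts.Value.City}, country={parts.Value.Country}, pcode={parts.Value.PostalCode}"
            : "no match");
    }
}
```

Note that the pattern expects the city line to come after a line break, so a single-line address does not match.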
Finally, a more detailed description:
Re{3}gex specification
/*
Overview
========
1. Conceptual structure in EBNF
    <exp> ::=
        <exp> <exp>
        literal {<literal>}
        <position>
        <class>
        <quantifier> <exp>
        option <options> <exp>
        <grouping>
        <alternation>
        null
2. Prefix-notation style syntax.
3. Indentation instead of parentheses is used to express nesting.
4. Double-line style escape sequences.
5. Empty lines are allowed.
6. All subclauses under one clause must be at the same indentation level. For example, both
    * literal a
      any
and
    *
        literal a
        any
are valid, but
    * literal a
        any
is not.
<literal>
=========
1. A literal can either follow the "literal" keyword on the same line, or start on a new, indented line.
2. All literals under one "literal" keyword are concatenated.
3. White space in literals is not significant.
4. Character escapes and backreference constructs:
    _
    _       matches the underscore character
    _
    s       matches the space character
    _
    '       matches the double quote
    _
    a       matches a bell \u0007
    _
    b       matches a backspace \u0008
    _
    t       matches a tab \u0009
    _
    r       matches a carriage return \u000D
    _
    v       matches a vertical tab \u000B
    _
    n       matches a new line \u000A
    _
    e       matches an escape \u001B
    _
    x??     (x followed by two hex digits) matches an ASCII character using hex representation
    _
    u????   (u followed by four hex digits) matches a Unicode character
    _
    =?      translates to backreference \?, where ? is a digit
    _
    =???    translates to backreference \k<???>, where ??? is an identifier
Position assertions
====================
^
$
str beg = \A
str end before newline = \Z
str end = \z
follow prev = \G
word bounary = \b
not word bounary = \B
Quantifiers
===========
1. <quantifier> ::=
       *
       +
       ?
       {n}
       {n,}
       {n,m}
       and their lazy versions
2. Quantifiers are translated as is.
3. If necessary, the subexpression of a quantifier is parenthesized with (?: ).
Character classes
=================
1. <class> ::=
       any
       one of {<concatenated class>}
       any except {<concatenated class>}
       range <char> - <char>
       unicode <identifier>
       not unicode <identifier>
       word char
       not word char
       white-space char
       not white-space char
       dec digit
       not dec digit
   <concatenated class> ::=
       literal <literal>
       <class>
2. In "one of" and "any except" the following can be included:
       literals (without the "literal" keyword)
       other character classes (must start on a new line, properly indented)
3. "any" translates to "."
   "range ? - ?" translates to "[?-?]"
   "unicode " translates to "\p{}"
   "not unicode " translates to "\P{}"
   "word char" translates to "\w"
   "not word char" translates to "\W"
   "white-space char" translates to "\s"
   "not white-space char" translates to "\S"
   "dec digit" translates to "\d"
   "not dec digit" translates to "\D"
4. A "sans" clause can follow a character class and be followed by another character class, all at the same indentation level; together they are translated into a character class subtraction expression.
5. "[0-9a-fA-F]" =
       one of range 0 9
              range a f
              range A F
   "[^0-9a-fA-F]" =
       any except range 0 9
                  range a f
                  range A F
Options
=======
<options>:
1. Options are:
       "IgnoreCase"
       "Multiline"
       "ExplicitCapture"
       "Singleline"
       "-IgnoreCase"
       "-Multiline"
       "-ExplicitCapture"
       "-Singleline"
2. Options are written on the same line as "option".
3. Options are separated by spaces; the prefix "-" means the option is turned off.
4. We don't need "IgnorePatternWhiteSpace".
5. With subexpressions, an option is translated to the option grouping construct (?imnsx-imnsx: ); without subexpressions, to the misc construct (?imnsx-imnsx).
Grouping constructs
===================
<grouping> ::=
    capture <exp>
    capture as <identifier> <exp>
    group <exp>
    balance <identifier> - <identifier>
    lookahead is <exp>
    lookahead not <exp>
    lookbehind is <exp>
    lookbehind not <exp>
    no backtrack <exp>
"capture " translates to ()
"capture as name " translates to (?<name> )
"group " translates to (?: )
"balance name1 - name2 " translates to (?<name1-name2>)
"lookahead is " translates to (?= )
"lookahead not " translates to (?! )
"lookbehind is " translates to (?<= )
"lookbehind not " translates to (?<! )
"no backtrack " translates to (?> )
Alternations
=============
<alternation> ::=
    alter {<exp>}
    if <exp> then <exp> [else <exp>]
    if capture <identifier> then <exp> [else <exp>]
1. "alter " translates to subexpressions separated by |
2. "if
    then
    else " translates to (?()|)
3. "if capture name
    then
    else " translates to (?(name)|)
4. The "then" and "else" clauses must each start on a new line, at the same indentation level as the "if" clause.
null
====
1. Empty subexpression is generally meaningless. In order to use empty subexpression in lookahead/lookbehind construct, use "null".
2. Example: "lookahead not null" translates to "(?!)"
Comments
========
Lines beginning with # are ignored
*/
The source code of Re3gex.Re3gex.Parse and the compiled DLL are available at: https://files.cnblogs.com/yushih/Re3gex.zip
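One item in the specification deserves a concrete illustration: rule 4 under "Character classes" maps "sans" to .NET character class subtraction, which is written [base-[excluded]]. The sketch below uses that .NET syntax directly, without going through Re3gex; the SubtractionDemo class and the pattern [a-z-[aeiou]] are examples of my own, not taken from the library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SubtractionDemo
{
    // Lowercase letters minus the vowels, using .NET character class subtraction.
    const string ConsonantPattern = "[a-z-[aeiou]]";

    // Keeps only the characters that the subtraction class matches.
    public static string Consonants(string text) =>
        string.Concat(Regex.Matches(text, ConsonantPattern)
                           .Cast<Match>()
                           .Select(m => m.Value));

    static void Main()
    {
        Console.WriteLine(Consonants("regular"));
    }
}
```

Under the spec's rules, the Re{3}gex form of this class would presumably be a "one of range a z" clause, a "sans" line, and a "one of literal aeiou" clause at the same indentation level, though I have not run that through the parser.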
Updates:
I hacked in an experimental way to name and reference subexpressions:
let subexp1 = literal blahblah
*
    ref subexp1
which translates to the standard .NET expression:
(?:blahblah)*
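Added support for embedded Generalized Nondeterministic Finite Automaton (GNFA) definitions; Re3gex.Parse can convert such a GNFA into a .NET regex. What is this good for? As everyone knows, writing a regex that recognizes a C comment /* ... */ is hard, while defining a DFA that recognizes it is easy. But if you write the DFA by hand, you also have to write code to drive it, which is a hassle. A better approach is to define the DFA directly and let the .NET regex engine run it. That still needs one intermediate step, namely converting the DFA into a regex. A DFA is a special case of a GNFA, so the embedded GNFA of Re{3}gex does the job. For example: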
 1  namespace RemoveComments {
 2      using Re3gex;
 3      using System;
 4      using System.Text.RegularExpressions;
 5
 6      class RemoveComments {
 7          public static void Main(string[] args)
 8          {
 9              string comment_pattern = @"
10                  GNFA start=1
11                       accept=5
12                  1->2 literal /
13                  2->3 literal *
14                  3->4 literal *
15                   ->3 any except literal *
16                  4->5 literal /
17                   ->4 literal *
18                   ->3 any except literal *
19                                  literal /
20                  #state 5 has no outgoing transition
21                  ";
22              string str_with_comments =
23                  "blah1 /**comment 1**/ " +
24                  "blah2 /*////comment2///*/ " +
25                  "blah3 /**/ /*/*/ " +
26                  "//**// = // " +
27                  "this is not comment: /*/";
28
29              Console.WriteLine
30                  (Regex.Replace
31                      (str_with_comments,
32                       Re3gex.Parse(comment_pattern),
33                       ""));
34          }
35      }
36  }
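The program prints blah1 blah2 blah3 // = // this is not comment: /*/. Some explanation:
Lines 10-11: an embedded GNFA definition starts with "GNFA"; "start=<state name>" then names the start state and "accept=<state 1> <state 2> ..." names the accepting states. State names may use digits, letters, and underscores.
Line 12: the definition of state "1". State "1" has one transition, to state "2", and the input of that transition is "/".
Line 13: same idea as line 12.
Lines 14 and 15: state "3" has two transitions. One accepts "*" and moves to state "4"; the other accepts any single character except "*" ([^\*]) and returns to state "3".
Lines 16-19: state "4" has three transitions, which accept "/", "*", and any character other than "/" and "*", and move to states "5", "4", and "3" respectively.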
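For comparison, here is the regex you get by eliminating states 2, 3 and 4 of this automaton by hand. This is a sketch under my own assumptions: I derived the pattern manually, so Re3gex.Parse may well emit a different (but equivalent) string, and the CommentStripper class and its Strip method are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStripper
{
    // Hand-derived from the five-state automaton above (an assumption, not Re3gex output):
    //   /\*                 states 1 -> 2 -> 3
    //   [^*]*\*+            loop in state 3, move to 4, loop in state 4
    //   (?:[^/*][^*]*\*+)*  fall back from state 4 to 3 and come back to 4 again
    //   /                   state 4 -> 5
    const string CommentPattern = @"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/";

    public static string Strip(string source) =>
        Regex.Replace(source, CommentPattern, "");

    static void Main()
    {
        Console.WriteLine(Strip("blah1 /**comment 1**/ blah2 /*////comment2///*/ blah3"));
        Console.WriteLine(Strip("this is not comment: /*/"));
    }
}
```

Writing this by hand takes exactly the kind of care the automaton definition avoids, which is the point of the GNFA feature.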