Introduction to Regular Expressions

When reposting, please cite the source: http://www.cnblogs.com/zhangjiankun/archive/2013/05/23/3095989.html

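Regular expressions show up in the mainstream operating systems (*nix [Linux, Unix, etc.], Windows, HP, BeOS, etc.), in the mainstream programming languages (PHP, C#, Java, C++, VB, JavaScript, Ruby, Python, etc.), and in countless applications (UE, Notepad++, SI, etc.).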

Getting a good grasp of regular expressions can definitely improve your work.

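Windows/DOS provides wildcards for file searches, namely * and ?. Like wildcards, regular expressions are a tool for matching text, but they describe what you want far more precisely, at the cost of being more complex. For example, you can write a regular expression that finds every string starting with 0, followed by 2-3 digits, then a hyphen "-", and finally 7 or 8 digits (such as 010-12345678 or 0376-7654321). (Source: http://www.cnblogs.com/deerchao/archive/2006/08/24/zhengzhe30fengzhongjiaocheng.html)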
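As a small illustration of the phone-number example, here is a sketch using Python's `re` module. The pattern and the name `PHONE` are my own rendering of the description above, not something taken from the quoted tutorial.

```python
import re

# A leading 0, then 2-3 more digits, a hyphen, then 7 or 8 digits.
PHONE = re.compile(r"0\d{2,3}-\d{7,8}")

for candidate in ["010-12345678", "0376-7654321", "110-1234567", "010-123456"]:
    # fullmatch requires the whole candidate to fit the pattern.
    result = "match" if PHONE.fullmatch(candidate) else "no match"
    print(candidate, "->", result)
```

The first two candidates match; the third does not start with 0 and the fourth has only six digits after the hyphen.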
Today's mainstream regex engines fall into three categories: (1) DFA, (2) traditional NFA, and (3) POSIX NFA.

A simple example illustrates how NFA and DFA engines work differently: (http://baike.baidu.com/link?url=Ix4u2zKRIMSOrFBtU2PEHVASTnnvMr_PUmxAWHKsxWVfK7Yh-3zL6nzZJaL17AZl)

Take the string this is yansen’s blog and the regular expression /ya(msen|nsen|nsem)/ (never mind whether the expression is sensible; it only serves to show how the engines differ).

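An NFA engine works as follows. It first searches the string for y, then checks whether the next character is a. If so, it continues and checks whether the following character is m; if not, it checks whether it is n (at this point the msen alternative is eliminated). It then checks whether the next characters are s and e in turn, and then tests for n. If it finds n, the match succeeds; if not, it tests for m. Why m? Because an NFA is driven by the regular expression and tests the string against it repeatedly, so the same characters in the string may be examined many times. A traditional NFA accepts the first match it finds, so it may leave other (possibly longer) matches undiscovered.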
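Python's `re` module is a backtracking (traditional NFA style) engine, so it can be used to observe the outcome of the walk-through above. This is only a sketch: it shows which alternative ends up matching, not the individual backtracking steps.

```python
import re

text = "this is yansen's blog"
m = re.search(r"ya(msen|nsen|nsem)", text)

print(m.group(0))   # yansen
print(m.group(1))   # nsen  (msen was tried and rejected first)
print(m.span())     # (8, 14)
```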

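A DFA works differently. It scans from the t of this until it finds y. Having located y and knowing that the next character is a, it checks whether the expression has an a at that point, which it does. The character after a in the string is n, so the DFA tests the expression's alternatives against it: msen does not fit and is eliminated, while nsen and nsem both remain. The DFA then keeps reading the string, and when it reaches the n in sen, only the nsen alternative still fits, so the match succeeds.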
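The text-directed idea can be sketched in a few lines of Python. This is a toy, not a real DFA construction: the function name `text_directed_search` and its restriction to patterns of the form prefix(alt1|alt2|...) are assumptions made for illustration. It reads each character of the text once per start position and drops the alternatives that disagree with it.

```python
def text_directed_search(text, prefix, alternatives):
    """Toy text-directed matcher for a pattern of the form prefix(alt1|alt2|...)."""
    for start in range(len(text)):
        if not text.startswith(prefix, start):
            continue
        live = list(alternatives)      # every branch is still possible
        base = start + len(prefix)     # where the alternatives begin
        offset = 0                     # characters consumed after the prefix
        while live:
            if any(len(alt) == offset for alt in live):
                return text[start:base + offset]   # some branch is complete
            if base + offset >= len(text):
                break                  # ran out of text
            ch = text[base + offset]
            # Read one character of text; drop branches that disagree with it.
            live = [alt for alt in live if alt[offset] == ch]
            offset += 1
    return None


print(text_directed_search("this is yansen's blog", "ya", ["msen", "nsen", "nsem"]))  # yansen
```

After reading n, the list of live branches shrinks to nsen and nsem; after the final n, only nsen is left, mirroring the description above.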

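A POSIX NFA engine is similar to a traditional NFA engine, with one difference: it keeps backtracking until it can guarantee that it has found the longest possible match. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and with a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.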
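The difference between "first match" and "longest match" can be sketched in Python. Python's `re` behaves like a traditional NFA here; the helper `longest_alternative` is a hypothetical emulation of the POSIX result (it tries every alternative separately), not how a POSIX engine is actually implemented.

```python
import re

def first_alternative(alternatives, text):
    """Traditional NFA behaviour: the first alternative that succeeds wins."""
    m = re.match("|".join(alternatives), text)
    return m.group(0) if m else None

def longest_alternative(alternatives, text):
    """POSIX-style result: try every alternative and keep the longest match."""
    matches = []
    for alt in alternatives:
        m = re.match(alt, text)
        if m:
            matches.append(m.group(0))
    return max(matches, key=len) if matches else None


alts = ["yan", "yansen"]
print(first_alternative(alts, "yansen's blog"))    # yan
print(longest_alternative(alts, "yansen's blog"))  # yansen
```

Reordering the alternatives changes the result of `first_alternative` but not of `longest_alternative`, which is the point made above about POSIX NFA engines.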

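Generally speaking, DFA engines search faster. NFA engines, however, are driven by the expression and are therefore easier to control, which is why programmers tend to prefer them. Each kind of engine has its strengths, and which one you end up using depends on your needs and the language you work in.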

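Programs that use a DFA engine include awk, egrep, flex, lex, MySQL, and Procmail.
Programs that use a traditional NFA engine include GNU Emacs, Java, grep, less, more, the .NET languages, the PCRE library, Perl, PHP, Python, Ruby, sed, and vi.

Programs that use a POSIX NFA engine include mawk, Mortice Kern Systems’ utilities, and GNU Emacs (when explicitly requested).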

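The following is excerpted from Wikipedia's description of the POSIX regular expression rules (quoted in the original English). Once you have learned the metacharacters listed in the tables below, you have covered the basics of regular expressions. POSIX regular expressions come in two main flavors, BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions); make sure you learn the difference between the two.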

Standards

The IEEE POSIX standard has three sets of compliance: BRE, ERE, and SRE for Basic, Extended, and Simple Regular Expressions. SRE is deprecated in favor of BRE, as both provide backward compatibility. The subsection below covering the character classes applies to both BRE and ERE.

BRE and ERE work together. ERE adds ?, +, and |, and it removes the need to escape the metacharacters ( ) and { }, which are required in BRE. Furthermore, as long as the POSIX standard syntax for regular expressions is adhered to, there can be, and often is, additional syntax to serve specific (yet POSIX compliant) applications. Although POSIX.2 leaves some implementation specifics undefined, BRE and ERE provide a "standard" which has since been adopted as the default syntax of many tools, where the choice of BRE or ERE modes is usually a supported option. For example, GNU grep has the following options: "grep -E" for ERE, "grep -G" for BRE (the default), and "grep -P" for Perl regular expressions.

POSIX basic and extended

In the POSIX standard, Basic Regular Syntax, BRE, requires that the metacharacters ( ) and { } be designated \(\) and \{\}, whereas Extended Regular Syntax, ERE, does not.
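To make the escaping difference concrete, here is a deliberately simplified Python sketch that rewrites a BRE pattern in ERE style. The function name `bre_to_ere` is hypothetical, and the code ignores bracket expressions (where these characters are always literal), so treat it as an illustration of the rule rather than a usable converter.

```python
import re

def bre_to_ere(pattern):
    """Simplified BRE -> ERE rewrite; does not handle bracket expressions."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( \) \{ \} are metacharacters in BRE; ERE writes them bare.
            out.append(nxt if nxt in "(){}" else ch + nxt)
            i += 2
        elif ch in "(){}?+|":
            # Literal in BRE, but a metacharacter in ERE, so escape it.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(bre_to_ere(r"\(ab\)\{2,3\}"))   # (ab){2,3}
print(bre_to_ere("a+b"))              # a\+b
```

Python's `re` module writes groups and intervals the ERE way, so the converted patterns can be tried directly, for example `re.fullmatch(bre_to_ere(r"\(ab\)\{2,3\}"), "abab")`.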

Metacharacter	Description
`.`	Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, `a.c` matches "abc", etc., but `[a.c]` matches only "a", ".", or "c".
`[ ]`	A bracket expression. Matches a single character that is contained within the brackets. For example, `[abc]` matches "a", "b", or "c". `[a-z]` specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: `[abcx-z]` matches "a", "b", "c", "x", "y", or "z", as does `[a-cx-z]`. The `-` character is treated as a literal character if it is the last or the first (after the `^`) character within the brackets: `[abc-]`, `[-abc]`. Note that backslash escapes are not allowed. The `]` character can be included in a bracket expression if it is the first (after the `^`) character: `[]abc]`.
`[^ ]`	Matches a single character that is not contained within the brackets. For example, `[^abc]` matches any character other than "a", "b", or "c". `[^a-z]` matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.
`^`	Matches the starting position within the string. In line-based tools, it matches the starting position of any line.
`$`	Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.
BRE: `\( \)` ERE: `( )`	Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, `\n`). A marked subexpression is also called a block or capturing group.
`\n`	Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX ERE syntax. Some tools allow referencing more than nine capturing groups.
`*`	Matches the preceding element zero or more times. For example, `ab*c` matches "ac", "abc", "abbbc", etc. `[xyz]*` matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. `(ab)*` matches "", "ab", "abab", "ababab", and so on.
BRE: `\{m,n\}` ERE: `{m,n}`	Matches the preceding element at least m and not more than n times. For example, `a{3,5}` matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regular expressions.
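A few rows of the table can be checked with Python's `re` module. Note the assumption: Python is not a POSIX engine, but it writes groups and intervals the ERE way (unescaped parentheses and braces) and also supports `\n` backreferences. The `checks` list is made up for this illustration.

```python
import re

# (pattern, text, whether the whole text should match)
checks = [
    (r"[a.c]", "b", False),        # inside brackets the dot is a literal dot
    (r"[a.c]", ".", True),
    (r"ab*c", "ac", True),         # * allows zero occurrences of b
    (r"ab*c", "abbbc", True),
    (r"[xyz]*", "xyzzy", True),
    (r"(ab)*", "ababab", True),
    (r"a{3,5}", "aa", False),      # fewer than three a's
    (r"a{3,5}", "aaaa", True),
    (r"(ab)c\1", "abcab", True),   # \1 repeats what group 1 matched
    (r"(ab)c\1", "abcba", False),
]

for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    print(f"{pattern!r} vs {text!r}: {matched} (expected {expected})")
```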

Examples:

.at matches any three-character string ending with "at", including "hat", "cat", and "bat".
[hc]at matches "hat" and "cat".
[^b]at matches all strings matched by .at except "bat".
[^hc]at matches all strings matched by .at other than "hat" and "cat".
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.
\[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
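The examples above can be tried out in Python. The word list and the helper `matching` are assumptions made for this sketch; `re.fullmatch` is used so that each word has to fit the pattern as a whole.

```python
import re

words = ["hat", "cat", "bat", "rat"]

def matching(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching(r".at"))       # ['hat', 'cat', 'bat', 'rat']
print(matching(r"[hc]at"))    # ['hat', 'cat']
print(matching(r"[^b]at"))    # ['hat', 'cat', 'rat']
print(matching(r"[^hc]at"))   # ['bat', 'rat']

# Anchors: "cat" is at the end of this string, not at the beginning.
print(re.search(r"^[hc]at", "the cat"))           # None
print(re.search(r"[hc]at$", "the cat").group())   # cat

# Escaped brackets are literal characters.
print(re.findall(r"\[.\]", "[a] [b] c"))          # ['[a]', '[b]']
```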

POSIX Extended Regular Expressions

The meaning of metacharacters escaped with a backslash is reversed for some characters in the POSIX Extended Regular Expression (ERE) syntax. With this syntax, a backslash causes the metacharacter to be treated as a literal character. So, for example, \( \) is now ( ) and \{ \} is now { }. Additionally, support is removed for \n backreferences and the following metacharacters are added:

Metacharacter	Description
`?`	Matches the preceding element zero or one time. For example, `ba?` matches "b" or "ba".
`+`	Matches the preceding element one or more times. For example, `ba+` matches "ba", "baa", "baaa", and so on.
`|`	The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, `abc|def` matches "abc" or "def".

Examples:

[hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at".
[hc]?at matches "hat", "cat", and "at".
[hc]*at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", "at", and so on.
cat|dog matches "cat" or "dog".
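The ERE quantifier and alternation examples behave the same way in Python's `re` module, which can serve as a quick check. The candidate list and the helper `matching` are assumptions made for this sketch.

```python
import re

candidates = ["at", "hat", "cat", "chat", "ccchat", "dog"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matching(r"[hc]+at"))   # ['hat', 'cat', 'chat', 'ccchat']
print(matching(r"[hc]?at"))   # ['at', 'hat', 'cat']
print(matching(r"[hc]*at"))   # ['at', 'hat', 'cat', 'chat', 'ccchat']
print(matching(r"cat|dog"))   # ['cat', 'dog']
```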

POSIX Extended Regular Expressions can often be used with modern Unix utilities by including the command line flag -E.

POSIX character classes

Since many ranges of characters depend on the chosen locale setting (i.e., in some settings letters are organized as abc...zABC...Z, while in some others as aAbBcC...zZ), the POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX	Non-standard	Perl	Vim	ASCII	Description
`[:alnum:]`				`[A-Za-z0-9]`	Alphanumeric characters
	`[:word:]`	`\w`	`\w`	`[A-Za-z0-9_]`	Alphanumeric characters plus "_"
		`\W`	`\W`	`[^A-Za-z0-9_]`	Non-word characters
`[:alpha:]`			`\a`	`[A-Za-z]`	Alphabetic characters
`[:blank:]`				`[ \t]`	Space and tab
		`\b`	`\< \>`	`(?<=\W)(?=\w)|(?<=\w)(?=\W)`	Word boundaries
`[:cntrl:]`				`[\x00-\x1F\x7F]`	Control characters
`[:digit:]`		`\d`	`\d`	`[0-9]`	Digits
		`\D`	`\D`	`[^0-9]`	Non-digits
`[:graph:]`				`[\x21-\x7E]`	Visible characters
`[:lower:]`			`\l`	`[a-z]`	Lowercase letters
`[:print:]`			`\p`	`[\x20-\x7E]`	Visible characters and the space character
`[:punct:]`				[\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]	Punctuation characters
`[:space:]`		`\s`	`\s`	`[ \t\r\n\v\f]`	Whitespace characters
		`\S`	`\S`	`[^ \t\r\n\v\f]`	Non-whitespace characters
`[:upper:]`			`\u`	`[A-Z]`	Uppercase letters
`[:xdigit:]`			`\x`	`[A-Fa-f0-9]`	Hexadecimal digits

POSIX character classes can only be used within bracket expressions. For example, [[:upper:]ab] matches the uppercase letters and lowercase "a" and "b".
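Python's standard `re` module does not understand the `[:name:]` notation, so one way to experiment with it is to substitute the ASCII equivalents from the table first. The dictionary `POSIX_CLASSES` and the function `expand_posix_classes` below are hypothetical helpers for this sketch, and they cover only a handful of classes in their ASCII form.

```python
import re

# ASCII equivalents taken from the table above (without the outer brackets).
POSIX_CLASSES = {
    "[:alnum:]": "A-Za-z0-9",
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:xdigit:]": "A-Fa-f0-9",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range so Python's re can use it."""
    for name, ascii_range in POSIX_CLASSES.items():
        pattern = pattern.replace(name, ascii_range)
    return pattern


pattern = expand_posix_classes("[[:upper:]ab]")
print(pattern)                              # [A-Zab]
print(bool(re.fullmatch(pattern, "Q")))     # True
print(bool(re.fullmatch(pattern, "c")))     # False
```

Because the class name is replaced by a bare range, the result is only meaningful inside a bracket expression, which is the restriction described above.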

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor Vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.

Note that what the POSIX regular expression standards call character classes are commonly referred to as POSIX character classes in other regular expression flavors which support them. With most other regular expression flavors, the term character class is used to describe what POSIX calls bracket expressions.

Advanced Regular Expressions

The Perl regular expression syntax has become the de facto standard.

Perl regular expressions have become the de facto standard, having a rich and powerful set of atomic expressions. Perl has no "basic" or "extended" levels, where the ( ) and { } may or may not have literal meanings. They are always metacharacters, as they are in "extended" mode for POSIX. To get their literal meaning, you escape them. Other metacharacters are known to be literal or symbolic based on context alone. Perl offers much more functionality: "lazy" regular expressions, backtracking, named capture groups, and recursive patterns, all of which are powerful additions to POSIX BRE/ERE.

Standard Perl

The Perl standard is still evolving in Perl 6, but the current set of symbols and syntax has become the de facto standard.

Largely because of its expressive power, many other utilities and programming languages have adopted syntax similar to Perl's; for example, Java, JavaScript, Python, Ruby, Microsoft's .NET Framework, and the W3C's XML Schema all use regular expression syntax similar to Perl's. Some languages and tools such as Boost and PHP support multiple regular expression flavors. Perl-derivative regular expression implementations are not identical, and all implement no more than a subset of Perl's features, usually those of Perl 5.0, released in 1994. With Perl 5.10, this process has come full circle with Perl incorporating syntactic extensions originally developed in PCRE and Python.

Lazy quantification

By default, quantifiers are greedy: they match as many times as possible. When followed by ?, they become lazy and match as few times as possible. For example, consider the string

Another whale sighting occurred on <January 26>, <2004>.

To match (then display) only "<January 26>" and not ", <2004>" it is tempting to write <.*>. But there is more than one >, and the expression can take the second one, and having both, still match, displaying "<January 26>, <2004>". Because the * quantifier is greedy, it will consume as many characters as possible from the string, and "<January 26>, <2004>" has more characters than "<January 26>".

This problem can be avoided by specifying the text that is not to be matched (<[^>]*>), but modern regular expressions also allow a quantifier to be specified as lazy: a question mark placed after the quantifier makes it lazy (<.*?>). With a lazy quantifier, the expression tries the minimal match first. Lazy matching may also improve performance in some cases, because greedy matching can require more backtracking.
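The three patterns discussed here can be compared directly in Python, whose `re` module supports lazy quantifiers. The variable names are chosen for this sketch only.

```python
import re

text = "Another whale sighting occurred on <January 26>, <2004>."

greedy = re.findall(r"<.*>", text)        # * grabs as much as it can
lazy = re.findall(r"<.*?>", text)         # *? stops at the first possible >
negated = re.findall(r"<[^>]*>", text)    # never lets the match cross a >

print(greedy)    # ['<January 26>, <2004>']
print(lazy)      # ['<January 26>', '<2004>']
print(negated)   # ['<January 26>', '<2004>']
```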

Grouping (http://www.cnblogs.com/deerchao/archive/2006/08/24/zhengzhe30fengzhongjiaocheng.html#grouping)

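We have already seen how to repeat a single character (just put a quantifier after it), but what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group), and then specify how many times that subexpression repeats. You can also perform other operations on subexpressions (introduced later).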

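(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches one to three digits; (\d{1,3}\.){3} matches one to three digits followed by a period (together they form the group), repeated three times; finally comes another run of one to three digits (\d{1,3}).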

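Unfortunately, it also matches impossible IP addresses such as 256.300.888.999. Arithmetic comparison would solve this easily, but regular expressions offer no arithmetic at all, so a correct IP address has to be described with lengthy groups, alternation, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).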
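Both IP address expressions can be compared in Python. The names `SIMPLE_IP`, `OCTET`, and `STRICT_IP` are introduced for this sketch; the assembled strict pattern is character for character the expression given above.

```python
import re

SIMPLE_IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One octet: 200-249, or 250-255, or 0-199 (an optional 0/1 plus one or two digits).
OCTET = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
STRICT_IP = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

for address in ["192.168.1.1", "255.255.255.255", "256.300.888.999"]:
    simple = bool(SIMPLE_IP.fullmatch(address))
    strict = bool(STRICT_IP.fullmatch(address))
    print(f"{address}: simple={simple} strict={strict}")
```

The simple pattern accepts all three addresses, while the strict one rejects 256.300.888.999.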

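The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into the details here; you should be able to work out what it means yourself.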

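Once a subexpression has been enclosed in parentheses, the text it matches (that is, what the group captures) can be processed further within the expression or by other programs. By default, each group automatically receives a number: counting from left to right by opening parenthesis, the first group is number 1, the second is number 2, and so on.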

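A backreference searches again for the text matched by an earlier group. For example, \1 stands for the text matched by group 1. Hard to follow? Look at an example: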

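\b(\w+)\b\s+\1\b matches repeated words such as go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between a word start and a word end (\b(\w+)\b); this word is captured into group 1. Then come one or more whitespace characters (\s+), and finally the content captured in group 1, i.e. the word matched earlier (\1).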
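The repeated-word expression works unchanged in Python's `re` module. The sample sentence and the name `REPEATED_WORD` are made up for this sketch.

```python
import re

REPEATED_WORD = re.compile(r"\b(\w+)\b\s+\1\b")
text = "go go to the kitty kitty show"

# group(0) is the whole match; findall returns only what group 1 captured.
whole_matches = [m.group(0) for m in REPEATED_WORD.finditer(text)]
captured_words = REPEATED_WORD.findall(text)

print(whole_matches)    # ['go go', 'kitty kitty']
print(captured_words)   # ['go', 'kitty']
```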

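You can also name a subexpression yourself. To do so, use the syntax (?<Word>\w+) (or replace the angle brackets with single quotes: (?'Word'\w+)), which gives the \w+ group the name Word. To refer back to what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.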
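Named groups are spelled differently across flavors. In Python's `re` module the group is written `(?P<name>...)` and the backreference `(?P=name)`, so the same example can be sketched as follows (the name `NAMED_REPEAT` and the sample text are assumptions for this illustration):

```python
import re

# Python spelling of \b(?<Word>\w+)\b\s+\k<Word>\b
NAMED_REPEAT = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")

m = NAMED_REPEAT.search("a kitty kitty appeared")
print(m.group(0))        # kitty kitty
print(m.group("Word"))   # kitty
print(m.group(1))        # kitty  (named groups still get a number)
```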

Parentheses also have many special-purpose syntaxes. The most common ones are listed below:

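Common grouping syntax
Category	Code/Syntax	Description
Capturing	(exp)	Matches exp and captures the text into an automatically numbered group
	(?<name>exp)	Matches exp and captures the text into a group called name; can also be written (?'name'exp)
	(?:exp)	Matches exp without capturing the matched text and without assigning a group number
Zero-width assertions	(?=exp)	Matches the position before exp
	(?<=exp)	Matches the position after exp
	(?!exp)	Matches a position that is not followed by exp
	(?<!exp)	Matches a position that is not preceded by exp
Comment	(?#comment)	This kind of group has no effect on how the regular expression is processed; it only provides a comment for human readers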
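Most rows of the table can be tried in Python's `re` module, which supports non-capturing groups, the four zero-width assertions, and inline comments. The sample strings are made up for this sketch.

```python
import re

# Capturing vs non-capturing: findall returns the group's text when a group exists.
print(re.findall(r"(ab)+c", "ababc"))      # ['ab']
print(re.findall(r"(?:ab)+c", "ababc"))    # ['ababc']

price_text = "cost $30 or 40 euros"

# (?=exp): the word part that is followed by "ing".
print(re.findall(r"\w+(?=ing\b)", "singing and dancing"))   # ['sing', 'danc']
# (?<=exp): digits that come right after a dollar sign.
print(re.findall(r"(?<=\$)\d+", price_text))                # ['30']
# (?!exp): words starting with q that is not followed by u.
print(re.findall(r"\bq(?!u)\w*", "qatar quit"))             # ['qatar']
# (?<!exp): whole numbers that are not preceded by a dollar sign.
print(re.findall(r"(?<!\$)\b\d+", price_text))              # ['40']

# (?#comment): ignored by the engine.
print(re.findall(r"\d+(?#digits only)", "a1b22"))           # ['1', '22']
```

The first two lines hint at one reason to use (?:exp): grouping for repetition without changing what gets captured.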

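We have already discussed the first two forms. The third, (?:exp), does not change how the regular expression is processed; the text matched by such a group is simply not captured into a group as with the first two forms, and the group gets no number. "Why would I want that?" Good question. Why do you think?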

Position matching

Recursive matching

posted @ 2013-05-23 23:17 kunzj