.NET Regular Expressions: Regex and Balanced Matching (转)

< DOCTYPE html PUBLIC -WCDTD XHTML StrictEN httpwwwworgTRxhtmlDTDxhtml-strictdtd>

One of the questions that seems to come up a lot is that someone wants to match balanced parenthesis. Something like the string “(aa (bbb) (bbb) aa)” and they want to match from the beginning parenthesis to the matching end parenthesis. Generally this is not possible with regular expression, that language just is not descriptive enough to handle this. For the longest time this is how I answered these question when they came to me.

However in .Net this is actually possible with something called Balancing Group Definition. This construct generally looks like (?<name1-name2>). The following is what MSDN has to say about this:

Balancing group definition. Deletes the definition of the previously defined group name2 and stores in group name1 the interval between the previously defined name2 group and the current group. If no group name2 is defined, the match backtracks. Because deleting the last definition of name2 reveals the previous definition of name2, this construct allows the stack of captures for group name2 to be used as a counter for keeping track of nested constructs such as parentheses. In this construct, name1 is optional. You can use single quotes instead of angle brackets; for example, (?'name1-name2').

The following expression matches all balanced opening and closing angle brackets(<>). Angle brackets were used because they do no require escaping like parenthesis and make the expression a little easier to read:

            <

            [^<>]*

            (

                        (

                                    (?<Open><)

                                    [^<>]*

                        )+

                        (

                                    (?<Close-Open>>)

                                    [^<>]*

                        )+

            )*

            (?(Open)(?!))

>

The outer most group just matches an open angle bracket followed by anything that is not a angle bracket followed by close angle bracket. I will explain “(?(Open)(?!))” later.

The inner group does all of the interesting angle bracket matching. The Open group matches only the open angle bracket and the following part of expression matches anything that is not an angle bracket. So the first group will basically match anything up till the first close angle bracket.

It is best to think of a Group as a Stack of captures. Where the top of the stack is the last capture made. (?<Close-Open>\)) Matches to “)” and pops a capture off of the Open group’s capture stack. This match can only be successful if and only if the Open group’s capture stack is not empty. This is a fancy way of saying that for every match of this group there must be a match of the group Open.

So now we know that for every closing angle bracket there must have been an opening angle bracket. However we still have done nothing to assert that for every opening angle bracket there is a matching closing angle bracket. That is where the (?(Open)(?!)) part of the expression comes into play. This expression tells Regex to match (?!) if the Open group still contains a match(i.e. there were more open angle brackets then close angle brackets). Trying to match (?!) will always cause the expression to fail. Basically this is a way of making the expression fail if the Open group still contains a capture.

posted @ 2008-10-21 10:01  netcorner  阅读(173)  评论(0编辑  收藏  举报