Spot the Bugs: Write Your Own Regex Engine (4), the Parse-Method Factory and Unit Tests

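  The parse-method factory is a static class that generates the parse method for each kind of sub-pattern. It is the core of the regex engine and also the part that takes the most care to get right, so we will go through it one pattern type at a time.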

 

The simplest case is matching a plain string literal, which string.Compare handles for us:

//abc
public static ParseFunc MaxMatch(string str) {
    ParseFunc func = (string input, int index, out string output)
    => {
        bool result = false;
        output = string.Empty;
        if (string.Compare(input, index, str, 0, str.Length) == 0) {
            result = true;
            output = str;
        }
        return result;
    };
    return func;
}
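
To see how a generated parse method is called, here is a minimal, self-contained sketch. The ParseFunc delegate shape is an assumption inferred from the lambda signature above (its real declaration lives in an earlier part of this series), and the names LiteralMatchDemo and Literal are made up for illustration. The sketch also swaps in string.CompareOrdinal and adds an index range check; neither is in the original method.

```csharp
using System;

public static class LiteralMatchDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    // Same idea as MaxMatch. The ordinal comparison and the range guard
    // are additions of this sketch, not part of the original code.
    public static ParseFunc Literal(string str) {
        return (string input, int index, out string output) => {
            output = string.Empty;
            if (index < 0 || index > input.Length) {
                return false; // CompareOrdinal would throw for such an index
            }
            if (string.CompareOrdinal(input, index, str, 0, str.Length) != 0) {
                return false;
            }
            output = str;
            return true;
        };
    }

    public static void Main() {
        ParseFunc abc = Literal("abc");
        string output;
        Console.WriteLine(abc("xabcx", 1, out output)); // True: "abc" starts at index 1
        Console.WriteLine(abc("xabcx", 0, out output)); // False: 'x' is not 'a'
        Console.WriteLine(abc("xab", 1, out output));   // False: input ends before "abc" is complete
    }
}
```

The ordinal comparison is a design choice for the sketch: it compares characters exactly and does not depend on the current culture, which is usually what a pattern matcher wants.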

 

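Slightly more complex is the closure of a string, where the string may repeat several times. A do loop plus three helper variables, start, matchFaild, and willOutput, is enough. start records the index where matching begins. matchFaild records whether the current attempt to match further has failed: on failure the loop exits, and on success index advances by the length of the string, that is, the scan pointer moves forward by that length. willOutput holds the final matched text. Because an empty result is reported as failure, the method requires at least one occurrence, as its name says.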

//(ab)*
public static ParseFunc OneOrMoreMaxMatch(string str) {
    ParseFunc func = (string input, int index, out string output)
    => {
        int start = index;
        bool matchFaild = false;
        string willOutput = string.Empty;

        do {
            matchFaild = false;
            if (string.Compare(input, index, str, 0, str.Length) == 0) {
                index += str.Length;
            }
            else {
                matchFaild = true;
            }

            if (!matchFaild) {
                willOutput = input.Substring(start, index - start);
            }
        } while (!matchFaild);

        output = willOutput;
        // An empty result counts as failure.
        return !string.IsNullOrEmpty(willOutput);
    };
    return func;
}
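
The same loop can be written more compactly. The following is only a sketch under assumptions: the ParseFunc delegate shape is inferred from the listing above, RepeatDemo and OneOrMore are illustrative names, and the guard against an empty literal (which would otherwise never advance index) is an addition that the original method does not have.

```csharp
using System;

public static class RepeatDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    // Matches one or more consecutive copies of str, starting at index.
    public static ParseFunc OneOrMore(string str) {
        return (string input, int index, out string output) => {
            int start = index;
            // str.Length > 0 is an added guard: an empty literal would loop forever.
            while (str.Length > 0
                   && index >= 0 && index <= input.Length
                   && string.CompareOrdinal(input, index, str, 0, str.Length) == 0) {
                index += str.Length; // move the scan pointer past this copy
            }
            output = index > start ? input.Substring(start, index - start) : string.Empty;
            return output.Length > 0; // zero copies is reported as failure
        };
    }

    public static void Main() {
        ParseFunc abs = OneOrMore("ab");
        string output;
        Console.WriteLine(abs("ababax", 0, out output) + " " + output); // True abab
        Console.WriteLine(abs("ababax", 4, out output));                // False: "ax" does not start with "ab"
    }
}
```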

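Next comes alternation (the "or" relation). As mentioned earlier, for a group of alternative sub-patterns we take the longest match, so we introduce an intermediate variable maxMatch that stores the longest output produced by any alternative so far. matched records whether at least one alternative matched; if so, the longest match is returned.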

//ab|abc
public static ParseFunc MatchOr(IList<RegexNode> nodes) {
    ParseFunc func = (string input, int index, out string output)
      => {
        int start = index;
        bool matched = false;
        string maxMatch = string.Empty;

        foreach (var n in nodes) {
            string tempOutput;
            if (n.Parse(input, index, out tempOutput)) {
                matched = true;
                if (tempOutput.Length > maxMatch.Length) {
                    maxMatch = tempOutput;
                }
            }
        }
        if (matched) {
            index += maxMatch.Length;
        }

        output = input.Substring(start, index - start);
        // An empty result counts as failure.
        return !string.IsNullOrEmpty(output);
    };
    return func;
}
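
A small runnable sketch of the longest-alternative rule follows. To stay self-contained it works on a list of ParseFunc delegates instead of RegexNode objects, which is a simplification; the delegate shape is inferred from the listings, and AlternationDemo, Literal, and LongestOf are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class AlternationDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    // Literal matcher used as a building block (ordinal comparison is this sketch's choice).
    public static ParseFunc Literal(string str) {
        return (string input, int index, out string output) => {
            bool ok = index >= 0 && index <= input.Length
                      && string.CompareOrdinal(input, index, str, 0, str.Length) == 0;
            output = ok ? str : string.Empty;
            return ok;
        };
    }

    // Tries every alternative at the same index and keeps the longest output.
    public static ParseFunc LongestOf(IList<ParseFunc> alternatives) {
        return (string input, int index, out string output) => {
            string best = string.Empty;
            foreach (ParseFunc alternative in alternatives) {
                string candidate;
                if (alternative(input, index, out candidate) && candidate.Length > best.Length) {
                    best = candidate;
                }
            }
            output = best;
            return best.Length > 0;
        };
    }

    public static void Main() {
        // Pattern: ab|abc
        ParseFunc abOrAbc = LongestOf(new List<ParseFunc> { Literal("ab"), Literal("abc") });
        string output;
        abOrAbc("abcd", 0, out output);
        Console.WriteLine(output); // abc: both alternatives match, the longer one wins
        abOrAbc("abd", 0, out output);
        Console.WriteLine(output); // ab: only the shorter alternative matches
    }
}
```

Note that the order of the alternatives does not matter here, since every alternative is tried before one is chosen.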

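A bit more complex again is the closure of an alternation. Closures are basically always handled with a do loop, and alternation is a foreach loop, so here the two loops are nested. The variables are again matched and maxMatch.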
//(ab|abc)*
public static ParseFunc MatchOneOrMoreWithOr(IList<RegexNode> nodes) {
    ParseFunc func = (string input, int index, out string output)
        => {
        int start = index;
        bool matched = false;
        string maxMatch = string.Empty;

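        do {
            matched = false;
            // Reset on every pass. Otherwise a longer match left over from an
            // earlier pass would be reused when only a shorter alternative matches,
            // and index could advance past the end of the real match.
            maxMatch = string.Empty;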
            foreach (var n in nodes) {
                string tempOutput;
                if (n.Parse(input, index, out tempOutput)) {
                    matched = true;
                    if (tempOutput.Length > maxMatch.Length) {
                        maxMatch = tempOutput;
                    }
                }
            }
            if (matched) {
                index += maxMatch.Length;
            }
        } while (matched);

        output = input.Substring(start, index - start);
        // An empty result counts as failure.
        return !string.IsNullOrEmpty(output);
    };
    return func;
}
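
One detail worth checking with a concrete input is that the longest match has to be recomputed from scratch on every pass of the outer loop. For (ab|abc)* on "abcabx", the first pass picks "abc" and the second pass must pick "ab", not the "abc" left over from the first pass. The sketch below illustrates this; it uses ParseFunc delegates in place of RegexNode objects, and the delegate shape and all names are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class AlternationClosureDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    public static ParseFunc Literal(string str) {
        return (string input, int index, out string output) => {
            bool ok = index >= 0 && index <= input.Length
                      && string.CompareOrdinal(input, index, str, 0, str.Length) == 0;
            output = ok ? str : string.Empty;
            return ok;
        };
    }

    // Repeats "take the longest alternative at index" until no alternative matches.
    public static ParseFunc OneOrMoreOf(IList<ParseFunc> alternatives) {
        return (string input, int index, out string output) => {
            int start = index;
            while (true) {
                string best = string.Empty; // fresh for every pass
                foreach (ParseFunc alternative in alternatives) {
                    string candidate;
                    if (alternative(input, index, out candidate) && candidate.Length > best.Length) {
                        best = candidate;
                    }
                }
                if (best.Length == 0) {
                    break; // nothing matched on this pass
                }
                index += best.Length;
            }
            output = index > start ? input.Substring(start, index - start) : string.Empty;
            return output.Length > 0;
        };
    }

    public static void Main() {
        // Pattern: (ab|abc)*
        ParseFunc closure = OneOrMoreOf(new List<ParseFunc> { Literal("ab"), Literal("abc") });
        string output;
        closure("abcabx", 0, out output);
        Console.WriteLine(output); // abcab: "abc" on the first pass, "ab" on the second
    }
}
```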

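Concatenation also processes several sub-patterns, but here they must all match one after another, so the loop exits with break as soon as one sub-pattern fails. If nothing failed, the matched text is computed from start and index.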

//a(b|c)
internal static ParseFunc MatchAnd(List<RegexNode> nodes) {
    ParseFunc func = (string input, int index, out string output)
      => {
        int start = index;
        bool matchFaild = false;
        output = string.Empty;

        foreach (var n in nodes) {
            string tempOutput;
            if (n.Parse(input, index, out tempOutput)) {
                index += tempOutput.Length;
            }
            else {
                matchFaild = true;
                break;
            }
        }
        if (!matchFaild) {
            output = input.Substring(start, index - start);
        }

        // An empty result counts as failure.
        return !string.IsNullOrEmpty(output);
    };
    return func;
}
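
Concatenation composes naturally with the other parse methods. The sketch below builds a(b|c) from three small combinators. As before, it operates on ParseFunc delegates rather than RegexNode objects, the delegate shape is inferred from the listings, and the names ConcatenationDemo, Literal, LongestOf, and Sequence are illustrative only.

```csharp
using System;
using System.Collections.Generic;

public static class ConcatenationDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    public static ParseFunc Literal(string str) {
        return (string input, int index, out string output) => {
            bool ok = index >= 0 && index <= input.Length
                      && string.CompareOrdinal(input, index, str, 0, str.Length) == 0;
            output = ok ? str : string.Empty;
            return ok;
        };
    }

    public static ParseFunc LongestOf(IList<ParseFunc> alternatives) {
        return (string input, int index, out string output) => {
            output = string.Empty;
            foreach (ParseFunc alternative in alternatives) {
                string candidate;
                if (alternative(input, index, out candidate) && candidate.Length > output.Length) {
                    output = candidate;
                }
            }
            return output.Length > 0;
        };
    }

    // Every part must match in order, each one starting where the previous one ended.
    public static ParseFunc Sequence(IList<ParseFunc> parts) {
        return (string input, int index, out string output) => {
            int start = index;
            output = string.Empty;
            foreach (ParseFunc part in parts) {
                string piece;
                if (!part(input, index, out piece)) {
                    return false; // one failing part fails the whole sequence
                }
                index += piece.Length;
            }
            output = input.Substring(start, index - start);
            return output.Length > 0;
        };
    }

    public static void Main() {
        // Pattern: a(b|c)
        ParseFunc bOrC = LongestOf(new List<ParseFunc> { Literal("b"), Literal("c") });
        ParseFunc pattern = Sequence(new List<ParseFunc> { Literal("a"), bOrC });
        string output;
        Console.WriteLine(pattern("abac", 0, out output) + " " + output); // True ab
        Console.WriteLine(pattern("abac", 2, out output) + " " + output); // True ac
        Console.WriteLine(pattern("ad", 0, out output));                  // False: 'd' is neither b nor c
    }
}
```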

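Naturally, concatenation has a closure form too: again a do loop wrapping a foreach loop. After the closures above, this function needs little further explanation.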

//(a(b|c))*
internal static ParseFunc MatchOneOrMoreWithAnd(List<RegexNode> nodes) {
    ParseFunc func = (string input, int index, out string output)
        => {
        int start = index;
        bool matchFaild = false;
        string willOutput = string.Empty;

        do {
            matchFaild = false;
            foreach (var n in nodes) {
                string tempOutput;
                if (n.Parse(input, index, out tempOutput)) {
                    index += tempOutput.Length;
                }
                else {
                    matchFaild = true;
                    break;
                }
            }
            if (!matchFaild) {
                willOutput = input.Substring(start, index - start);
            }
        } while (!matchFaild);

        output = willOutput;
        // An empty result counts as failure.
        return !string.IsNullOrEmpty(willOutput);
    };
    return func;
}

Now for the unit tests. First, a helper function:

class UnitTest {
    public static bool Test(string pattern, string input, string expectedOutPut) {
        PatternNode pattNode = PatternParser.Parse(pattern);
        RegexNode parseNode = RegexParser.GetParseNode(pattNode);
        string[] words = parseNode.SplitWords(input);
        return expectedOutPut == string.Join("-", words);
    }
}
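
SplitWords itself is not shown in this part. Judging only from the expected outputs of the test cases, it appears to try the pattern at each position, emit the match and skip past it on success, and emit a single character on failure. The sketch below is a guess at that behavior and not the actual implementation; the delegate shape and the names SplitWordsDemo, Literal, and SplitWords (as a free-standing function) are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class SplitWordsDemo {
    // Assumed delegate shape, inferred from the lambda signature in the article.
    public delegate bool ParseFunc(string input, int index, out string output);

    public static ParseFunc Literal(string str) {
        return (string input, int index, out string output) => {
            bool ok = index >= 0 && index <= input.Length
                      && string.CompareOrdinal(input, index, str, 0, str.Length) == 0;
            output = ok ? str : string.Empty;
            return ok;
        };
    }

    // Hypothetical splitter: matched text becomes one word,
    // and any character that cannot start a match becomes a word of its own.
    public static string[] SplitWords(ParseFunc parse, string input) {
        var words = new List<string>();
        int index = 0;
        while (index < input.Length) {
            string word;
            if (!parse(input, index, out word) || word.Length == 0) {
                word = input.Substring(index, 1); // fall back to a single character
            }
            words.Add(word);
            index += word.Length; // always advances by at least one character
        }
        return words.ToArray();
    }

    public static void Main() {
        string[] words = SplitWords(Literal("ab"), "aabaab");
        Console.WriteLine(string.Join("-", words)); // a-ab-a-ab
    }
}
```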

Then run each of the test cases designed earlier:

Console.WriteLine(UnitTest.Test("a", "aabaab", "a-a-b-a-a-b"));
Console.WriteLine(UnitTest.Test("ab", "aabaab", "a-ab-a-ab"));
Console.WriteLine(UnitTest.Test("(a)*", "aabaab", "aa-b-aa-b"));
Console.WriteLine(UnitTest.Test("(a|b)*", "aabbaxaabbb", "aabba-x-aabbb"));
Console.WriteLine(UnitTest.Test("ab|ac", "abaaac", "ab-a-a-ac"));
Console.WriteLine(UnitTest.Test("a(b|c)", "abaaac", "ab-a-a-ac"));
Console.WriteLine(UnitTest.Test("a(b|c)*", "abbaaaccab", "abb-a-a-acc-ab"));

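All of them return true, which shows that the features work. Unit tests are not a cure-all, though: some errors slip past them, so more of the bugs still have to be found by reading the code.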

posted @ 2010-05-30 19:49  蛙蛙王子