Regular Expression Performance?!

Scenario 1

            Stopwatch w = new Stopwatch();
            w.Start();
            Match mtch = Regex.Match(str, @"(?<=<html.*?<head.*?)<(link|script).*?(?=</head>)", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            w.Stop();

            Console.WriteLine(w.ElapsedMilliseconds);
            Console.ReadLine();

str is the HTML text of an arbitrary web page. The goal is to find the first link or script tag inside the head of the html element. In jQuery terms: $("html>head>link:first,html>head>script:first")

The result astonished me: 337782 ms = 337.782 seconds ≈ 5.6 minutes.

What makes it so clumsy?

1. .*?

2. Grouping

3. The two zero-width assertions

Changing the regular expression to <html.*?<head.*?<(link|script).*?</head> brings the time down to 4 milliseconds.
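The faster pattern matches from <html onward, so its match value is no longer just the part that the look-around version returned. One way to get that part back is a capture group. The sketch below is only an illustration: the class name HeadTagCapture and the method TryFind are hypothetical, and the only change to the pattern is the added group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadTagCapture
{
    // The fast pattern from the text, with a capture group (group 1) around
    // the span that the look-around version returned as its match value.
    private static readonly Regex Pattern = new Regex(
        @"<html.*?<head.*?(<(link|script).*?)</head>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // On success, value holds the text from the first <link or <script
    // up to (but not including) </head>.
    public static bool TryFind(string html, out string value)
    {
        Match m = Pattern.Match(html);
        value = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

Calling HeadTagCapture.TryFind(str, out var value) then yields the captured span without any look-behind.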

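Is there something wrong with the zero-width assertions I wrote?

After repeated testing, I found that a zero-width assertion must not contain .*? or .* or anything similar. Absolutely not!!! A likely explanation is that the engine retries the look-behind at each candidate start position, and an unbounded quantifier inside it lets every retry scan back across a large part of the document.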

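Break the regular expression apart and solve the problem in several steps.

1. First use (?<=<head).*?(?=</head>) to extract the content of the head.

2. Then use <(link|script) to find the first match.

OK.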
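The two steps above can be sketched as follows. This is a minimal illustration rather than the original implementation: the class name HeadTagFinder and the method TryFindFirstTagName are hypothetical, while the two patterns are the ones given in the steps.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadTagFinder
{
    // Step 1: isolate the content of the head.
    // Both look-arounds contain only literal text, no .* or .*? inside them.
    private static readonly Regex HeadContent = new Regex(
        @"(?<=<head).*?(?=</head>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Step 2: find the first opening link or script tag in that content.
    private static readonly Regex FirstTag = new Regex(
        @"<(link|script)",
        RegexOptions.IgnoreCase);

    // On success, tagName is "link" or "script" (lower-cased).
    public static bool TryFindFirstTagName(string html, out string tagName)
    {
        tagName = string.Empty;

        Match head = HeadContent.Match(html);
        if (!head.Success) return false;

        Match tag = FirstTag.Match(head.Value);
        if (!tag.Success) return false;

        tagName = tag.Groups[1].Value.ToLowerInvariant();
        return true;
    }
}
```

Splitting the work this way keeps each pattern small, and the second search only ever looks at the head content.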

Scenario 2

// Convert \uXXXX escapes (four hex digits) to characters
var reg = new Regex(@"\\u[0-9a-fA-F]{4}", RegexOptions.Compiled);
val = reg.Replace(val, new MatchEvaluator(match =>
{
    // Skip the leading "\u" and parse the four hex digits.
    return char.ConvertFromUtf32(Convert.ToInt32(match.Value.Substring(2), 16));
}));
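One possible contributor to the cost measured below is that a Regex built with RegexOptions.Compiled is expensive to construct. This is an assumption: the post does not show whether the constructor ran inside the timed loop. If it did, building the Regex once and reusing it might look like the sketch below. The class name UnicodeEscapeRegex and the method Decode are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapeRegex
{
    // Constructed once and reused by every call, so the cost of
    // RegexOptions.Compiled is paid a single time.
    private static readonly Regex Escape = new Regex(
        @"\\u([0-9a-fA-F]{4})",
        RegexOptions.Compiled);

    // Replaces each \uXXXX escape with the UTF-16 code unit it encodes.
    public static string Decode(string val)
    {
        return Escape.Replace(
            val,
            m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
    }
}
```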

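For simple processing like this, a plain loop is the safer choice.

Processing the string "123456789\\u0029123456789" ten thousand times with the regex: 7770 ms.

With a loop, ten thousand times: 1 ms.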
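The loop itself is not shown in the post, so the following is only a sketch of what such a loop could look like. The class name UnicodeEscapeLoop and its methods are hypothetical. It assumes an escape is a backslash, the letter u, and exactly four hex digits, and it copies anything else through unchanged.

```csharp
using System;
using System.Text;

public static class UnicodeEscapeLoop
{
    private static bool IsHex(char c)
    {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    // Scans the input once and replaces each \uXXXX with its UTF-16 code unit.
    public static string Decode(string input)
    {
        var sb = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            // An escape needs six characters: '\', 'u', and four hex digits.
            if (input[i] == '\\'
                && i + 6 <= input.Length
                && input[i + 1] == 'u'
                && IsHex(input[i + 2]) && IsHex(input[i + 3])
                && IsHex(input[i + 4]) && IsHex(input[i + 5]))
            {
                int code = Convert.ToInt32(input.Substring(i + 2, 4), 16);
                sb.Append((char)code); // cast keeps surrogate halves intact
                i += 6;
            }
            else
            {
                sb.Append(input[i]);
                i++;
            }
        }
        return sb.ToString();
    }
}
```

The cast to char is used instead of char.ConvertFromUtf32 because the latter throws for surrogate code points, which can appear when a character outside the Basic Multilingual Plane is written as two consecutive escapes.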

posted @ 2009-09-25 09:18 NewSea
