Sunday algorithm in C# (Boyer-Moore-Horspool-Sunday string search algorithm)

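My regular-expression searches kept getting stuck in what looked like an infinite loop, so I started looking for another way to search. .NET's built-in IndexOf (or LastIndexOf) only finds the first or the last match by default; to collect every match you still have to write your own loop, for example with Substring. So I went looking for something ready-made and found that in this field the BM (Boyer-Moore) algorithm is king, and that the Sunday algorithm is said to be the best improved version so far. I could not confirm that claim on non-Chinese sites, Wikipedia in particular, but plenty of Chinese articles discuss Sunday, so for now I will take it to be the best.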
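
As a side note, collecting every match does not strictly require Substring, because string.IndexOf has overloads that accept a start index. A minimal sketch, assuming an ordinal (char-by-char) comparison is what you want; MatchFinder and FindAll are names made up for this example:

```csharp
using System;
using System.Collections.Generic;

public static class MatchFinder
{
    // Returns the start index of every (possibly overlapping) occurrence of
    // pattern in text, using only the built-in IndexOf overload that takes a
    // start index. No substrings are allocated.
    public static List<int> FindAll(string text, string pattern)
    {
        var positions = new List<int>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern)) return positions;

        int pos = text.IndexOf(pattern, 0, StringComparison.Ordinal);
        while (pos >= 0)
        {
            positions.Add(pos);
            // Resume one character past the previous hit so overlapping matches
            // are kept. pos + 1 never exceeds text.Length here, which IndexOf
            // accepts as a start index.
            pos = text.IndexOf(pattern, pos + 1, StringComparison.Ordinal);
        }
        return positions;
    }
}
```

To skip overlapping matches instead, resume at pos + pattern.Length rather than pos + 1.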

This article describes how the algorithm works very clearly, with diagrams:

http://www.cnblogs.com/lbsong/archive/2012/05/25/2518188.html

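Unfortunately the code given there has a flaw: when no match is found, it throws an index-out-of-range exception. I changed it to this: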

        public static int IndexOf(string text, string pattern)
        {
            return IndexOf(text, pattern, 0, text.Length);
        }

        // Searches text[startPosition .. startPosition + count) for pattern.
        // Returns the index of the first match, or -1 if there is none.
        public static int IndexOf(string text, string pattern, int startPosition, int count)
        {
            if (startPosition < 0) startPosition = 0;
            if (startPosition >= text.Length) return -1;
            int endPosition = startPosition + count;
            if (endPosition < 0) return -1;
            if (endPosition > text.Length) endPosition = text.Length;
            if (pattern.Length > endPosition - startPosition) return -1;

            // Start of the window currently aligned with the pattern.
            int matchPosition = startPosition;

            while (matchPosition + pattern.Length <= endPosition)
            {
                // Compare the pattern against the current window, left to right.
                int j = 0;
                while (j < pattern.Length && text[matchPosition + j] == pattern[j])
                {
                    j++;
                }
                if (j == pattern.Length) return matchPosition;

                // m is the character just past the window. It is always computed
                // from the window start, never from a partially advanced index.
                int m = matchPosition + pattern.Length;
                if (m >= endPosition) break;   // no room for another window

                // Find the last occurrence of text[m] in the pattern.
                int k = pattern.Length - 1;
                while (k >= 0 && text[m] != pattern[k])
                {
                    k--;
                }

                // Shift so that pattern[k] lines up with text[m]. If the character
                // is not in the pattern (k == -1), the window jumps past m.
                matchPosition += pattern.Length - k;
            }

            return -1;
        }
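
The listing above rescans the pattern from right to left on every mismatch to work out the shift. The usual formulation of the Sunday algorithm precomputes that shift once per pattern instead. A minimal sketch of that variant follows; it is not the code from the download, the names SundayTable and BuildShiftTable are illustrative, and it assumes non-null inputs:

```csharp
using System;
using System.Collections.Generic;

public static class SundayTable
{
    // For each character in the pattern, the shift is the distance from its
    // last occurrence to the position just past the end of the pattern.
    public static Dictionary<char, int> BuildShiftTable(string pattern)
    {
        var shift = new Dictionary<char, int>();
        for (int k = 0; k < pattern.Length; k++)
        {
            shift[pattern[k]] = pattern.Length - k; // later occurrences overwrite earlier ones
        }
        return shift;
    }

    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        Dictionary<char, int> shift = BuildShiftTable(pattern);

        int pos = 0;
        while (pos + pattern.Length <= text.Length)
        {
            // In-place ordinal comparison of the current window with the pattern.
            if (string.CompareOrdinal(text, pos, pattern, 0, pattern.Length) == 0) return pos;

            int next = pos + pattern.Length;   // character just past the window
            if (next >= text.Length) break;

            // A character absent from the pattern lets the window jump past it.
            int step;
            pos += shift.TryGetValue(text[next], out step) ? step : pattern.Length + 1;
        }
        return -1;
    }
}
```

The table costs one pass over the pattern up front, which only pays off when the text is much longer than the pattern or the same pattern is reused.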

OK, now let's test the performance:

        // Requires: using System; using System.Diagnostics; using System.IO; using System.Text;
        public static void PerformanceTest()
        {
            // Read the whole test file once; the using block closes the reader.
            string context;
            using (StreamReader reader = new StreamReader("D:\\LogConfiguration.xml", Encoding.ASCII))
            {
                context = reader.ReadToEnd();
            }
            string pattern = "xxxx";
            int count = 1000 * 10;

            Stopwatch watch = new Stopwatch();

            // Substring-based implementation, disabled.
            //watch.Start();
            //for (int i = 0; i < count; i++)
            //{
            //    int pos= Sunday.GetPositionFirst(context, pattern, true);
            //}
            //watch.Stop();
            //Console.WriteLine(watch.ElapsedMilliseconds);

            // Baseline: the built-in string.IndexOf.
            watch.Restart();
            for (int i = 0; i < count; i++)
            {
                int pos = context.IndexOf(pattern);
            }
            watch.Stop();
            Console.WriteLine(watch.ElapsedMilliseconds);

            // Sunday search. SundaySearch is presumably the routine listed
            // above as IndexOf, under the name used in the test project.
            watch.Restart();
            for (int i = 0; i < count; i++)
            {
                int pos = Sunday.SundaySearch(context, pattern);
            }
            watch.Stop();
            Console.WriteLine(watch.ElapsedMilliseconds);
        }

In both cases, whether or not a match exists, the Sunday algorithm takes roughly 20% of the time IndexOf takes. The algorithm really is useful.
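
One caveat when reading these numbers: string.IndexOf(string) with no extra arguments performs a culture-sensitive comparison using the current culture, which is typically slower than an ordinal one, whereas a hand-written search compares char values directly. A fairer baseline would pass StringComparison.Ordinal. A rough sketch of such a timing harness, where SearchBenchmark, TimeMs and Run are illustrative names; actual figures depend on the runtime version and the input:

```csharp
using System;
using System.Diagnostics;

public static class SearchBenchmark
{
    // Runs search() the given number of times and returns elapsed milliseconds.
    public static long TimeMs(Func<int> search, int iterations)
    {
        search(); // warm-up call so that JIT compilation is not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            search();
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // Times the default (culture-sensitive) IndexOf against the ordinal overload.
    public static void Run(string text, string pattern, int iterations)
    {
        long culture = TimeMs(() => text.IndexOf(pattern), iterations);
        long ordinal = TimeMs(() => text.IndexOf(pattern, StringComparison.Ordinal), iterations);
        Console.WriteLine("culture-sensitive IndexOf: " + culture + " ms");
        Console.WriteLine("ordinal IndexOf:           " + ordinal + " ms");
    }
}
```

A third TimeMs call wrapping the Sunday search would put all three measurements on the same footing.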

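But whatever you do, do not implement the algorithm with Substring. That creates a large number of intermediate strings, and the cost of allocating memory and copying characters far outweighs whatever the algorithm saves. The commented-out part of the test is an implementation built on Substring, and it is much slower than IndexOf.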
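
To make the cost concrete: a window comparison written with Substring allocates a new string on every call, while an index-based comparison does not. A small sketch of both, with WindowCompare, MatchesAtWithSubstring and MatchesAtInPlace as made-up names (the commented-out GetPositionFirst itself is not shown in this post):

```csharp
using System;

public static class WindowCompare
{
    // Allocating version: copies pattern.Length characters into a new string
    // every time it is called.
    public static bool MatchesAtWithSubstring(string text, int pos, string pattern)
    {
        if (pos < 0 || pos + pattern.Length > text.Length) return false;
        return text.Substring(pos, pattern.Length) == pattern;
    }

    // Non-allocating version: compares the characters in place.
    public static bool MatchesAtInPlace(string text, int pos, string pattern)
    {
        if (pos < 0 || pos + pattern.Length > text.Length) return false;
        return string.CompareOrdinal(text, pos, pattern, 0, pattern.Length) == 0;
    }
}
```

Both return the same answers; only the second avoids creating garbage inside the search loop.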

https://files.cnblogs.com/devourer/SundayTest.7z

posted @ 2013-08-19 12:22  ^^!