Sunday algorithm implemented in C# (Boyer-Moore-Horspool-Sunday string search algorithm)
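Because my regular-expression searches kept getting stuck in infinite loops, I started looking at other ways to search. .NET's built-in IndexOf only finds the first match by default (and LastIndexOf the last one), so to find every match you have to write your own loop with Substring. I went looking for something ready-made and found that in this area the BM (Boyer-Moore) algorithm is king, and that the Sunday algorithm is said to be the best improved variant so far. I could not confirm that claim on non-Chinese sites, Wikipedia in particular, but there are many Chinese articles discussing Sunday, so for now I will take it to be the best.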
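The loop over the built-in search mentioned above can be written without Substring, because IndexOf has an overload that takes a start index. A minimal sketch follows; the class and method names (AllMatches, IndexOfAll) are made up for illustration, and ordinal comparison is an assumption about what the search should mean.

```csharp
using System;
using System.Collections.Generic;

public static class AllMatches
{
    // Hypothetical helper (not from the original post): returns the start index
    // of every occurrence of pattern in text, overlapping ones included.
    public static List<int> IndexOfAll(string text, string pattern)
    {
        var positions = new List<int>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern))
        {
            return positions;
        }

        // IndexOf(string, int, StringComparison) resumes the search at the given
        // index, so no Substring copy is made between iterations.
        int pos = text.IndexOf(pattern, 0, StringComparison.Ordinal);
        while (pos >= 0)
        {
            positions.Add(pos);
            // pos + 1 is at most text.Length, which IndexOf accepts as a start index.
            pos = text.IndexOf(pattern, pos + 1, StringComparison.Ordinal);
        }
        return positions;
    }
}
```

For example, `AllMatches.IndexOfAll("abababa", "aba")` yields the positions 0, 2 and 4.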
This article describes how the algorithm works very clearly, with diagrams:
http://www.cnblogs.com/lbsong/archive/2012/05/25/2518188.html
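Unfortunately the code given in that article has a defect: when no match is found, it throws an index-out-of-range exception. I changed it to this: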
```csharp
public static class Sunday
{
    public static int IndexOf(string text, string pattern)
    {
        return IndexOf(text, pattern, 0, text.Length);
    }

    // Searches text[startPosition .. startPosition + count) for pattern.
    // Returns the index of the first match, or -1 if there is none.
    public static int IndexOf(string text, string pattern, int startPosition, int count)
    {
        if (startPosition < 0) startPosition = 0;
        if (startPosition >= text.Length) return -1;
        int endPosition = startPosition + count;
        if (endPosition < 0) return -1;
        if (endPosition > text.Length) endPosition = text.Length;
        if (pattern.Length > endPosition - startPosition) return -1;

        int i = startPosition; // start of the current window

        while (i + pattern.Length <= endPosition)
        {
            // Compare the window with the pattern, character by character.
            int j = 0;
            while (j < pattern.Length && text[i + j] == pattern[j])
            {
                j++;
            }
            if (j == pattern.Length) return i;

            // m is the character just past the window; stop if there is none.
            int m = i + pattern.Length;
            if (m >= endPosition) break;

            // Find the rightmost occurrence of text[m] in the pattern (k == -1 if absent).
            int k = pattern.Length - 1;
            while (k >= 0 && text[m] != pattern[k])
            {
                k--;
            }

            // Shift the window so that pattern[k] lines up with text[m].
            i += pattern.Length - k;
        }

        return -1;
    }
}
```
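The listing above rescans the pattern on every mismatch to find the shift. A common variant precomputes a shift table once per search instead. The sketch below is not the code from this post: SundayTable and its methods are hypothetical names, and using a Dictionary for the table is just one possible choice.

```csharp
using System;
using System.Collections.Generic;

public static class SundayTable
{
    // Hypothetical variant: Sunday search with a precomputed shift table.
    public static int IndexOf(string text, string pattern, int start = 0)
    {
        if (text == null || pattern == null || start < 0) return -1;
        int n = text.Length, m = pattern.Length;
        if (m == 0) return start <= n ? start : -1;

        // shift[c] = distance from the rightmost occurrence of c in the pattern
        // to the position one past the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int k = 0; k < m; k++) shift[pattern[k]] = m - k;

        int i = start;
        while (i + m <= n)
        {
            // Compare the window in place, without creating a substring.
            if (string.CompareOrdinal(text, i, pattern, 0, m) == 0) return i;
            if (i + m >= n) break; // no character after the window

            // Characters that do not occur in the pattern skip the whole window plus one.
            int s;
            i += shift.TryGetValue(text[i + m], out s) ? s : m + 1;
        }
        return -1;
    }

    // Collects every match by restarting one position after each hit.
    public static List<int> IndexOfAll(string text, string pattern)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern)) return hits;
        for (int pos = IndexOf(text, pattern); pos >= 0; pos = IndexOf(text, pattern, pos + 1))
        {
            hits.Add(pos);
        }
        return hits;
    }
}
```

With the text "substring searching" and the pattern "search", the first window "substr" fails, the character after it ('i') is not in the pattern so the window jumps 7 places, then 'r' gives a shift of 3 and the match is found at index 10. Note that IndexOfAll rebuilds the table on every call; for many matches it would be better to build it once and reuse it.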
OK, now let's test the performance:
```csharp
// Requires: using System; using System.Diagnostics; using System.IO; using System.Text;
public static void PerformanceTest()
{
    string context;
    using (StreamReader reader = new StreamReader("D:\\LogConfiguration.xml", Encoding.ASCII))
    {
        context = reader.ReadToEnd();
    }
    string pattern = "xxxx";
    int count = 1000 * 10;

    Stopwatch watch = new Stopwatch();

    // Substring-based implementation, disabled because it is much slower than IndexOf.
    //watch.Start();
    //for (int i = 0; i < count; i++)
    //{
    //    int pos = Sunday.GetPositionFirst(context, pattern, true);
    //}
    //watch.Stop();
    //Console.WriteLine(watch.ElapsedMilliseconds);

    // Baseline: the built-in string.IndexOf.
    watch.Reset();
    watch.Start();
    for (int i = 0; i < count; i++)
    {
        int pos = context.IndexOf(pattern);
    }
    watch.Stop();
    Console.WriteLine(watch.ElapsedMilliseconds);

    // Sunday search: the IndexOf method listed above, assumed to live in a class named Sunday.
    watch.Reset();
    watch.Start();
    for (int i = 0; i < count; i++)
    {
        int pos = Sunday.IndexOf(context, pattern);
    }
    watch.Stop();
    Console.WriteLine(watch.ElapsedMilliseconds);
}
```
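In both cases, with a match present and with no match at all, the Sunday algorithm took roughly 20% of the time IndexOf did. The algorithm really is useful.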
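One thing worth checking about the baseline: string.IndexOf(string) with no StringComparison argument does a culture-sensitive comparison, while the Sunday code compares char values directly, which corresponds to StringComparison.Ordinal. Timing the ordinal overload as well gives a like-for-like baseline. This is a sketch, not a measurement: TimeSearch, BaselineTiming and the synthetic text are made up for illustration, and no timings are claimed here.

```csharp
using System;
using System.Diagnostics;

public static class BaselineTiming
{
    // Hypothetical helper: runs search(text, pattern) count times and
    // returns the elapsed wall-clock time in milliseconds.
    public static long TimeSearch(Func<string, string, int> search, string text, string pattern, int count)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            search(text, pattern);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Compare()
    {
        // Synthetic input (an assumption, not the XML file used in the test above).
        string text = new string('a', 100000) + "xxxx";
        string pattern = "xxxx";

        long cultureMs = TimeSearch((t, p) => t.IndexOf(p), text, pattern, 1000);
        long ordinalMs = TimeSearch((t, p) => t.IndexOf(p, StringComparison.Ordinal), text, pattern, 1000);

        Console.WriteLine("culture-sensitive IndexOf: " + cultureMs + " ms");
        Console.WriteLine("ordinal IndexOf: " + ordinalMs + " ms");
    }
}
```

The same TimeSearch call can wrap the Sunday search, so all three candidates run over identical input.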
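But never implement the algorithm with Substring: it creates a large number of intermediate strings, and the cost of allocating memory and copying characters far outweighs whatever the algorithm saves. The commented-out part of the test is a Substring-based implementation, and it is much slower than IndexOf.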
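To make the difference concrete, here is a minimal sketch of the two ways to test one window of the text. It does not reproduce the GetPositionFirst code from the test project; WindowCompare and both method names are hypothetical.

```csharp
using System;

public static class WindowCompare
{
    // Allocates and fills a new string for every window that is tested.
    public static bool MatchesWithSubstring(string text, int i, string pattern)
    {
        return i >= 0
            && i + pattern.Length <= text.Length
            && text.Substring(i, pattern.Length) == pattern;
    }

    // Compares the characters in place; no intermediate string is created.
    public static bool MatchesInPlace(string text, int i, string pattern)
    {
        return i >= 0
            && i + pattern.Length <= text.Length
            && string.CompareOrdinal(text, i, pattern, 0, pattern.Length) == 0;
    }
}
```

Both methods return the same answer for every window; inside a search loop, only the first one creates a new string per window, which is the overhead described above.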
https://files.cnblogs.com/devourer/SundayTest.7z