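String Matching Algorithms
Terminology: the text S is the string being searched, and the pattern P is the string being searched for. For example, when looking for ab in cbabce, cbabce is the text and ab is the pattern. Let n be the length of S and k the length of P.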
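Brute-Force Algorithm
The most obvious approach: align P with the start of S and compare character by character, beginning with the first character of P. On the first mismatch, stop, shift P one position to the right, and start comparing again from the beginning of P. Because each new alignment may re-compare characters that were already examined, the worst case needs on the order of n·k character comparisons.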
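The brute-force procedure above can be sketched in a few lines of Java. This is only a minimal illustration: the class name BruteForceSearch and the convention of returning -1 when P does not occur are assumptions made here, not part of the original notes.

```java
// Hypothetical helper class; the name is chosen for illustration only.
final class BruteForceSearch {

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    static int indexOf(String s, String p) {
        int n = s.length();
        int k = p.length();
        // Try every alignment of P against S, from left to right.
        for (int i = 0; i + k <= n; i++) {
            int j = 0;
            // Compare from the first character of P until a mismatch.
            while (j < k && s.charAt(i + j) == p.charAt(j)) {
                j++;
            }
            if (j == k) {
                return i; // all k characters matched at alignment i
            }
            // Mismatch: the loop shifts P one position right and starts over.
        }
        return -1;
    }
}
```

For the example from the terminology paragraph, BruteForceSearch.indexOf("cbabce", "ab") returns 2.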
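KMP Algorithm (1970)
KMP stands for Knuth-Morris-Pratt, the surname initials of its three inventors.
Key idea: on a mismatch, use a "partial match table" to skip as many positions that cannot match as possible.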
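Main procedure: first compute the "partial match table" from P. Then slide P from left to right along S, comparing characters from left to right at each alignment. When a mismatch occurs, the part of P before the mismatched character has already matched; call this part the matched prefix. Look up the partial match value of the matched prefix in the table and use it to compute how far to shift P: shift = number of characters in the matched prefix - partial match value. If the very first character of P mismatches, the matched prefix is empty and P simply shifts by 1.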
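Partial match value: the length of the longest string that is both a proper prefix and a proper suffix. For example, the proper prefixes of ABD are A and AB, and its proper suffixes are BD and D; they have nothing in common, so the partial match value of ABD is 0.
Partial match table: computing the partial match value for every prefix of P gives the partial match table of P.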
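One common way to build the table in time linear in k is sketched below. The class name PartialMatchTable and the method name build are hypothetical, chosen only for this illustration; entry i of the returned array is the partial match value of the prefix of P that ends at index i.

```java
// Hypothetical helper class; names are chosen for illustration only.
final class PartialMatchTable {

    /** table[i] = partial match value of the prefix p[0..i]. */
    static int[] build(String p) {
        int[] table = new int[p.length()];
        int len = 0; // partial match value of the previous prefix
        for (int i = 1; i < p.length(); i++) {
            // Fall back to shorter candidates until p[i] can extend one of them.
            while (len > 0 && p.charAt(i) != p.charAt(len)) {
                len = table[len - 1];
            }
            if (p.charAt(i) == p.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }
}
```

For ABD this yields 0, 0, 0, which agrees with the example above.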
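The essence of a "partial match": sometimes the beginning and the end of a string repeat. For example, "ABCDAB" contains "AB" twice, so its partial match value is 2 (the length of "AB"). When the pattern is shifted, moving the first "AB" 4 positions to the right (string length - partial match value = 6 - 2) brings it to the position of the second "AB".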
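The table-driven search can be sketched as follows, written so that the shift formula (matched characters - partial match value) is visible in the code. The class and method names are hypothetical, and returning -1 for "not found" is an assumption of this sketch. The sample strings in the usage note were picked here for illustration.

```java
// Hypothetical helper class; names are chosen for illustration only.
final class KmpSearch {

    /** table[i] = partial match value of the prefix p[0..i]. */
    static int[] buildTable(String p) {
        int[] table = new int[p.length()];
        int len = 0;
        for (int i = 1; i < p.length(); i++) {
            while (len > 0 && p.charAt(i) != p.charAt(len)) {
                len = table[len - 1];
            }
            if (p.charAt(i) == p.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    static int indexOf(String s, String p) {
        int n = s.length();
        int k = p.length();
        if (k == 0) {
            return 0;
        }
        int[] table = buildTable(p);
        int i = 0; // current alignment of P within S
        int j = 0; // number of characters of P already known to match
        while (i + k <= n) {
            while (j < k && s.charAt(i + j) == p.charAt(j)) {
                j++;
            }
            if (j == k) {
                return i;
            }
            if (j == 0) {
                i++; // nothing matched: shift by one
            } else {
                int value = table[j - 1]; // partial match value of the matched prefix
                i += j - value;           // shift = matched characters - partial match value
                j = value;                // these characters need not be compared again
            }
        }
        return -1;
    }
}
```

With S = "BBC ABCDAB ABCDABCDABDE" and P = "ABCDABD", the first mismatch after matching "ABCDAB" shifts P by 6 - 2 = 4, and the search eventually returns index 15.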
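See: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html
Java implementation (from https://algs4.cs.princeton.edu/53substring/KMP.java.html). Note that this version precomputes a DFA over the alphabet rather than the partial match table described above: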
/******************************************************************************
 *  Compilation:  javac KMP.java
 *  Execution:    java KMP pattern text
 *  Dependencies: StdOut.java
 *
 *  Reads in two strings, the pattern and the input text, and
 *  searches for the pattern in the input text using the
 *  KMP algorithm.
 *
 *  % java KMP abracadabra abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern:               abracadabra
 *
 *  % java KMP rab abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern:         rab
 *
 *  % java KMP bcara abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern:                                   bcara
 *
 *  % java KMP rabrabracad abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern:                        rabrabracad
 *
 *  % java KMP abacad abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern: abacad
 *
 ******************************************************************************/

/**
 *  The {@code KMP} class finds the first occurrence of a pattern string
 *  in a text string.
 *  <p>
 *  This implementation uses a version of the Knuth-Morris-Pratt substring search
 *  algorithm. The version takes time proportional to <em>n</em> + <em>m R</em>
 *  in the worst case, where <em>n</em> is the length of the text string,
 *  <em>m</em> is the length of the pattern, and <em>R</em> is the alphabet size.
 *  It uses extra space proportional to <em>m R</em>.
 *  <p>
 *  For additional documentation,
 *  see <a href="https://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
 *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
 */
public class KMP {
    private final int R;       // the radix
    private int[][] dfa;       // the KMP automaton

    private char[] pattern;    // either the character array for the pattern
    private String pat;        // or the pattern string

    /**
     * Preprocesses the pattern string.
     *
     * @param pat the pattern string
     */
    public KMP(String pat) {
        this.R = 256;
        this.pat = pat;

        // build DFA from pattern
        int m = pat.length();
        dfa = new int[R][m];
        dfa[pat.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][x];     // Copy mismatch cases.
            dfa[pat.charAt(j)][j] = j+1;   // Set match case.
            x = dfa[pat.charAt(j)][x];     // Update restart state.
        }
    }

    /**
     * Preprocesses the pattern string.
     *
     * @param pattern the pattern string
     * @param R the alphabet size
     */
    public KMP(char[] pattern, int R) {
        this.R = R;
        this.pattern = new char[pattern.length];
        for (int j = 0; j < pattern.length; j++)
            this.pattern[j] = pattern[j];

        // build DFA from pattern
        int m = pattern.length;
        dfa = new int[R][m];
        dfa[pattern[0]][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][x];     // Copy mismatch cases.
            dfa[pattern[j]][j] = j+1;      // Set match case.
            x = dfa[pattern[j]][x];        // Update restart state.
        }
    }

    /**
     * Returns the index of the first occurrence of the pattern string
     * in the text string.
     *
     * @param  txt the text string
     * @return the index of the first occurrence of the pattern string
     *         in the text string; N if no such match
     */
    public int search(String txt) {

        // simulate operation of DFA on text
        int m = pat.length();
        int n = txt.length();
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[txt.charAt(i)][j];
        }
        if (j == m) return i - m;    // found
        return n;                    // not found
    }

    /**
     * Returns the index of the first occurrence of the pattern string
     * in the text string.
     *
     * @param  text the text string
     * @return the index of the first occurrence of the pattern string
     *         in the text string; N if no such match
     */
    public int search(char[] text) {

        // simulate operation of DFA on text
        int m = pattern.length;
        int n = text.length;
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[text[i]][j];
        }
        if (j == m) return i - m;    // found
        return n;                    // not found
    }

    /**
     * Takes a pattern string and an input string as command-line arguments;
     * searches for the pattern string in the text string; and prints
     * the first occurrence of the pattern string in the text string.
     *
     * @param args the command-line arguments
     */
    public static void main(String[] args) {
        String pat = args[0];
        String txt = args[1];
        char[] pattern = pat.toCharArray();
        char[] text    = txt.toCharArray();

        KMP kmp1 = new KMP(pat);
        int offset1 = kmp1.search(txt);

        KMP kmp2 = new KMP(pattern, 256);
        int offset2 = kmp2.search(text);

        // print results
        StdOut.println("text:    " + txt);

        StdOut.print("pattern: ");
        for (int i = 0; i < offset1; i++)
            StdOut.print(" ");
        StdOut.println(pat);

        StdOut.print("pattern: ");
        for (int i = 0; i < offset2; i++)
            StdOut.print(" ");
        StdOut.println(pat);
    }
}

// Copyright © 2000–2017, Robert Sedgewick and Kevin Wayne.
Boyer-Moore Algorithm (1977)
Key idea: after each failed match attempt, skip as many positions that cannot match as possible.
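Main procedure: slide P from left to right along S, but at each alignment compare characters from right to left. On a mismatch (let c be the mismatched character in S), search the part of P that has not yet been compared, from right to left, for an occurrence of c: 1. if c does not occur there, shift P so that it starts just after c; 2. otherwise, shift P so that this occurrence of c lines up with c in S. In addition, if part of the suffix of P had already matched when the mismatch occurred (the "good suffix"), that information can sometimes be used to skip even more positions. The good suffix is not required for correctness; the search still finds the match without it. See the article listed below.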
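The mismatched-character rule described above, without the good suffix, can be sketched as follows. This is a deliberately simple illustration, not the full Boyer-Moore algorithm: the class name BadCharacterSearch is hypothetical, -1 for "not found" is an assumption, and the occurrence of c is found by scanning P directly, whereas practical implementations usually precompute a lookup table.

```java
// Hypothetical helper class; the name is chosen for illustration only.
final class BadCharacterSearch {

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    static int indexOf(String s, String p) {
        int n = s.length();
        int k = p.length();
        int i = 0; // current alignment of P within S
        while (i + k <= n) {
            // Compare from the last character of P backwards.
            int j = k - 1;
            while (j >= 0 && s.charAt(i + j) == p.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return i; // every character matched
            }
            char c = s.charAt(i + j); // mismatched character in S
            // Look for c in the not-yet-compared part p[0..j-1], right to left.
            int t = j - 1;
            while (t >= 0 && p.charAt(t) != c) {
                t--;
            }
            // Found (t >= 0): align p[t] with c, a shift of j - t.
            // Not found (t == -1): start P just after c, a shift of j + 1 = j - t.
            i += j - t;
        }
        return -1;
    }
}
```

For example, searching "HERE IS A SIMPLE EXAMPLE" for "EXAMPLE" returns 17; the first mismatch (S against E) shifts P by its full length of 7 because S does not occur in the pattern.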
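Complexity: in the best case only about n/k characters of S are examined, so the longer the pattern (the larger k), the faster the search, because more positions that cannot match are skipped and fewer comparisons are made. The O(n+k) bound is a worst-case bound for the full algorithm with the good suffix rule (plus a further refinement such as the Galil rule when P occurs in S); with the mismatched-character rule alone, the worst case is O(nk).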
See: http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html