28. Implement strStr()
Problem:
Implement strStr().
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Update (2014-11-02):
The signature of the function had been updated to return the index instead of the pointer. If you still see your function signature returns a char * or String, please click the reload button to reset your code definition.
Link: http://leetcode.com/problems/implement-strstr/
Solution:
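Although this problem is rated easy, it has many solutions, and it gave me quite a headache on an otherwise lovely Labor Day afternoon. Options include brute force, KMP, Rabin-Karp, and Suffix Array / Suffix Tree, among others. I really need to study Suffix Trees properly. I don't understand them yet, but I have a feeling they are (one of) the ultimate weapons for string matching problems. Erik Demaine explains them very clearly in the Strings lecture of MIT's Advanced Data Structures course, and I should watch it a few more times. For KMP, Rabin-Karp and Suffix Array, see the slides and booksite by Sedgewick at Princeton. This post is a placeholder for now; I will fill in each solution bit by bit as I learn. These topics used to be unknown unknowns, now they are known unknowns, and I need to work hard to turn them into known knowns. Just like Federer's 2016 Mercedes-Benz commercial: commitment, pushing yourself further, faster, reaching ever higher.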
Let's start with brute force:
Search from the beginning, trying every start position. Time Complexity - O(m * n), Space Complexity - O(1).
public class Solution {
    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        if (needle.length() == 0)
            return 0;
        for (int i = 0; i <= haystack.length() - needle.length(); i++) {
            int j = 0;
            while (i + j < haystack.length() && j < needle.length()
                    && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
                if (j == needle.length())
                    return i;
            }
        }
        return -1;
    }
}
KMP using DFA:
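First build a DFA from needle, then scan haystack from its first character. The loop ends when the first occurrence is found or haystack is exhausted. R is the radix, i.e. the size of the alphabet (the number of distinct characters the strings may contain). For ASCII text we can simply assume 256. If needle contains Unicode characters, we need the improved KMP, or another method such as Boyer-Moore. For multi-string search, Rabin-Karp can be used.
Time Complexity - O(n), Pre-process Time - O(R * m), Space Complexity - O(R * m).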
public class Solution {
    // Radix: assumes every character in haystack and needle is < 256 (extended ASCII);
    // a larger char would index past the end of dfa.
    private final static int R = 256;
    private int[][] dfa;

    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        if (needle.length() == 0)
            return 0;
        int hsLen = haystack.length(), ndLen = needle.length();
        dfa = new int[R][ndLen];
        buildDFA(needle);
        int i, j;
        for (i = 0, j = 0; i < hsLen && j < ndLen; i++)
            j = dfa[haystack.charAt(i)][j];
        if (j == ndLen)
            return i - ndLen;
        else
            return -1;
    }

    private void buildDFA(String needle) {
        int rowNum = dfa.length, colNum = dfa[0].length;
        dfa[needle.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < colNum; j++) {   // x is the restart state, j is the column
            for (int c = 0; c < rowNum; c++)        // for every character
                dfa[c][j] = dfa[c][x];              // copy mismatch cases
            dfa[needle.charAt(j)][j] = j + 1;       // set match case
            x = dfa[needle.charAt(j)][x];           // update restart state x
        }
    }
}
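To make the DFA construction more concrete, here is a minimal sketch that builds the table for the pattern "ABABAC" and prints the rows for the three characters that occur in it. The class name `KmpDfaDemo` and the helper `buildDfa` are made up for this illustration; the construction follows the same copy-mismatch / set-match / update-restart-state steps as the code above, and it assumes a non-empty pattern whose characters are all smaller than the radix.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original solution.
final class KmpDfaDemo {
    // dfa[c][j] = next state after reading character c in state j
    // (state j means "the first j pattern characters are matched").
    // Assumes pattern is non-empty and every char in it is < radix.
    static int[][] buildDfa(String pattern, int radix) {
        int m = pattern.length();
        int[][] dfa = new int[radix][m];
        dfa[pattern.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < radix; c++) {
                dfa[c][j] = dfa[c][x];          // mismatch: behave like restart state x
            }
            dfa[pattern.charAt(j)][j] = j + 1;  // match: advance to the next state
            x = dfa[pattern.charAt(j)][x];      // update the restart state
        }
        return dfa;
    }

    public static void main(String[] args) {
        int[][] dfa = buildDfa("ABABAC", 256);
        for (char c : new char[] {'A', 'B', 'C'}) {
            System.out.println(c + ": " + Arrays.toString(dfa[c]));
        }
        // A: [1, 1, 3, 1, 5, 1]
        // B: [0, 2, 0, 4, 0, 4]
        // C: [0, 0, 0, 0, 0, 6]
    }
}
```

Every other row of the table is all zeros, which is where the O(R * m) space goes.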
Improved KMP using NFA:
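An improved version of KMP. Because of the radix R, the previous version may actually run slower on the judge. It can be improved by using an NFA, which makes it independent of the pattern's alphabet, so it works for Unicode as well. The construction of the next array is clever; pay attention to when to backtrack. I don't fully understand this code myself, even after staring at it for a long time. I'm also not sure whether it counts as an NFA or a DFA; I'll add more when I get the chance.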
Time Complexity - O(m + n), Space Complexity - O(m)
public class Solution {
    private int[] next;

    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        int m = needle.length(), n = haystack.length();
        if (m == 0)                 // empty needle matches at index 0
            return 0;
        buildNFA(needle);
        int i, j = 0;
        for (i = 0; i < n && j < m; i++) {
            while (j >= 0 && haystack.charAt(i) != needle.charAt(j))
                j = next[j];        // back up on mismatch
            j++;
        }
        if (j == m)
            return i - m;
        else
            return -1;
    }

    private void buildNFA(String needle) {
        int m = needle.length();
        next = new int[m];
        int j = -1;
        for (int i = 0; i < m; i++) {
            if (i == 0)                                         // initialize
                next[i] = -1;
            else if (needle.charAt(i) != needle.charAt(j))      // back-tracking
                next[i] = j;
            else                                                // copy equal cases
                next[i] = next[j];
            while (j >= 0 && needle.charAt(i) != needle.charAt(j))  // back-tracking
                j = next[j];
            j++;
        }
    }
}
Boyer-Moore:
(To be added)
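Until this section is filled in, here is a minimal sketch of a simplified Boyer-Moore that uses only the bad-character (mismatched character) heuristic. This is not the full algorithm with the good-suffix rule, and the class name `BoyerMooreSketch` is made up for this illustration. Storing the rightmost positions in a `HashMap` instead of an array of size R is an assumption made here so that the sketch also works for characters outside ASCII.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Boyer-Moore with the bad-character rule only.
final class BoyerMooreSketch {
    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;

        // right.get(c) = rightmost index of c in pattern (absent means -1).
        Map<Character, Integer> right = new HashMap<>();
        for (int j = 0; j < m; j++) {
            right.put(pattern.charAt(j), j);
        }

        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            // Compare the pattern against text[i..i+m) from right to left.
            for (int j = m - 1; j >= 0; j--) {
                char c = text.charAt(i + j);
                if (pattern.charAt(j) != c) {
                    // Align the rightmost c in pattern with this text position,
                    // but always move forward by at least one.
                    skip = Math.max(1, j - right.getOrDefault(c, -1));
                    break;
                }
            }
            if (skip == 0) return i;   // no mismatch: match found at i
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("mississippi", "issip")); // 4
        System.out.println(search("aaaaa", "bba"));         // -1
    }
}
```

With only the bad-character rule the worst case is still O(m * n), but on typical text the search skips ahead by several characters per mismatch.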
Rabin-Karp:
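Uses a rolling hash. The final step relies on either the Monte Carlo or the Las Vegas method. Time Complexity - O(m * n) in the worst case for the Las Vegas version (O(m + n) expected), Space Complexity - O(1) for a single pattern (O(p) if the hashes of p patterns are stored for multi-pattern search).
(To be added)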
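Until this section is filled in, here is a minimal sketch of the Las Vegas variant: the rolling hash filters candidate positions, and every hash hit is verified character by character, so the result is always correct. The class name `RabinKarpSketch`, the modulus `Q` and the hash radix `R` are assumptions chosen for this illustration, not values from the original post.

```java
// Hypothetical sketch: Rabin-Karp with a rolling hash, Las Vegas version.
final class RabinKarpSketch {
    private static final long Q = 1_000_000_007L; // assumed prime modulus
    private static final int R = 256;             // assumed radix used only for hashing

    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (n < m) return -1;

        long rm = 1;                               // R^(m-1) mod Q, used to drop the leading char
        for (int i = 1; i < m; i++) rm = (rm * R) % Q;

        long patHash = hash(pattern, m);
        long txtHash = hash(text, m);
        if (patHash == txtHash && text.regionMatches(0, pattern, 0, m)) return 0;

        for (int i = m; i < n; i++) {
            // Remove text[i - m] from the window, then append text[i].
            txtHash = (txtHash + Q - rm * text.charAt(i - m) % Q) % Q;
            txtHash = (txtHash * R + text.charAt(i)) % Q;
            int start = i - m + 1;
            // Las Vegas step: verify the candidate to rule out hash collisions.
            if (patHash == txtHash && text.regionMatches(start, pattern, 0, m)) return start;
        }
        return -1;
    }

    // Horner's method over the first len characters of s.
    private static long hash(String s, int len) {
        long h = 0;
        for (int i = 0; i < len; i++) h = (h * R + s.charAt(i)) % Q;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(search("mississippi", "issip")); // 4
        System.out.println(search("aaaaa", "bba"));         // -1
    }
}
```

Dropping the `regionMatches` verification would give the Monte Carlo variant, which runs in linear time but can return a false match with small probability.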
Suffix Array:
Build the Suffix Array in O(n) time (hard to implement), then search it.
(To be added)
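Until this section is filled in, here is a minimal sketch of the idea with a naive construction: the suffixes are sorted by direct string comparison, which is far slower than the O(n) construction mentioned above (roughly O(n^2 log n) here) and is used only to keep the example short. The class name `SuffixArraySketch` is made up for this illustration. The search is a binary search for the first suffix not smaller than the pattern, followed by a scan over the contiguous block of suffixes that start with the pattern to pick the smallest start index.

```java
import java.util.Arrays;

// Hypothetical sketch: naive suffix array plus binary search.
final class SuffixArraySketch {
    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;

        // sa[k] = start index of the k-th smallest suffix (naive construction).
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));

        // Lower bound: first suffix that is >= pattern in lexicographic order.
        int lo = 0, hi = n;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (text.substring(sa[mid]).compareTo(pattern) < 0) lo = mid + 1;
            else hi = mid;
        }

        // Suffixes that start with pattern form a contiguous block beginning at lo;
        // the first occurrence in text is the smallest start index in that block.
        int best = -1;
        for (int k = lo; k < n && text.startsWith(pattern, sa[k]); k++) {
            if (best == -1 || sa[k] < best) best = sa[k];
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(search("banana", "ana"));        // 1
        System.out.println(search("mississippi", "issip")); // 4
    }
}
```

A suffix array pays off when many patterns are searched in the same text, since the sorted array can be built once and reused; for a single strStr call it is overkill.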
Suffix Tree:
This is really a (compressed) Suffix Trie. Use Ukkonen's algorithm to build the Suffix Tree in O(n) time, then search downward from the root.
(To be added)
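Until this section is filled in, here is a minimal sketch of the "search downward from the root" idea using a plain, uncompressed suffix trie. This is not Ukkonen's algorithm: the construction below inserts every suffix one character at a time and takes O(n^2) time and space, so it only illustrates the lookup. The class name `SuffixTrieSketch` and the `firstStart` field are made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: uncompressed suffix trie with naive O(n^2) construction.
final class SuffixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final int firstStart;   // smallest start index of a suffix passing through this node
        Node(int firstStart) { this.firstStart = firstStart; }
    }

    private final Node root = new Node(0);

    SuffixTrieSketch(String text) {
        // Suffixes are inserted in increasing start order, so the suffix that
        // creates a node is the earliest one passing through it.
        for (int i = 0; i < text.length(); i++) {
            final int start = i;
            Node node = root;
            for (int k = i; k < text.length(); k++) {
                node = node.children.computeIfAbsent(text.charAt(k), c -> new Node(start));
            }
        }
    }

    // Walk down from the root along pattern; every path from the root spells
    // a substring of text, so falling off the trie means "not found".
    int indexOf(String pattern) {
        Node node = root;
        for (int k = 0; k < pattern.length(); k++) {
            node = node.children.get(pattern.charAt(k));
            if (node == null) return -1;
        }
        return node.firstStart;
    }

    public static void main(String[] args) {
        SuffixTrieSketch trie = new SuffixTrieSketch("mississippi");
        System.out.println(trie.indexOf("issip")); // 4
        System.out.println(trie.indexOf("ssa"));   // -1
    }
}
```

Once the trie is built, each lookup costs O(m) regardless of the text length, which is the property that makes suffix trees attractive for repeated queries.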
--------------------------------------------------------------------------------
Second pass:
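A classic string search problem. For the second pass I want to solve it with KMP; it is time to understand KMP thoroughly. There is too much material online and it is messy. I read lecture notes from CMU, Stanford, Princeton, UBC and other schools, and every implementation is different, which is maddening for readers. Seen this way, the KMP paper really only gives a general idea, like an interface, and everyone can write their own implementation of that idea. As long as the preprocessing part and the search part are consistent with each other, it is a correct KMP implementation.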
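There are two commonly used matching approaches:
- Prefix table. To find the self-overlaps that tell us how many repeated comparisons we can skip, we compute, for each position, the length of the longest proper prefix that is also a suffix. With this prefix table we can skip repeated comparisons during matching. This mainly follows the CMU lecture notes and the YouTube video.
  - Build the prefix table. Example: pattern = "ACACAGT", table = new int[pattern.length()].
    - Proper prefixes: when the current character is the last "A" (the substring "ACACA"), the set of proper prefixes is A, AC, ACA, ACAC.
    - Proper suffixes: for the same substring "ACACA", the set of proper suffixes is A, CA, ACA, CACA. The longest string that appears in both sets is "ACA", of length 3, so table[4] = 3. Computing every entry of the prefix table this way gives table = [0, 0, 1, 2, 3, 0, 0].

      | Pattern | Proper prefixes | Proper suffixes | Longest self-overlap | j | table[j - 1] |
      |---------|-----------------|-----------------|----------------------|---|--------------|
      | A | null | null | null | 1 | 0 |
      | AC | A | C | null | 2 | 0 |
      | ACA | A, AC | A, CA | A | 3 | 1 |
      | ACAC | A, AC, ACA | C, AC, CAC | AC | 4 | 2 |
      | ACACA | A, AC, ACA, ACAC | A, CA, ACA, CACA | ACA | 5 | 3 |
      | ACACAG | A, AC, ACA, ACAC, ACACA | G, AG, CAG, ACAG, CACAG | null | 6 | 0 |
      | ACACAGT | A, AC, ACA, ACAC, ACACA, ACACAG | T, GT, AGT, CAGT, ACAGT, CACAGT | null | 7 | 0 |

    - The logic and steps for building the prefix table:
      - Create an array table of length m = pattern.length().
      - table[0] = 0. There is no proper prefix or suffix here, so the value is 0.
      - Initialize i = 1, j = 0.
      - While i < m:
        - If pattern[i] == pattern[j]:
          - table[i] = j + 1. We take the longest prefix length j found so far, add 1, and assign it to table[i].
          - i++, j++. Move on to the next character.
        - If pattern[i] != pattern[j]:
          - If j > 0, set j = table[j - 1]. This backtracks j; the next iteration checks again whether pattern[j] == pattern[i].
          - If j == 0, there is no match:
            - Set table[i] = 0.
            - i++. Move on to the next character.
  - Search. The search is basically the same as building the prefix table. The differences are that we traverse text instead of pattern, and there is one extra if statement that decides what to do once the first full match is found.
    - Initialize i = 0, j = 0.
    - n = text.length(), m = pattern.length().
    - While i < n:
      - If pattern[j] == text[i]:
        - If j == m - 1, we have a full match and can return the start index of the match: i - m + 1.
        - Otherwise i++, j++, and keep matching the next character.
      - If pattern[j] != text[i]:
        - If j > 0, set j = table[j - 1] to backtrack j.
        - Otherwise i++. Nothing matched, so we restart matching from the next position in text.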
- Next array. Compute KMP's next array and match with it. The next array is effectively an NFA: when a match fails it tells us where to back up to, and it can drive the matching by itself. This mainly follows the Princeton lecture notes and code. I don't fully understand some of the steps yet, so for now I just record them.
  - Preprocess:
    - Set m = pattern.length(), int[] next = new int[m].
    - Set j = -1. For i = 0; i < m; i++:
      - If i == 0, set next[i] = -1.
      - Else if pattern.charAt(i) != pattern.charAt(j), set next[i] = j.
      - Otherwise (pattern.charAt(i) == pattern.charAt(j)), set next[i] = next[j].
      - While j >= 0 and pattern.charAt(i) != pattern.charAt(j), keep backtracking: j = next[j].
      - j++
  - Search:
    - Define int i, j. m = pattern.length(), n = text.length().
    - Start with i = 0, j = 0. While i < n && j < m:
      - While j >= 0 and text.charAt(i) != pattern.charAt(j), keep backtracking: j = next[j].
      - j++, i++
    - After the loop, if j == m, return i - m, the start index of the first substring that matches pattern.
    - Otherwise return -1.
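As a quick check of the two preprocessing procedures described above, here is a minimal sketch that computes both tables for the example pattern "ACACAGT". The class name `KmpTablesDemo` and the method names are made up for this illustration; the loops follow the listed steps directly, and the expected values were traced by hand from those steps.

```java
import java.util.Arrays;

// Hypothetical demo class: prefix table vs. next array for one pattern.
final class KmpTablesDemo {
    // table[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of it.
    static int[] prefixTable(String pattern) {
        int m = pattern.length();
        int[] table = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                table[i++] = ++j;       // extend the current overlap
            } else if (j > 0) {
                j = table[j - 1];       // backtrack to a shorter overlap
            } else {
                table[i++] = 0;         // no overlap ends here
            }
        }
        return table;
    }

    // next[i] = pattern position to retry after a mismatch at position i
    // (-1 means "advance in the text and restart the pattern").
    static int[] nextArray(String pattern) {
        int m = pattern.length();
        int[] next = new int[m];
        int j = -1;
        for (int i = 0; i < m; i++) {
            if (i == 0) next[i] = -1;
            else if (pattern.charAt(i) != pattern.charAt(j)) next[i] = j;
            else next[i] = next[j];
            while (j >= 0 && pattern.charAt(i) != pattern.charAt(j)) {
                j = next[j];
            }
            j++;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(prefixTable("ACACAGT"))); // [0, 0, 1, 2, 3, 0, 0]
        System.out.println(Arrays.toString(nextArray("ACACAGT")));   // [-1, 0, -1, 0, -1, 3, 0]
    }
}
```

The two arrays encode the same self-overlap information in different conventions, which is why each one has to be paired with its own search loop.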
Java:
KMP - using prefix table:
Time Complexity - O(m + n), Space Complexity - O(m)
public class Solution {
    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length()) {
            return -1;
        }
        if (needle.equals("")) {
            return 0;
        }
        int[] prefixTable = preProcess(needle);
        int i = 0, j = 0;
        int m = needle.length(), n = haystack.length();
        while (i < n) {     // KMP Search
            if (haystack.charAt(i) == needle.charAt(j)) {
                if (j == m - 1) {
                    return i - m + 1;
                }
                i++;
                j++;
            } else if (j > 0) {
                j = prefixTable[j - 1];
            } else {
                i++;
            }
        }
        return -1;
    }

    private int[] preProcess(String pattern) {      // KMP: building prefix table
        int m = pattern.length();
        int[] prefixTable = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                prefixTable[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = prefixTable[j - 1];
            } else {
                prefixTable[i] = 0;
                i++;
            }
        }
        return prefixTable;
    }
}
KMP - build KMP NFA using next array:
public class Solution {
    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length()) {
            return -1;
        }
        if (needle.equals("")) {
            return 0;
        }
        int[] next = preprocess(needle);
        int m = needle.length(), n = haystack.length();
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            while (j >= 0 && haystack.charAt(i) != needle.charAt(j)) {
                j = next[j];
            }
            j++;
        }
        if (j == m) {
            return i - m;
        }
        return -1;
    }

    private int[] preprocess(String pattern) {      // KMP build next array
        int m = pattern.length();
        int[] next = new int[m];
        int j = -1;
        for (int i = 0; i < m; i++) {
            if (i == 0) {
                next[i] = -1;
            } else if (pattern.charAt(i) != pattern.charAt(j)) {
                next[i] = j;
            } else {
                next[i] = next[j];
            }
            while (j >= 0 && pattern.charAt(i) != pattern.charAt(j)) {     // here we continue backtracking j
                j = next[j];
            }
            j++;
        }
        return next;
    }
}
Reference:
http://www.cs.cmu.edu/~ab/211/lectures/
http://www.cs.cmu.edu/~ab/211/lectures/Lecture%2018%20-%20String%20Matching-KMP.ppt
https://www.youtube.com/watch?v=5i7oKodCRJo
http://algs4.cs.princeton.edu/53substring/
http://programmerspatch.blogspot.com/2013/02/ukkonens-suffix-tree-algorithm.html
http://web.stanford.edu/~mjkay/gusfield.pdf
http://www.cs.cmu.edu/~avrim/451/lectures/lect1121.pdf
https://www.topcoder.com/community/data-science/data-science-tutorials/introduction-to-string-searching-algorithms/
http://www.geeksforgeeks.org/searching-for-patterns-set-2-kmp-algorithm/
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
https://web.stanford.edu/class/cs97si/10-string-algorithms.pdf
https://www.cs.princeton.edu/~rs/AlgsDS07/21PatternMatching.pdf
http://www-igm.univ-mlv.fr/~lecroq/string/node8.html
http://www.ics.uci.edu/~eppstein/161/960227.html
http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/