28. Implement strStr()

Problem:

Implement strStr().

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Update (2014-11-02):
The signature of the function had been updated to return the index instead of the pointer. If you still see your function signature returns a char * or String, please click the reload button to reset your code definition.

Link: http://leetcode.com/problems/implement-strstr/

Solution:

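Although this problem is rated easy, it has many solutions, and it gave me quite a headache on a lovely Labor Day afternoon. Options include brute force, KMP, Rabin-Karp, and Suffix Array / Suffix Tree. I really need to study suffix trees properly; I don't understand them yet, but I have a feeling they are (one of) the ultimate weapons for string matching problems. Erik Demaine explains them very clearly in the Strings lecture of MIT's Advanced Data Structures course, which I should watch several times. For KMP, Rabin-Karp and Suffix Array, see the slides and booksite by Sedgewick at Princeton. This post is a placeholder for now; I'll fill in each solution as my study progresses. These topics used to be unknown unknowns and are now known unknowns; I need to work hard to turn them into known knowns. As in Federer's 2016 Mercedes-Benz commercial: commitment, pushing yourself further, faster, reaching ever higher.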

Let's start with brute force:

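Scan from the start and try every alignment of needle against haystack. Time Complexity - O(m * n), Space Complexity - O(1), where n = haystack.length() and m = needle.length().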

public class Solution {
    public int strStr(String haystack, String needle) {
        if(haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        if(needle.length() == 0)
            return 0;
        
        for(int i = 0; i <= haystack.length() - needle.length(); i++) {
            int j = 0;
            
            while(i + j < haystack.length() && j < needle.length() && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
                if(j == needle.length()) 
                    return i;
            }
        }
        
        return -1;
    }
}

KMP using DFA:

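First build a DFA from needle, then scan haystack from its first character; the loop ends at the first occurrence or when haystack is exhausted. R is the radix, i.e. the size of the alphabet (not just the number of distinct characters in needle); for ASCII / extended ASCII we can simply assume 256. Note that with R = 256 the lookup dfa[haystack.charAt(i)][j] throws ArrayIndexOutOfBoundsException for any character with code >= 256. If the input is Unicode, we need the improved KMP below or another method such as Boyer-Moore. For multi-pattern search, Rabin-Karp can be used.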

Time Complexity - O(n), Pre-process Time - O(R * m) (each of the m columns copies R entries), Space Complexity - O(R * m).

public class Solution {
    private final static int R = 256;
    private int[][] dfa;
    
    public int strStr(String haystack, String needle) {
        if(haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        if(needle.length() == 0)
            return 0;
        int hsLen = haystack.length(), ndLen = needle.length();
        dfa = new int[R][ndLen];   
        buildDFA(needle);
        int i, j;
        
        for(i = 0, j = 0; i < haystack.length() && j < needle.length(); i++)
            j = dfa[haystack.charAt(i)][j];
        
        if(j == ndLen)
            return i - ndLen;
        else
            return -1;
    }
    
    private void buildDFA(String needle) {
        int rowNum = dfa.length, colNum = dfa[0].length;
        dfa[needle.charAt(0)][0] = 1;
        
        for(int x = 0, j = 1; j < colNum; j++) {    //x is state, j is col
            for(int c = 0; c < rowNum; c++)         //char
                dfa[c][j] = dfa[c][x];              //copy mismatch cases
            dfa[needle.charAt(j)][j] = j + 1;       //set match cases
            x = dfa[needle.charAt(j)][x];           //update state x - restart case;
        }
    }
}
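
To make the DFA table less abstract, here is a minimal sketch that builds it for a tiny alphabet and prints it. The class name DfaTableSketch, the alphabet string "ABC" and the pattern "ABABAC" are assumptions chosen for illustration; the construction loop mirrors buildDFA above and assumes a non-empty pattern whose characters all appear in the alphabet.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only: same construction as buildDFA,
// but indexed by position in a small alphabet instead of by raw char code.
final class DfaTableSketch {
    // Returns dfa[c][j]: the next state after reading alphabet.charAt(c) in state j.
    static int[][] build(String pattern, String alphabet) {
        int m = pattern.length(), r = alphabet.length();
        int[][] dfa = new int[r][m];
        dfa[alphabet.indexOf(pattern.charAt(0))][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < r; c++) {
                dfa[c][j] = dfa[c][x];      // mismatch: behave like restart state x
            }
            int match = alphabet.indexOf(pattern.charAt(j));
            dfa[match][j] = j + 1;          // match: advance to the next state
            x = dfa[match][x];              // update the restart state
        }
        return dfa;
    }

    public static void main(String[] args) {
        String alphabet = "ABC";
        int[][] dfa = build("ABABAC", alphabet);
        for (int c = 0; c < alphabet.length(); c++) {
            System.out.println(alphabet.charAt(c) + ": " + Arrays.toString(dfa[c]));
        }
        // A: [1, 1, 3, 1, 5, 1]
        // B: [0, 2, 0, 4, 0, 4]
        // C: [0, 0, 0, 0, 0, 6]
    }
}
```

Each column j is a state (j characters matched so far); reaching state 6 means the whole pattern has been matched.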

Improved KMP using NFA:

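An improved KMP. Because of the radix R, the previous version may actually run slower on the judge. Using an NFA instead makes the algorithm independent of the alphabet size, so it also works for Unicode. The construction of the next array is subtle; pay attention to when to backtrack. I don't fully understand this code yet, even after staring at it for a long time, and I'm not sure whether it counts as an NFA or a DFA. I'll add more when I get the chance.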

Time Complexity - O(m + n), Space Complexity - O(m)

public class Solution {
    private int[] next;
    
    public int strStr(String haystack, String needle) {
        if(haystack == null || needle == null || haystack.length() < needle.length())
            return -1;
        int m = needle.length(), n = haystack.length();    
        if(m == 0)                  // empty needle matches at index 0
            return 0;
        buildNFA(needle);
        int i, j = 0;
        
        for(i = 0; i < n && j < m; i++) {
            while(j >= 0 && haystack.charAt(i) != needle.charAt(j))
                j = next[j];
            j++;
        }
        
        if(j == m)
            return i - m;
        else
            return -1;
    }
    
    private void buildNFA(String needle) {
        int m = needle.length();
        next = new int[m];
        int j = -1;
        
        for(int i = 0; i < m; i++) {
            if(i == 0)                                              //initialize
                next[i] = -1;                                   
            else if(needle.charAt(i) != needle.charAt(j))           //back-tracking
                next[i] = j;
            else                                                    //copy equal cases
                next[i] = next[j];
                
            while(j >= 0 && needle.charAt(i) != needle.charAt(j))   //back-tracking
                j = next[j];
            j++;
        }
    }
}

Boyer-Moore:

(To be added)
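
As a placeholder until the full write-up, here is a minimal sketch of Boyer-Moore using only the bad-character (mismatched character) heuristic, not the complete algorithm with the good-suffix rule. The class name BoyerMooreSketch is hypothetical, and a HashMap replaces the usual int[R] table so the sketch is not tied to a 256-character alphabet; that choice is my assumption, not something taken from the references.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: bad-character heuristic, worst case still O(m * n).
final class BoyerMooreSketch {
    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        // right.get(c) = index of the rightmost occurrence of c in pattern.
        Map<Character, Integer> right = new HashMap<>();
        for (int j = 0; j < m; j++) right.put(pattern.charAt(j), j);
        int skip;
        for (int i = 0; i <= n - m; i += skip) {
            skip = 0;
            // Compare the pattern against the text from right to left.
            for (int j = m - 1; j >= 0; j--) {
                char c = text.charAt(i + j);
                if (pattern.charAt(j) != c) {
                    // Align the rightmost c in the pattern with the text, or move by 1.
                    skip = Math.max(1, j - right.getOrDefault(c, -1));
                    break;
                }
            }
            if (skip == 0) return i;    // no mismatch: match found at i
        }
        return -1;
    }
}
```

For example, BoyerMooreSketch.search("mississippi", "issip") returns 4.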

Rabin-Karp:

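Uses a rolling hash. A hash hit is then handled with either the Monte Carlo method (trust the hash) or the Las Vegas method (verify the match character by character). Time Complexity - O(m * n) in the worst case for the Las Vegas version, O(m + n) expected; Space Complexity - O(p)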

(To be added)
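
Until the full write-up is added, here is a minimal sketch of the Las Vegas variant: a rolling hash over a window of length m, with every hash hit verified by a real comparison. The class name RabinKarpSketch, the radix R = 256 and the modulus Q = 1_000_000_007 are illustrative assumptions (implementations often pick a random large prime instead of a fixed one).

```java
// Sketch only: Rabin-Karp with a fixed modulus and explicit verification.
final class RabinKarpSketch {
    private static final int R = 256;               // radix (assumed)
    private static final long Q = 1_000_000_007L;   // prime modulus (assumed)

    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (n < m) return -1;
        long rm = 1;                                // R^(m-1) mod Q
        for (int k = 1; k < m; k++) rm = (rm * R) % Q;
        long patHash = hash(pattern, m);
        long txtHash = hash(text, m);
        for (int i = 0; ; i++) {
            // Las Vegas step: confirm a hash hit with a real comparison.
            if (patHash == txtHash && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;
            // Roll the window: drop text[i], append text[i + m].
            txtHash = (txtHash + Q - rm * text.charAt(i) % Q) % Q;
            txtHash = (txtHash * R + text.charAt(i + m)) % Q;
        }
    }

    // Horner's rule: hash of the first len characters of s, modulo Q.
    private static long hash(String s, int len) {
        long h = 0;
        for (int k = 0; k < len; k++) h = (h * R + s.charAt(k)) % Q;
        return h;
    }
}
```

Dropping the regionMatches check would turn this into the Monte Carlo variant, which is always linear but can report a false match with small probability.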

Suffix Array:

Build the suffix array in O(n) time (hard to implement), then search with it.

(To be added)
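
Until the full write-up is added, here is a minimal sketch of the search side. It does NOT use an O(n) construction: the suffix array is built naively by sorting suffix start positions with substring comparisons, which is only acceptable for short inputs. The class name SuffixArraySketch is hypothetical.

```java
import java.util.Arrays;

// Sketch only: naive suffix array construction plus binary search.
final class SuffixArraySketch {
    static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        // Naive construction: sort start positions by the suffixes they denote.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        // Binary search for the first suffix that is >= pattern.
        int lo = 0, hi = n;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (text.substring(sa[mid]).compareTo(pattern) < 0) lo = mid + 1;
            else hi = mid;
        }
        // Suffixes that start with pattern are contiguous from lo;
        // strStr wants the smallest start index among them.
        int best = -1;
        for (int k = lo; k < n && text.startsWith(pattern, sa[k]); k++) {
            if (best == -1 || sa[k] < best) best = sa[k];
        }
        return best;
    }
}
```

For "banana" and "ana" the matching suffixes start at 3 and 1, so the sketch returns 1. In practice the array would be built once and reused across many queries.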

Suffix Tree:

A suffix tree is really a compressed suffix trie. Build it in O(n) time with Ukkonen's algorithm, then search downward from the root.

(To be added)
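
Until the full write-up is added, here is a minimal sketch of the "search downward from the root" idea using an uncompressed suffix trie. This is not Ukkonen's algorithm: inserting every suffix character by character takes O(n^2) time and space, so it only illustrates the lookup. The class name SuffixTrieSketch and the firstStart field are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: naive suffix trie, O(n^2) construction.
final class SuffixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final int firstStart;   // smallest start index of a suffix passing through this node
        Node(int firstStart) { this.firstStart = firstStart; }
    }

    private final Node root = new Node(0);

    SuffixTrieSketch(String text) {
        // Insert suffixes in increasing start order, so the suffix that
        // creates a node is the one with the smallest start index.
        for (int i = 0; i < text.length(); i++) {
            final int start = i;
            Node cur = root;
            for (int k = i; k < text.length(); k++) {
                cur = cur.children.computeIfAbsent(text.charAt(k), c -> new Node(start));
            }
        }
    }

    // Walk down from the root along pattern; the node reached knows the first occurrence.
    int indexOf(String pattern) {
        Node cur = root;
        for (int k = 0; k < pattern.length(); k++) {
            cur = cur.children.get(pattern.charAt(k));
            if (cur == null) return -1;
        }
        return cur.firstStart;
    }
}
```

Once the trie exists, each lookup costs O(m) regardless of the text length, which is why the structure is attractive when the same haystack is queried many times.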

--------------------------------------------------------------------------------

Second pass:

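A classic string search problem. This second time through I want to solve it with KMP; it's time to understand KMP thoroughly. There is too much material online and it is messy: I read lecture notes from CMU, Stanford, Princeton, UBC and others, and every implementation is different, which is maddening. Seen this way, the KMP paper really just gives a general idea, like an interface, and everyone can write their own flavor of KMP from it. As long as the preprocessing and search parts agree with each other, it is a correct KMP implementation.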

Two approaches are commonly used:

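Approach 1 (prefix table): for each position in the pattern we look for self-overlaps to decide how much redundant matching we can skip, i.e. we find the length of the longest proper prefix that is also a suffix. With this prefix table we can skip redundant comparisons during matching. This mainly follows the CMU lecture notes and the YouTube video.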

Building the prefix table. Example: pattern = "ACACAGT", table = new int[pattern.length()].

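A proper prefix is any prefix other than the string itself. For example, for "ACACA" (ending at the last "A") the proper prefixes are A, AC, ACA, ACAC.
A proper suffix is defined the same way from the other end. For "ACACA" the proper suffixes are A, CA, ACA, CACA.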

Pattern	Prefix	Suffix	longest self-overlap	j	table
A	null	null	null	1	0
AC	A	C	null	2	0
ACA	A, AC	A, CA	A	3	1
ACAC	A, AC, ACA	C, AC, CAC	AC	4	2
ACACA	A, AC, ACA, ACAC	A, CA, ACA, CACA	ACA	5	3
ACACAG	A, AC, ACA, ACAC, ACACA	G, AG, CAG, ACAG, CACAG	null	6	0
ACACAGT	A, AC, ACA, ACAC, ACACA, ACACAG	T, GT, AGT, CAGT, ACAGT, CACAGT	null	7	0
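
The table above can be cross-checked straight from the definition. This is a minimal sketch (the class name PrefixTableByDefinition is hypothetical) that, for every prefix of the pattern, tries each proper length from longest to shortest. It is deliberately brute force and is not the linear-time KMP construction.

```java
import java.util.Arrays;

// Sketch only: prefix table computed directly from the definition.
final class PrefixTableByDefinition {
    static int[] build(String pattern) {
        int m = pattern.length();
        int[] table = new int[m];
        for (int i = 0; i < m; i++) {
            String sub = pattern.substring(0, i + 1);
            // Try proper lengths (len < sub.length()) from longest to shortest.
            for (int len = i; len > 0; len--) {
                String suffix = sub.substring(sub.length() - len);
                if (sub.startsWith(suffix)) {   // prefix of length len == suffix of length len
                    table[i] = len;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(build("ACACAGT")));
        // [0, 0, 1, 2, 3, 0, 0]
    }
}
```

The printed values match the last column of the table row by row.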

The logic and steps for building the prefix table:
1. Create an array table of length m = pattern.length()
2. table[0] = 0: a single character has no proper prefix or suffix, so the value is 0
3. Initialize i = 1, j = 0
4. While i < m:
  1. pattern[i] == pattern[j]:
    1. table[i] = j + 1: take the previous longest-prefix length j, add 1, and store it in table[i]
    2. i++, j++: move on to the next character
  2. pattern[i] != pattern[j]:
    1. If j > 0, set j = table[j - 1]. This backtracks j; the next iteration checks again whether pattern[j] == pattern[i]
    2. If j == 0, there is no match:
      1. set table[i] = 0
      2. i++ to process the next character

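Search. The search is almost the same as building the prefix table. The differences are that we iterate over text instead of pattern, and there is one extra if statement that decides what to do once the first full match is found.
1. Initialize i = 0, j = 0
2. n = text.length(), m = pattern.length()
3. While i < n:
  1. pattern[j] == text[i]:
    1. If j == m - 1, we have a full match and return the start index of the match: i - m + 1
    2. Otherwise i++, j++ and keep matching the next character
  2. pattern[j] != text[i]:
    1. If j > 0, set j = table[j - 1] to backtrack j
    2. Otherwise i++: nothing matched, so restart matching from the next position in text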
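
strStr stops at the first full match, but the same loop can keep going. Here is a minimal sketch of one way to continue after a match: treat the full match like a mismatch one step later and fall back to table[m - 1], so overlapping occurrences are found too. The class name KmpFindAllSketch and the decision to report no hits for an empty pattern are my assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: prefix-table KMP that reports every occurrence.
final class KmpFindAllSketch {
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || n < m) return hits;   // assumption: empty pattern yields no hits
        int[] table = prefixTable(pattern);
        int i = 0, j = 0;
        while (i < n) {
            if (text.charAt(i) == pattern.charAt(j)) {
                i++;
                j++;
                if (j == m) {
                    hits.add(i - m);        // full match ending just before i
                    j = table[m - 1];       // keep the longest self-overlap and continue
                }
            } else if (j > 0) {
                j = table[j - 1];           // backtrack j, keep i
            } else {
                i++;                        // nothing matched, advance in text
            }
        }
        return hits;
    }

    // Same construction as the steps listed above.
    private static int[] prefixTable(String pattern) {
        int m = pattern.length();
        int[] table = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                table[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = table[j - 1];
            } else {
                table[i] = 0;
                i++;
            }
        }
        return table;
    }
}
```

For example, findAll("aaaa", "aa") yields [0, 1, 2], since the matches overlap.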

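Approach 2 (next array): compute the KMP next array and match with it. The next array is effectively an NFA that tells us where to backtrack when a match fails. This mainly follows the Princeton lecture notes and code; I don't fully understand some of the steps yet, so I'm just recording them for now.
1. Preprocess:
  1. Set m = pattern.length(), int[] next = new int[m]
  2. Set j = -1
  3. For i from 0 to m - 1:
    1. If i == 0, next[i] = -1; else if pattern[i] != pattern[j], next[i] = j; otherwise next[i] = next[j]
    2. While j >= 0 and pattern[i] != pattern[j], backtrack with j = next[j]
    3. j++
2. Search:
  1. Declare int i, j. m = pattern.length(), n = text.length()
  2. Start with i = 0, j = 0; while i < n && j < m:
    1. While j >= 0 and text.charAt(i) != pattern.charAt(j), backtrack j repeatedly with j = next[j]
    2. j++ (and i++ at the end of each iteration)
  3. After the loop, if j == m, return i - m, the start index of the first substring that matches pattern
  4. Otherwise return -1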

Java:

KMP - using prefix table:

Time Complexity - O(m + n), Space Complexity - O(m)

public class Solution {
    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length()) {
            return -1;
        }
        if (needle.equals("")) {
            return 0;
        }
        int[] prefixTable = preProcess(needle);
        int i = 0, j = 0;
        int m = needle.length(), n = haystack.length();
        while (i < n) {                                         // KMP Search 
            if (haystack.charAt(i) == needle.charAt(j)) {
                if (j == m - 1) {
                    return i - m + 1;
                }
                i++;
                j++;
            } else if (j > 0) {
                j = prefixTable[j - 1];
            } else {
                i++;
            }
        }
        return -1;
    }
    
    private int[] preProcess(String pattern) {  // KMP: building prefix table
        int m = pattern.length();
        int[] prefixTable = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(j)) {
                prefixTable[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = prefixTable[j - 1];
            } else {
                prefixTable[i] = 0;
                i++;
            }
        }
        return prefixTable;
    }
}

KMP - build KMP NFA using next array:

public class Solution {
    public int strStr(String haystack, String needle) {
        if (haystack == null || needle == null || haystack.length() < needle.length()) {
            return -1;
        }
        if (needle.equals("")) {
            return 0;
        }
        int[] next = preprocess(needle);
        int m = needle.length(), n = haystack.length();
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            while (j >= 0 && haystack.charAt(i) != needle.charAt(j)) {
                j = next[j];
            }
            j++;
        }
        if (j == m) {
            return i - m;
        }
        return -1;
    }
    
    private int[] preprocess(String pattern) {  // KMP build next array
        int m = pattern.length();
        int[] next = new int[m];
        int j = -1;
        for (int i = 0; i < m; i++) {
            if (i == 0) {
                next[i] = -1;
            } else if (pattern.charAt(i) != pattern.charAt(j)) {
                next[i] = j;
            } else {
                next[i] = next[j];
            }
            while (j >= 0 && pattern.charAt(i) != pattern.charAt(j)) {   // here we continue backtracking j
                j = next[j];
            }
            j++;
        }
        return next;
    }
}
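
Since I still don't fully trust my understanding of the next array, a randomized cross-check against String.indexOf is a cheap sanity test. This is a minimal sketch: the class name StrStrCrossCheck, the two-letter alphabet, the string lengths and the seed are all arbitrary assumptions. kmpNext is a compact transcription of the next-array solution above, and any other strStr implementation can be passed in the same way.

```java
import java.util.Random;
import java.util.function.ToIntBiFunction;

// Sketch only: compares a strStr implementation with String.indexOf on random inputs.
final class StrStrCrossCheck {
    // Next-array KMP, transcribed from the solution above as the implementation under test.
    static int kmpNext(String haystack, String needle) {
        int n = haystack.length(), m = needle.length();
        int[] next = new int[m];
        for (int i = 0, j = -1; i < m; i++) {
            if (i == 0) next[i] = -1;
            else if (needle.charAt(i) != needle.charAt(j)) next[i] = j;
            else next[i] = next[j];
            while (j >= 0 && needle.charAt(i) != needle.charAt(j)) j = next[j];
            j++;
        }
        int i = 0, j = 0;
        for (; i < n && j < m; i++) {
            while (j >= 0 && haystack.charAt(i) != needle.charAt(j)) j = next[j];
            j++;
        }
        return j == m ? i - m : -1;
    }

    // Counts the random inputs on which impl disagrees with String.indexOf.
    static int mismatches(ToIntBiFunction<String, String> impl, int rounds, long seed) {
        Random rnd = new Random(seed);
        int bad = 0;
        for (int r = 0; r < rounds; r++) {
            String h = randomString(rnd, rnd.nextInt(12));
            String nd = randomString(rnd, rnd.nextInt(5));
            if (impl.applyAsInt(h, nd) != h.indexOf(nd)) bad++;
        }
        return bad;
    }

    // A two-letter alphabet makes repeats and partial matches very likely.
    private static String randomString(Random rnd, int len) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++) sb.append((char) ('a' + rnd.nextInt(2)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + mismatches(StrStrCrossCheck::kmpNext, 10_000, 42L));
    }
}
```

A small alphabet is a deliberate choice here: it stresses the backtracking paths, which is where KMP bugs tend to hide.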

Reference:

http://www.cs.cmu.edu/~ab/211/lectures/

http://www.cs.cmu.edu/~ab/211/lectures/Lecture%2018%20-%20String%20Matching-KMP.ppt

https://www.youtube.com/watch?v=5i7oKodCRJo

http://algs4.cs.princeton.edu/53substring/

http://programmerspatch.blogspot.com/2013/02/ukkonens-suffix-tree-algorithm.html

http://web.stanford.edu/~mjkay/gusfield.pdf

http://www.cs.cmu.edu/~avrim/451/lectures/lect1121.pdf

https://www.topcoder.com/community/data-science/data-science-tutorials/introduction-to-string-searching-algorithms/

http://www.geeksforgeeks.org/searching-for-patterns-set-2-kmp-algorithm/

https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm

https://web.stanford.edu/class/cs97si/10-string-algorithms.pdf

https://www.cs.princeton.edu/~rs/AlgsDS07/21PatternMatching.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

http://www.ics.uci.edu/~eppstein/161/960227.html

http://jakeboxer.com/blog/2009/12/13/the-knuth-morris-pratt-algorithm-in-my-own-words/

posted @ 2015-04-17 15:30 YRB
