String Matching Algorithms

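Terminology: the text S is the string being searched, and the pattern P is the string being searched for. For example, when looking for ab in cbabce, cbabce is the text and ab is the pattern. Let n be the length of S and k the length of P.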

Brute-Force Algorithm

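The most obvious approach: align P with the start of S and compare character by character. As soon as a mismatch is found, stop, shift P one position to the right, and restart the comparison from the first character of P. In the worst case, nearly every character of P is compared at each of the n - k + 1 alignments, for O(nk) character comparisons.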
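The procedure above can be sketched in Java as follows. This is a minimal illustration, not code from the original post; the class name `BruteForceSearch` and the convention of returning -1 when there is no match are assumptions made for the example.

```java
public class BruteForceSearch {
    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        // Try every alignment of P against S, from left to right.
        for (int i = 0; i + k <= n; i++) {
            int j = 0;
            // Compare character by character until a mismatch or the end of P.
            while (j < k && s.charAt(i + j) == p.charAt(j)) {
                j++;
            }
            if (j == k) {
                return i; // all k characters matched at alignment i
            }
            // Mismatch: the loop shifts P by one and restarts from P's first character.
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("cbabce", "ab")); // prints 2
    }
}
```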

KMP Algorithm (1970)

KMP stands for Knuth-Morris-Pratt, the surname initials of its three inventors.

Underlying idea: on a mismatch, use a "partial match table" to skip as many positions that cannot match as possible.

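Main steps: precompute the "partial match table" from P. Slide P along S from left to right, and at each alignment compare characters from left to right. When a mismatch occurs, the part of P before the mismatched character has already matched; it is a prefix of P. Look up the partial match value of that matched prefix in the table and compute how far P should shift: shift = number of characters in the matched prefix - its partial match value. If no character has matched yet, P simply shifts by one.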
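The shift rule above can be sketched in Java as follows. This is a hedged illustration rather than the original author's code: the class name `KmpTableSearch`, the helper `buildTable`, and the -1 return value for "not found" are all assumptions made for the example. The table is built with the standard linear-time fallback loop.

```java
public class KmpTableSearch {
    /** table[i] is the partial match value of the prefix p.substring(0, i + 1). */
    static int[] buildTable(String p) {
        int[] table = new int[p.length()];
        int len = 0; // partial match value of the previous prefix
        for (int i = 1; i < p.length(); i++) {
            // Fall back to shorter candidates until the next character extends one.
            while (len > 0 && p.charAt(i) != p.charAt(len)) {
                len = table[len - 1];
            }
            if (p.charAt(i) == p.charAt(len)) {
                len++;
            }
            table[i] = len;
        }
        return table;
    }

    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        if (k == 0) {
            return 0;
        }
        int[] table = buildTable(p);
        int start = 0;   // current alignment of P within S
        int matched = 0; // characters of P already known to match at this alignment
        while (start + k <= n) {
            while (matched < k && s.charAt(start + matched) == p.charAt(matched)) {
                matched++;
            }
            if (matched == k) {
                return start;
            }
            if (matched == 0) {
                start++; // nothing matched: shift by one
            } else {
                int value = table[matched - 1];
                start += matched - value; // shift = matched characters - partial match value
                matched = value;          // this many characters still match after the shift
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("BBC ABCDAB ABCDABCDABDE", "ABCDABD")); // prints 15
    }
}
```

Because `matched` is carried over after each shift, no character of S is compared again once it has matched, which is where the linear running time comes from.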

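Partial match value: the length of the longest string that is both a proper prefix and a proper suffix. For example, the proper prefixes of ABD are A and AB, and its proper suffixes are BD and D. They have nothing in common, so the partial match value of ABD is 0.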
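The definition can be checked directly with a small Java sketch. The class and method names here (`PartialMatchTable`, `partialMatchValue`, `table`) are hypothetical, and the code deliberately mirrors the definition rather than aiming for speed. Applying the function to every prefix of a pattern gives that pattern's list of partial match values.

```java
import java.util.Arrays;

public class PartialMatchTable {
    /** Length of the longest proper prefix of str that is also a proper suffix of str. */
    static int partialMatchValue(String str) {
        // Try candidate lengths from longest to shortest; "proper" excludes str itself.
        for (int len = str.length() - 1; len > 0; len--) {
            String suffix = str.substring(str.length() - len);
            if (str.startsWith(suffix)) {
                return len;
            }
        }
        return 0;
    }

    /** Partial match value of every prefix of p, shortest prefix first. */
    static int[] table(String p) {
        int[] t = new int[p.length()];
        for (int i = 0; i < p.length(); i++) {
            t[i] = partialMatchValue(p.substring(0, i + 1));
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(partialMatchValue("ABD"));           // prints 0
        System.out.println(Arrays.toString(table("ABCDAB")));   // prints [0, 0, 0, 0, 1, 2]
    }
}
```

This direct version takes cubic time in the pattern length in the worst case; practical implementations compute the same values in linear time.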

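Partial match table: computing the partial match value of every prefix of P yields P's partial match table.

What "partial match" really means: sometimes the beginning and the end of a string repeat. For example, "ABCDAB" contains "AB" twice, so its partial match value is 2 (the length of "AB"). When the pattern shifts, moving the first "AB" 4 positions to the right (string length - partial match value = 6 - 2) brings it to where the second "AB" was.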

See also: http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html

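Java implementation (from https://algs4.cs.princeton.edu/53substring/KMP.java.html). Note that this version drives the search with a precomputed DFA rather than with the partial match table described above: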


/******************************************************************************
 *  Compilation:  javac KMP.java
 *  Execution:    java KMP pattern text
 *  Dependencies: StdOut.java
 *
 *  Reads in two strings, the pattern and the input text, and
 *  searches for the pattern in the input text using the
 *  KMP algorithm.
 *
 *  % java KMP abracadabra abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad 
 *  pattern:               abracadabra          
 *
 *  % java KMP rab abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad 
 *  pattern:         rab
 *
 *  % java KMP bcara abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad 
 *  pattern:                                   bcara
 *
 *  % java KMP rabrabracad abacadabrabracabracadabrabrabracad 
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern:                        rabrabracad
 *
 *  % java KMP abacad abacadabrabracabracadabrabrabracad
 *  text:    abacadabrabracabracadabrabrabracad
 *  pattern: abacad
 *
 ******************************************************************************/

/**
 *  The {@code KMP} class finds the first occurrence of a pattern string
 *  in a text string.
 *  <p>
 *  This implementation uses a version of the Knuth-Morris-Pratt substring search
 *  algorithm. The version takes time proportional to <em>n</em> + <em>m R</em>
 *  in the worst case, where <em>n</em> is the length of the text string,
 *  <em>m</em> is the length of the pattern, and <em>R</em> is the alphabet size.
 *  It uses extra space proportional to <em>m R</em>.
 *  <p>
 *  For additional documentation,
 *  see <a href="https://algs4.cs.princeton.edu/53substring">Section 5.3</a> of
 *  <i>Algorithms, 4th Edition</i> by Robert Sedgewick and Kevin Wayne.
 */
public class KMP {
    private final int R;       // the radix
    private int[][] dfa;       // the KMP automaton

    private char[] pattern;    // either the character array for the pattern
    private String pat;        // or the pattern string

    /**
     * Preprocesses the pattern string.
     *
     * @param pat the pattern string
     */
    public KMP(String pat) {
        this.R = 256;
        this.pat = pat;

        // build DFA from pattern
        int m = pat.length();
        dfa = new int[R][m]; 
        dfa[pat.charAt(0)][0] = 1; 
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) 
                dfa[c][j] = dfa[c][x];     // Copy mismatch cases. 
            dfa[pat.charAt(j)][j] = j+1;   // Set match case. 
            x = dfa[pat.charAt(j)][x];     // Update restart state. 
        } 
    } 

    /**
     * Preprocesses the pattern string.
     *
     * @param pattern the pattern string
     * @param R the alphabet size
     */
    public KMP(char[] pattern, int R) {
        this.R = R;
        this.pattern = new char[pattern.length];
        for (int j = 0; j < pattern.length; j++)
            this.pattern[j] = pattern[j];

        // build DFA from pattern
        int m = pattern.length;
        dfa = new int[R][m]; 
        dfa[pattern[0]][0] = 1; 
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < R; c++) 
                dfa[c][j] = dfa[c][x];     // Copy mismatch cases. 
            dfa[pattern[j]][j] = j+1;      // Set match case. 
            x = dfa[pattern[j]][x];        // Update restart state. 
        } 
    } 

    /**
     * Returns the index of the first occurrence of the pattern string
     * in the text string.
     *
     * @param  txt the text string
     * @return the index of the first occurrence of the pattern string
     *         in the text string; n (the text length) if no such match
     */
    public int search(String txt) {

        // simulate operation of DFA on text
        int m = pat.length();
        int n = txt.length();
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[txt.charAt(i)][j];
        }
        if (j == m) return i - m;    // found
        return n;                    // not found
    }

    /**
     * Returns the index of the first occurrence of the pattern string
     * in the text string.
     *
     * @param  text the text string
     * @return the index of the first occurrence of the pattern string
     *         in the text string; n (the text length) if no such match
     */
    public int search(char[] text) {

        // simulate operation of DFA on text
        int m = pattern.length;
        int n = text.length;
        int i, j;
        for (i = 0, j = 0; i < n && j < m; i++) {
            j = dfa[text[i]][j];
        }
        if (j == m) return i - m;    // found
        return n;                    // not found
    }


    /** 
     * Takes a pattern string and an input string as command-line arguments;
     * searches for the pattern string in the text string; and prints
     * the first occurrence of the pattern string in the text string.
     *
     * @param args the command-line arguments
     */
    public static void main(String[] args) {
        String pat = args[0];
        String txt = args[1];
        char[] pattern = pat.toCharArray();
        char[] text    = txt.toCharArray();

        KMP kmp1 = new KMP(pat);
        int offset1 = kmp1.search(txt);

        KMP kmp2 = new KMP(pattern, 256);
        int offset2 = kmp2.search(text);

        // print results
        StdOut.println("text:    " + txt);

        StdOut.print("pattern: ");
        for (int i = 0; i < offset1; i++)
            StdOut.print(" ");
        StdOut.println(pat);

        StdOut.print("pattern: ");
        for (int i = 0; i < offset2; i++)
            StdOut.print(" ");
        StdOut.println(pat);
    }
}


Copyright © 2000–2017, Robert Sedgewick and Kevin Wayne.
Last updated: Tue Feb 6 02:05:56 EST 2018.


Boyer-Moore Algorithm (1977)

Underlying idea: after each failed match attempt, skip as many positions that cannot match as possible.

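Main steps: slide P along S from left to right, and at each alignment compare characters from right to left. On a mismatch (call the mismatched character in S c), search the not-yet-compared characters of P from right to left for c: 1. if c is not found, shift P to just past c for the next attempt; 2. otherwise, shift P so that this occurrence lines up with c. In addition, if part of P's suffix had already matched when the mismatch occurred (the "good suffix"), that information can be used to skip even more positions in some cases (the search still produces correct results without the good-suffix rule); see the article linked below.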
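The two mismatch cases above can be sketched in Java as follows. This is a simplified illustration that uses only the mismatched-character rule and leaves out the good-suffix rule, so it is not the full Boyer-Moore algorithm; the class name `BadCharacterSearch` and the -1 return value are assumptions made for the example. It scans P with `String.lastIndexOf` on every mismatch to stay close to the description, whereas practical implementations usually precompute a lookup table.

```java
public class BadCharacterSearch {
    /** Returns the index of the first occurrence of p in s, or -1 if there is none. */
    public static int search(String s, String p) {
        int n = s.length();
        int k = p.length();
        if (k == 0) {
            return 0;
        }
        int start = 0; // current alignment of P within S
        while (start + k <= n) {
            int j = k - 1;
            // Compare from the last character of P toward the first.
            while (j >= 0 && s.charAt(start + j) == p.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return start; // every character matched
            }
            char c = s.charAt(start + j); // the mismatched character in S
            // Search the not-yet-compared part of P (indices 0..j-1) from right to left.
            int pos = p.lastIndexOf(c, j - 1);
            // pos == -1: c is absent, so P moves entirely past c (shift j + 1).
            // Otherwise: shift so that p[pos] lines up with c (shift j - pos).
            start += j - pos;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")); // prints 17
    }
}
```

The shift is always at least 1 because `pos` is at most `j - 1`, so the loop is guaranteed to make progress.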

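Complexity: with the good-suffix rule included, the worst case is O(n+k) when P does not occur in S; with only the mismatched-character rule described above it can degrade to O(nk). In practice, the longer P is, the faster the search tends to run, because each mismatch can skip more positions that cannot match, which reduces the number of comparisons.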

See also: http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html

posted @ 2019-11-26 12:00 March On
