[转载]后缀数组算法总结

原作者：远航之曲；

博主声明：博主太懒了+打字极慢，于是在征得远航之曲大神的同意后直接转载他的博文，只将代码替换为本人的代码，原作者的代码里存在一定数量的让人不知所谓的压行——当然其实市面上大都是这样压着行写的，然而博主业界良心，然后基本每句一注释，最后再次感谢曲大神

又及：变量表与函数表：

int  c[10010];//c[sort key]=the number of the strings which have this key（桶）;
int  x[10010];//x[string's NUM]=the rank with the sort key now;
int  y[10010];//y[the rank of the second key]=string's NUM;
char s[10010];//s[the adress in the string]=the char;
int sa[10010];//sa[the rank of the string]=num;
int rank[10010];//rank[the string's num]=rank;
int height[10010];//height[rank]=len of lcp
int n,m;//n:the len of the string,m:the diction_num
void build_SA();
void build_rank();
void build_height();

View Code

传送门

后缀数组提供了一种新思想——倍增和字符串的有趣结合

先来接触一些基础定义

子串：就是字符串的一部分，必须连续。
后缀：是一种子串，它的结尾必须为字符串的最后。
大小比较：就是字典序比较，从头开始比，不相同的话字典序大的那个大，假如相同就向后移动。假如移到其中一个串的结尾还相同的话，长的那个大。
后缀数组：把所有的后缀编号，排序后把编号存在这个数组里。
名次数组：存的是每个后缀的名次

求sa有很多的方法，比较常用的是倍增法，能在 $O (n \log n)$

我们使用的例子是“aabaaaab”它的结果如图所示:

我们先把它的所有后缀都列出来：

如何排序呢，最好想的就是万能的sort。但是效率太低了，是 $O (n^{2} \log n)$

这样的话我们就需要用到某种 $O (n)$

何为计数排序呢，

一般的排序都需要比较其中元素的值，这种排序的复杂度下限就是 $O (n \log n)$

步骤：

1、初始化辅助数组c[i]。

2、遍历每一个输入元素，如果一个输入元素为i，则辅助数组中相应的c[i]的值加1。执行完毕之后。数组c中存储的就是各个键值在输入数组中出现的次数。

3、再计算一下前缀和，我们就知道了小于每个数的个数了。

4、将a[i]放到它在输出数组的正确位置上。对于一个值来说，c[a[i]]的值就是它在输出数组中的正确位置。在做这步的时候，把c[i]减1，这样就能处理相同值的情况了。

计数排序的时间复杂度就为 $O (n)$

这时候，我们就可以使用倍增。因为倍增的话，每次的关键字长度都会变成它原来的两倍，而在后缀中，满足在同一个字符串中的性质，它其中有很多重叠的部分。实际上，我们可以把倍增出来的关键字变成两份，一份是上次关键字排序出来的数组，因为有重复的元素，所以对于另一半来说，它们一定是上一次排出来序的字符串的某个后缀！因为上一次已经让关键字有序了，所以我们直接把上一次排序的数组后移就可以了！前面留空的就是长度不到关键字长度的串，直接按它们的长度摆上就行了。

现在我们就得到了 $O (n \log n)$

接下来我们一点一点地解释一下关于怎么实现的问题：

这里，

n表示字符串的长度
c是上文所述的辅助数组
x表示rank数组，下标表示关键字的位置，值表示关键字大小（rank），相同的值有相同的rank。初始化为字符串s的每个字符大小（此时x并不代表rank，只借助其值比较相对大小）。在每次迭代后，根据sa重新赋值，并代表rank。其实，我觉得可以这么理解，x存的是每个后缀的前缀的种类。因为一开始前缀的长度为1，所以它存的就是s每个字符的大小。以后将每个前缀有序编号了以后这里存的就是字符串的编号。
y表示排序后的第二关键字数组，下标表示名次，值代表第二关键字的首字符位置。
sa构造完成前表示关键字数组，下标表示名次，值表示关键字的首字符位置，值相同的时候名次根据在原串中相对位置的先后决定；构造完成后表示后缀数组，下标表示名次，值表示后缀的首字符位置。

我们先看一下第一段代码：

 1     for(i=1;i<=m;i++)c[i]=0;
 2     //memset;
 3     for(i=0;i<n;i++)x[i]=s[i];
 4     //make x when the sort key is only one char;
 5     for(i=0;i<n;i++)c[x[i]]++;
 6     //sort;
 7     for(i=2;i<=m;i++)c[i]+=c[i-1];
 8     //Prefix sum to find rank;
 9     for(i=n-1;i>=0;i--)sa[--c[x[i]]]=i;
10     //make sa with the first char in each string;
11 //sort with the first char;

这其实就是对每个后缀的第一位进行一下计数排序。唯一需要注意的一点就是最后一个for是从n-1开始循环。这样的话字符串位置相对靠前的后缀就会先出现

排序完是这样的

然后:

    for(k=1;k<=n;k<<=1){
        int num=0;
        for(i=n-k;i<n;i++)y[num++]=i;
        //get the string which don't have the NO.k char;
        for(i=0;i<n;i++)if(sa[i]>=k)y[num++]=sa[i]-k;
        //use the stringB which has the head char that is the No.k char in the stringA to sort stringA;
//sort with the second key;
... ...
    }

k表示关键字的长度

这里只用了两行语句就完成了对第二关键字的排序

注意因为直接后移上次排序的数组，所以下标需要减k。

第一次排完是这样的:

然后就是一些非常神的操作了:

　　　　 for(i=1;i<=m;i++)c[i]=0;
        //memset;
        for(i=0;i<=n;i++)c[x[i]]++;
        //sort with the last total sort key which could be the first key now;
        for(i=2;i<=m;i++)c[i]+=c[i-1];
        //Prefix sum to find rank;
        for(i=n-1;i>=0;i--)sa[--c[x[y[i]]]]=y[i],y[i]=0;
        //make sa with first K char;
//sort with the first key and as it finish the work, sorting with first K char, is finished;

这部分代码的作用就是用结合两个关键字把总的排序搞出来

我们应该做的，就是先根据第一关键字排序，第一关键字相等时根据第二关键字大小排序。

但是看上去，只进行了一次计数排序啊。

还记得这个计数排序的特点：先根据x的值排序，x值相等时根据出现先后次序排序。

x里面存了上次关键字的排序，在本次排序中即是第一关键字的排序，x的值排序==第一关键字排序，这里的计数排序做的是对的。那么第二关键字呢？

前面对第二关键字进行了排序，在这里x[y[i]]就是根据第二关键字的顺序重新改变了第一关键字的顺序，也就是说在本次计数排序中，出现先后次序排序==第二关键字大小排序。

换句话说，我们先单独对第二关键字排序，根据这个顺序改变第一关键字的顺序，由于在计数排序时首先按照第一关键字的值排序，而第一关键字的值没有改变所以首先还是根据第一关键字排序，改变的是第一关键字相同的时候，出现在前面的第二关键字排在前面。

做到这里就完成了第一第二关键字的合并，得到了合并以后的关键字顺序，它可以用于下次迭代。

        swap(x,y);
        //as y is no use now, we can use it to help change x;
        num=1;x[sa[0]]=1;
        for(i=1;i<n;i++)
            if(y[sa[i]]!=y[sa[i-1]]||y[sa[i]+k]!=y[sa[i-1]+k])
                x[sa[i]]=++num;
            else
                x[sa[i]]=num;
        //if the string with sa[i] has the content which isn't the same as the content of the string with sa[i-1];
            //x[sa[i]] is diferent from x[sa[i-1]];
        if(num>=n)break;
        //it means the work is finished that the key has the same number as the string;
        m=num;
        //expand the ton;
//make the x with the total key now as the first key in the next step;
    }

每次新的迭代要用到rank数组x，由于有了刚求的关键字排序数组sa，要得到rank数组也很容易。但是对于相同的值，rank应该相同，所以要判断一下合并以后的关键字是否相同。

这里给出后几次的供参考

完整版:

void build_SA(){
    int i,j,k;
    for(i=1;i<=m;i++)c[i]=0;
    //memset;
    for(i=0;i<n;i++)x[i]=s[i];
    //make x when the sort key is only one char;
    for(i=0;i<n;i++)c[x[i]]++;
    //sort;
    for(i=2;i<=m;i++)c[i]+=c[i-1];
    //Prefix sum to find rank;
    for(i=n-1;i>=0;i--)sa[--c[x[i]]]=i;
    //make sa with the first char in each string;
//sort with the first char;
    for(k=1;k<=n;k<<=1){
        int num=0;
        for(i=n-k;i<n;i++)y[num++]=i;
        //get the string which don't have the NO.k char;
        for(i=0;i<n;i++)if(sa[i]>=k)y[num++]=sa[i]-k;
        //use the stringB which has the head char that is the No.k char in the stringA to sort stringA;
//sort with the second key;
        for(i=1;i<=m;i++)c[i]=0;
        //memset;
        for(i=0;i<=n;i++)c[x[i]]++;
        //sort with the last total sort key which could be the first key now;
        for(i=2;i<=m;i++)c[i]+=c[i-1];
        //Prefix sum to find rank;
        for(i=n-1;i>=0;i--)sa[--c[x[y[i]]]]=y[i],y[i]=0;
        //make sa with first K char;
//sort with the first key and as it finish the work, sorting with first K char, is finished;
        swap(x,y);
        //as y is no use now, we can use it to help change x;
        num=1;x[sa[0]]=1;
        for(i=1;i<n;i++)
            if(y[sa[i]]!=y[sa[i-1]]||y[sa[i]+k]!=y[sa[i-1]+k])
                x[sa[i]]=++num;
            else
                x[sa[i]]=num;
        //if the string with sa[i] has the content which isn't the same as the content of the string with sa[i-1];
            //x[sa[i]] is diferent from x[sa[i-1]];
        if(num>=n)break;
        //it means the work is finished that the key has the same number as the string;
        m=num;
        //expand the ton;
//make the x with the total key now as the first key in the next step;
    }
}

能够线性计算height[]的值的关键在于h[](height[rank[]])的性质，即h[i]>=h[i-1]-1，下面具体分析一下这个不等式的由来。

我们先把要证什么放在这：对于第i个后缀，设j=sa[rank[i] – 1]，也就是说j是i的按排名来的上一个字符串，按定义来i和j的最长公共前缀就是height[rank[i]]，我们现在就是想知道height[rank[i]]至少是多少，而我们要证明的就是至少是height[rank[i-1]]-1。

好啦，现在开始证吧。

首先我们不妨设第i-1个字符串（这里以及后面指的“第?个字符串”不是按字典序排名来的，是按照首字符在字符串中的位置来的）按字典序排名来的前面的那个字符串是第k个字符串，注意k不一定是i-2，因为第k个字符串是按字典序排名来的i-1前面那个，并不是指在原字符串中位置在i-1前面的那个第i-2个字符串。

这时，依据height[]的定义，第k个字符串和第i-1个字符串的公共前缀自然是height[rank[i-1]]，现在先讨论一下第k+1个字符串和第i个字符串的关系。

第一种情况，第k个字符串和第i-1个字符串的首字符不同，那么第k+1个字符串的排名既可能在i的前面，也可能在i的后面，但没有关系，因为height[rank[i-1]]就是0了呀，那么无论height[rank[i]]是多少都会有height[rank[i]]>=height[rank[i-1]]-1，也就是h[i]>=h[i-1]-1。

第二种情况，第k个字符串和第i-1个字符串的首字符相同，那么由于第k+1个字符串就是第k个字符串去掉首字符得到的，第i个字符串也是第i-1个字符串去掉首字符得到的，那么显然第k+1个字符串要排在第i个字符串前面，要么就产生矛盾了。同时，第k个字符串和第i-1个字符串的最长公共前缀是height[rank[i-1]]，那么自然第k+1个字符串和第i个字符串的最长公共前缀就是height[rank[i-1]]-1。

到此为止，第二种情况的证明还没有完，我们可以试想一下，对于比第i个字符串的字典序排名更靠前的那些字符串，谁和第i个字符串的相似度最高（这里说的相似度是指最长公共前缀的长度）？显然是排名紧邻第i个字符串的那个字符串了呀，即sa[rank[i]-1]。也就是说sa[rank[i]]和sa[rank[i]-1]的最长公共前缀至少是height[rank[i-1]]-1，那么就有height[rank[i]]>=height[rank[i-1]]-1，也即h[i]>=h[i-1]-1。

证明完这些之后，下面的代码也就比较容易看懂了。

void build_rank(){
    for(int i=0;i<n;i++)rank[sa[i]]=i;
    //rank[sa[i]]=i;
}
void build_height(){
    int i,j,k=0;
    for(i=0;i<n;i++){
        if(!rank[i])continue;
        if(k)k--;
        //height[rank[i]]>=height[rank[i-1]]-1;
        j=sa[rank[i]-1];
        while(j+k<n&&i+k<n&&s[j+k]==s[i+k])k++;
        //height[rank[i]]means the len of the lcp of the rank[i] and the rank[i]-1
        height[rank[i]]=k;
    }
}

计算每个字符串的字典序排名:

void build_rank(){
    for(int i=0;i<n;i++)rank[sa[i]]=i;
    //rank[sa[i]]=i;
}

上一次的计算结果是k，首先判断一下如果k是0的话，那么k就不用动了，从首字符开始看第i个字符串和第j个字符串前面有多少是相同的，如果k不为0，按我们前面证明的，最长公共前缀的长度至少是k-1，于是从首字符后面k-1个字符开始检查起即可。

        if(k)k--;
        //height[rank[i]]>=height[rank[i-1]]-1;
        j=sa[rank[i]-1];
        while(j+k<n&&i+k<n&&s[j+k]==s[i+k])k++;
        //height[rank[i]]means the len of the lcp of the rank[i] and the rank[i]-1

总代码:

#include<cstdio>
#include<algorithm>
using namespace std;
int  c[10010];//c[sort key]=the number of the strings which have this key（桶）;
int  x[10010];//x[string's NUM]=the rank with the sort key now;
int  y[10010];//y[the rank of the second key]=string's NUM;
char s[10010];//s[the adress in the string]=the char;
int sa[10010];//sa[the rank of the string]=num;
int rank[10010];//rank[the string's num]=rank;
int height[10010];//height[rank]=len of lcp
int n,m;//n:the len of the string,m:the diction_num
void build_SA();
void build_rank();
void build_height();
int main()
{
    int i,j,k;
    scanf("%d%d",&n,&m);
    //input len diction_num;
    scanf("%s",s);
    //input char;
    build_SA();
    for(i=0;i<n;i++){
        for(j=sa[i];j<n;j++)
            printf("%c",s[j]);
        printf("\n");
    }
    build_rank();
    build_height();
}
void build_SA(){
    int i,j,k;
    for(i=1;i<=m;i++)c[i]=0;
    //memset;
    for(i=0;i<n;i++)x[i]=s[i];
    //make x when the sort key is only one char;
    for(i=0;i<n;i++)c[x[i]]++;
    //sort;
    for(i=2;i<=m;i++)c[i]+=c[i-1];
    //Prefix sum to find rank;
    for(i=n-1;i>=0;i--)sa[--c[x[i]]]=i;
    //make sa with the first char in each string;
//sort with the first char;
    for(k=1;k<=n;k<<=1){
        int num=0;
        for(i=n-k;i<n;i++)y[num++]=i;
        //get the string which don't have the NO.k char;
        for(i=0;i<n;i++)if(sa[i]>=k)y[num++]=sa[i]-k;
        //use the stringB which has the head char that is the No.k char in the stringA to sort stringA;
//sort with the second key;
        for(i=1;i<=m;i++)c[i]=0;
        //memset;
        for(i=0;i<=n;i++)c[x[i]]++;
        //sort with the last total sort key which could be the first key now;
        for(i=2;i<=m;i++)c[i]+=c[i-1];
        //Prefix sum to find rank;
        for(i=n-1;i>=0;i--)sa[--c[x[y[i]]]]=y[i],y[i]=0;
        //make sa with first K char;
//sort with the first key and as it finish the work, sorting with first K char, is finished;
        swap(x,y);
        //as y is no use now, we can use it to help change x;
        num=1;x[sa[0]]=1;
        for(i=1;i<n;i++)
            if(y[sa[i]]!=y[sa[i-1]]||y[sa[i]+k]!=y[sa[i-1]+k])
                x[sa[i]]=++num;
            else
                x[sa[i]]=num;
        //if the string with sa[i] has the content which isn't the same as the content of the string with sa[i-1];
            //x[sa[i]] is diferent from x[sa[i-1]];
        if(num>=n)break;
        //it means the work is finished that the key has the same number as the string;
        m=num;
        //expand the ton;
//make the x with the total key now as the first key in the next step;
    }
}
void build_rank(){
    for(int i=0;i<n;i++)rank[sa[i]]=i;
    //rank[sa[i]]=i;
}
void build_height(){
    int i,j,k=0;
    for(i=0;i<n;i++){
        if(!rank[i])continue;
        if(k)k--;
        //height[rank[i]]>=height[rank[i-1]]-1;
        j=sa[rank[i]-1];
        while(j+k<n&&i+k<n&&s[j+k]==s[i+k])k++;
        //height[rank[i]]means the len of the lcp of the rank[i] and the rank[i]-1
        height[rank[i]]=k;
    }
}

？？？

简洁版：

 1 #include<cstdio>
 2 #include<algorithm>
 3 using namespace std;
 4 int  c[10010];
 5 int  x[10010];
 6 int  y[10010];
 7 char s[10010];
 8 int sa[10010];
 9 int rank[10010];
10 int height[10010];
11 int n,m;
12 void build_SA();
13 void build_rank();
14 void build_height();
15 int main()
16 {
17     int i,j,k;
18     scanf("%d%d",&n,&m);
19     scanf("%s",s);
20     build_SA();
21     for(i=0;i<n;i++){
22         for(j=sa[i];j<n;j++)
23             printf("%c",s[j]);
24         printf("\n");
25     }
26     build_rank();
27     build_height();
28 }
29 void build_SA(){
30     int i,j,k;
31     for(i=1;i<=m;i++)c[i]=0;
32     for(i=0;i<n;i++)x[i]=s[i];
33     for(i=0;i<n;i++)c[x[i]]++;
34     for(i=2;i<=m;i++)c[i]+=c[i-1];
35     for(i=n-1;i>=0;i--)sa[--c[x[i]]]=i;
36     for(k=1;k<=n;k<<=1){
37         int num=0;
38         for(i=n-k;i<n;i++)y[num++]=i;
39         for(i=0;i<n;i++)if(sa[i]>=k)y[num++]=sa[i]-k;
40         for(i=1;i<=m;i++)c[i]=0;
41         for(i=0;i<=n;i++)c[x[i]]++;
42         for(i=2;i<=m;i++)c[i]+=c[i-1];
43         for(i=n-1;i>=0;i--)sa[--c[x[y[i]]]]=y[i],y[i]=0;
44         swap(x,y);
45         num=1;x[sa[0]]=1;
46         for(i=1;i<n;i++)
47             if(y[sa[i]]!=y[sa[i-1]]||y[sa[i]+k]!=y[sa[i-1]+k])
48                 x[sa[i]]=++num;
49             else
50                 x[sa[i]]=num;
51         if(num>=n)break;
52         m=num;
53     }
54 }
55 void build_rank(){
56     for(int i=0;i<n;i++)rank[sa[i]]=i;
57 }
58 void build_height(){
59     int i,j,k=0;
60     for(i=0;i<n;i++){
61         if(!rank[i])continue;
62         if(k)k--;
63         j=sa[rank[i]-1];
64         while(j+k<n&&i+k<n&&s[j+k]==s[i+k])k++;
65         height[rank[i]]=k;
66     }
67 }

posted @ 2017-03-26 10:15 F.W.Nietzsche 阅读(5070) 评论(0) 编辑收藏举报

指间灵动，快码加编

刷新页面返回顶部

Der Barde, Nietzsche

一个OIer的blog罢了

[转载]后缀数组算法总结

公告