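Ukkonen's Algorithm for Building Suffix Trees, with an Implementation
Ukkonen's algorithm (ukk for short) is an online algorithm. A notable difference from the mcc algorithm is that at each step it builds an implicit suffix tree for only a prefix S[0...i] of S. It then reads the next character S[i+1] and adds every suffix of S[0...i+1] to the implicit suffix tree produced in the previous phase, which gives a new implicit suffix tree. At the end, a special terminal character turns the implicit suffix tree into a true suffix tree. The main advantage of ukk is therefore that the whole input string need not be known in advance: the suffix tree is built incrementally. Like mcc, it stores the tree as a compressed trie to save space. With two techniques, implicit extensions and suffix links, the running time becomes linear.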
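*****************************Terminology*****************************
To make the algorithm easier to follow, a few terms are introduced first.
Implicit suffix tree: a suffix may end at a leaf node, or it may be hidden inside the tree (at an internal node or in the middle of an edge). If the last character of the input string differs from every other character, then every suffix ends at a leaf node and no suffix is hidden. This is why ukk requires the last character to be a special character.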
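The effect of the terminal character can be checked without building any tree. The following is a minimal brute-force sketch (the class and method names are made up for illustration): a suffix is "hidden" exactly when it also occurs earlier in the string, because it is then a prefix of a longer suffix.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenSuffixDemo {
    /** Returns the suffixes of s that are prefixes of a longer suffix of s. */
    static List<String> hiddenSuffixes(String s) {
        List<String> hidden = new ArrayList<>();
        for (int j = 0; j < s.length(); j++) {
            String suffix = s.substring(j);
            // If the suffix first occurs before position j, a longer suffix
            // starts with it, so it cannot end at a leaf of its own.
            if (s.indexOf(suffix) < j) {
                hidden.add(suffix);
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        System.out.println(hiddenSuffixes("xbxb"));   // prints [xb, b]
        System.out.println(hiddenSuffixes("xbxb^"));  // prints []
    }
}
```

Without the terminal character, "xb" and "b" are hidden inside the tree of xbxb; appending ^ forces every suffix to end at its own leaf.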
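Phase: in phase i+1, the character S[i+1] is taken into account and all suffixes of S[0...i+1] are added to the implicit suffix tree produced by phase i, forming a new implicit suffix tree.
Extension: within each phase, every suffix has to be added to the implicit suffix tree of the previous phase; adding one suffix is called extension j. For example, when building the suffix tree of mississippi$, phase i+1=4 adds the suffixes missi, issi, ssi, si and i to the implicit suffix tree produced by phase 3. In other words, this phase has 5 extensions, which add the substrings S[j...4] for j in [0...4].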
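The strings handled by one phase can be listed directly. This small sketch (names are illustrative, not part of the implementation further down) enumerates S[j...phase] for every extension j of a phase:

```java
import java.util.ArrayList;
import java.util.List;

final class PhaseExtensions {
    /** Returns S[j..phase] for j = 0..phase, i.e. what each extension of the phase adds. */
    static List<String> extensions(String s, int phase) {
        List<String> added = new ArrayList<>();
        for (int j = 0; j <= phase; j++) {
            added.add(s.substring(j, phase + 1)); // end index is exclusive
        }
        return added;
    }

    public static void main(String[] args) {
        System.out.println(extensions("mississippi$", 4)); // prints [missi, issi, ssi, si, i]
    }
}
```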
Suffix link: see the earlier post on the mcc algorithm (note: only internal nodes have suffix links; leaf nodes do not need one).
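For readers who do not have the earlier post at hand: in the usual definition, the suffix link of an internal node whose path label is xα (x a single character) points to the node whose path label is α. The brute-force sketch below is only a sanity check of that definition, not part of ukk. It assumes the string ends with a unique terminal character, so that the internal nodes correspond to the substrings followed by at least two different characters, and it verifies that every link target exists. All names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SuffixLinkCheck {
    /** Non-empty substrings of s that are followed by at least two distinct characters. */
    static Set<String> branchingSubstrings(String s) {
        Map<String, Set<Character>> followers = new HashMap<>();
        for (int a = 0; a < s.length(); a++) {
            for (int b = a + 1; b < s.length(); b++) {
                // s.substring(a, b) is followed by the character s.charAt(b)
                followers.computeIfAbsent(s.substring(a, b), k -> new HashSet<>())
                         .add(s.charAt(b));
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<Character>> e : followers.entrySet()) {
            if (e.getValue().size() >= 2) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** True if dropping the first character of every label gives the root or another label. */
    static boolean everyLinkTargetExists(String s) {
        Set<String> nodes = branchingSubstrings(s);
        for (String label : nodes) {
            String target = label.substring(1);
            if (!target.isEmpty() && !nodes.contains(target)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(branchingSubstrings("xbxb^"));          // prints [b, xb]
        System.out.println(everyLinkTargetExists("mississippi^")); // prints true
    }
}
```

For xbxb^ the two internal nodes are "xb" and "b": the link of "xb" points to "b", and the link of "b" points to the root (empty label).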
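*****************************Rules 1, 2, 3*****************************
While building the implicit suffix tree, each extension is handled according to one of three rules (the description below assumes phase i+1):
Rule 1) If the path S[j...i] ends at a leaf node, simply append S[i+1] to the end of that leaf's edge.
Rule 2) If the path S[j...i] does not end at a leaf node, and the next character c on that path in the current implicit suffix tree is not equal to S[i+1] (no continuation of the path starts with S[i+1]), a new leaf node must be created, preceded by a new internal node if the path ends in the middle of an edge. This rule also covers the case where a leaf node is created directly under the root node.
Rule 3) If, in the situation of rule 2, the next character c equals S[i+1], nothing needs to be done, because the suffix S[j...i+1] already exists in the implicit suffix tree.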
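The three rules can be illustrated without a tree, using plain string searches. This is a brute-force sketch, not how ukk decides. It relies on one observation that is an assumption of the sketch rather than a statement from the text above: the path S[j...i] ends at a leaf exactly when S[j...i] occurs only once in S[0...i]. The class and method names are illustrative.

```java
final class RuleClassifier {
    /** Returns 1, 2 or 3: the rule used by extension j of phase i+1 on string s. */
    static int rule(String s, int i, int j) {
        String prefix = s.substring(0, i + 1); // S[0..i], already in the tree
        String beta = s.substring(j, i + 1);   // S[j..i]; empty when j == i+1 (the root)
        char next = s.charAt(i + 1);           // S[i+1], the new character
        // beta is a suffix of prefix; if its first occurrence is at j, it occurs only once.
        boolean endsAtLeaf = !beta.isEmpty() && prefix.indexOf(beta) == j;
        if (endsAtLeaf) {
            return 1;                          // extend the leaf edge
        }
        // Otherwise: rule 3 if beta followed by next is already present, else rule 2.
        return prefix.contains(beta + next) ? 3 : 2;
    }

    /** Rules of extensions 0..phase of the given phase (phase >= 1), as a digit string. */
    static String rulesOfPhase(String s, int phase) {
        StringBuilder rules = new StringBuilder();
        for (int j = 0; j <= phase; j++) {
            rules.append(rule(s, phase - 1, j));
        }
        return rules.toString();
    }

    public static void main(String[] args) {
        String s = "xbxb^";
        for (int phase = 1; phase < s.length(); phase++) {
            System.out.println("phase " + phase + ": " + rulesOfPhase(s, phase));
        }
        // prints: phase 1: 12, phase 2: 113, phase 3: 1133, phase 4: 11222 (one per line)
    }
}
```

In every phase the rules appear in the order 1...1, 2...2, 3...3, and once rule 3 shows up it holds for all remaining extensions of that phase.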
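*****************************Why ukk achieves O(n)*****************************
Without implicit extensions and suffix links, the complexity of ukk is actually O(n^3), which is worse than the brute-force method (O(n^2) when the suffixes are inserted one by one) and clearly unacceptable. Thanks to implicit extensions and the compressed trie structure, nodes that are already leaves can be handled cheaply during incremental processing (rule 1). In each extension that falls under rule 2 or rule 3, the suffix link lets the algorithm jump directly to the relevant node instead of searching from the root node. In addition, once rule 3 applies in some extension, the whole phase can end early (the "show-stopper" trick): if S[j...i+1] is already in the tree, then so are all of its shorter suffixes S[j+1...i+1], S[j+2...i+1], and so on, which means the remaining extensions would also apply rule 3. These two techniques make the amortized complexity of ukk linear.
Leaf nodes can only be created by rule 2, and a leaf node stays a leaf forever ("once a leaf, always a leaf"). So any leaf created by rule 2 in the current phase joins the set of leaves for the next phase, where rule 1 (an implicit extension) applies to it instead of rule 2. Let the variable j_star be the number of the last extension in the current phase that created a leaf by rule 2; in the next phase, the extensions that may need rule 2 or rule 3 start at j_star+1. Since all leaf nodes are handled by implicit extensions, that part of the work is trivial. Only the explicit extensions need fastscan and slowscan, as in the mcc algorithm, to locate u and v (and to set the suffix link from the previous v to the current v). Rule 3 is applied at most once per phase.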
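The bookkeeping with j_star can be simulated to see how few explicit extensions are performed. The sketch below is illustrative only: it decides between rule 2 and rule 3 with a brute-force substring search (so the sketch itself is not linear), and the names are made up. Each phase starts at j_star+1 and stops at the first rule 3.

```java
final class ExplicitWorkCounter {
    /** Returns {number of explicit extensions performed, number of leaves created}. */
    static int[] count(String s) {
        int explicit = 0;
        int jStar = 0; // phase 0 created the leaf for extension 0
        for (int i = 0; i + 1 < s.length(); i++) {      // this iteration is phase i+1
            String prefix = s.substring(0, i + 1);      // S[0..i]
            // Extensions 0..jStar end at leaves: rule 1, handled implicitly.
            for (int j = jStar + 1; j <= i + 1; j++) {
                explicit++;
                if (prefix.contains(s.substring(j, i + 2))) {
                    break;                              // rule 3: stop this phase
                }
                jStar = j;                              // rule 2: extension j created a leaf
            }
        }
        return new int[] { explicit, jStar + 1 };
    }

    public static void main(String[] args) {
        int[] r = count("xbxb^");
        System.out.println(r[0] + " explicit extensions, " + r[1] + " leaves");
        // prints: 6 explicit extensions, 5 leaves
    }
}
```

For a string of length n with a unique last character, rule 2 fires n-1 times after the first leaf and rule 3 at most once in each of the n-1 phases, so the number of explicit extensions stays below 2n. A full enumeration of every extension in every phase would be quadratic.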
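In the implementation, note that a suffix link is never established across two adjacent phases. If the last extension of the previous phase used rule 3, then the v at the end of that phase is a node that already existed, so it must already have a suffix link. If it used rule 2, it must have been extension j=i+1, so the v at the end of that phase must be the root node, because the last character can only be added as a leaf node under the root, and the suffix link of the root node is defined to always point to itself. The next phase can therefore start directly from the u left by the previous phase and ignore the previous v (there is no need to give it a suffix link).
Implementation of suffix tree construction with Ukkonen's algorithm: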
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/**
 *
 * Build Suffix Tree using Ukkonen Algorithm
 *
 * Copyright (c) 2011 ljs (http://blog.csdn.net/ljsspace/)
 * Licensed under GPL (http://www.opensource.org/licenses/gpl-license.php)
 *
 * @author ljs
 * 2011-07-10
 *
 */
public class Ukkonen {
    private class SuffixNode {
        private StringBuilder sb;
        private List<SuffixNode> children = new LinkedList<SuffixNode>();
        private SuffixNode link;
        private int start;
        private int end;
        private int pathlen;

        public SuffixNode(StringBuilder sb,int start,int end,int pathlen){
            this.sb = sb;
            this.start = start;
            this.end = end;
            this.pathlen = pathlen;
        }
        public SuffixNode(StringBuilder sb){
            this.sb = sb;
            this.start = -1;
            this.end = -1;
            this.pathlen = 0;
        }
        public int getLength(){
            if(start == -1) return 0;
            else return end - start + 1;
        }
        public String getString(){
            if(start != -1){
                return this.sb.substring(start,end+1);
            }else{
                return "";
            }
        }
        public boolean isRoot(){
            return start == -1;
        }
        public String getCoordinate(){
            return "[" + start+".." + end + "/" + this.pathlen + "]";
        }
        public String toString(){
            return getString() + "(" + getCoordinate()
                + ",link:" + ((this.link==null)?"N/A":this.link.getCoordinate())
                + ",children:" + children.size() +")";
        }
    }

    private class State{
        private SuffixNode u; //parent(v)
        //private SuffixNode w;
        private SuffixNode v;
        //private int k; //the global index of text starting from 0 to text.length()
        //private boolean finished;
    }

    private SuffixNode root;
    private StringBuilder sb = new StringBuilder();

    public Ukkonen(){
    }

    //build a suffix-tree for a string of text
    public void buildSuffixTree(String text) throws Exception{
        int m = text.length();
        if(m==0)
            return;

        if(root==null){
            root = new SuffixNode(sb);
            root.link = root; //link to itself
        }

        List<SuffixNode> leaves = new ArrayList<SuffixNode>();

        //add first node
        sb.append(text.charAt(0));
        SuffixNode node = new SuffixNode(sb,0,0,1);
        leaves.add(node);
        root.children.add(node);
        int j_star = 0; //j_{i-1}
        SuffixNode u = root;
        SuffixNode v = root;
        for(int i=1;i<=m-1;i++){
            //do phase i
            sb.append(text.charAt(i));

            //step 1: do implicit extensions
            for(SuffixNode leafnode:leaves){
                leafnode.end++;
                leafnode.pathlen++;
            }

            //step 2: do explicit extensions until rule #3 is applied
            State state = new State();

            //for the first explicit extension, we reuse the last phase's u and do slowscan
            //also note: suffix link doesn't span two phases.
            int j=j_star+1;
            SuffixNode s = u;
            int k = s.pathlen + j;
            state.u = s;
            state.v = s;
            SuffixNode newleaf = slowscan(state,s,k);
            if(newleaf == null){
                //if rule #3 is applied, then we can terminate this phase
                j_star = j - 1;
                //Note: no need to update state.v because it is not going to be used
                //at the next phase
                u = state.u;
                continue;
            }else{
                j_star = j;
                leaves.add(newleaf);

                u = state.u;
                v = state.v;
            }
            j++;

            //for other explicit extensions, we start with fast scan.
            for(;j<=i;j++){
                s = u.link;

                int uvLen=v.pathlen - u.pathlen;
                if(u.isRoot() && !v.isRoot()){
                    uvLen--;
                }
                //starting with index k of the text
                k = s.pathlen + j;

                //init state
                state.u = s;
                state.v = s; //if uvLen = 0

                //execute fast scan
                newleaf = fastscan(state,s,uvLen,k);
                //establish the suffix link with v
                v.link = state.v;

                if(newleaf == null){
                    //if rule #3 is applied, then we can terminate this phase
                    j_star = j - 1;
                    u = state.u;
                    break;
                }else{
                    j_star = j;
                    leaves.add(newleaf);

                    u = state.u;
                    v = state.v;
                }
            }
        }
    }

    //slow scan from currNode until state.v is found
    //return the new leaf if a new one is created right after v;
    //return null otherwise (i.e. when rule #3 is applied)
    private SuffixNode slowscan(State state,SuffixNode currNode,int k){
        SuffixNode newleaf = null;

        boolean done = false;
        int keyLen = sb.length() - k;
        for(int i=0;i<currNode.children.size();i++){
            SuffixNode child = currNode.children.get(i);

            //use min(child.key.length, key.length)
            int childKeyLen = child.getLength();
            int len = childKeyLen<keyLen?childKeyLen:keyLen;
            int delta = 0;
            for(;delta<len;delta++){
                if(sb.charAt(k+delta) != sb.charAt(child.start+delta)){
                    break;
                }
            }
            if(delta==0){//this child doesn't match any character with the new key
                //order keys by lexi-order
                if(sb.charAt(k) < sb.charAt(child.start)){
                    //e.g. child="e" (currNode="abc")
                    //     abc                     abc
                    //    /  \     =========>     / | \
                    //   e    f    insert "c"    c  e  f
                    int pathlen = sb.length() - k + currNode.pathlen;
                    SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
                    currNode.children.add(i,node);
                    //state.u = currNode; //currNode is already registered as state.u, so commented out
                    state.v = currNode;
                    newleaf = node;
                    done = true;
                    break;
                }else{ //key.charAt(0)>child.key.charAt(0)
                    //don't forget to add the largest new key after iterating all children
                    continue;
                }
            }else{//current child's key partially matches with the new key
                if(delta==len){
                    if(keyLen==childKeyLen){
                        //e.g. child="ab"
                        //     ab                      ab
                        //    /  \     =========>     /  \
                        //   e    f    insert "ab"   e    f
                        //terminate this phase (implicit tree with rule #3)
                        state.u = child;
                        state.v = currNode;
                    }else if(keyLen>childKeyLen){
                        //TODO: still need an example to test this condition
                        //e.g. child="ab"
                        //     ab                      ab
                        //    /  \     ==========>    / | \
                        //   e    f    insert "abc"  c  e  f
                        //recursion
                        state.u = child;
                        state.v = child;
                        k += childKeyLen;
                        //state.k = k;
                        newleaf = slowscan(state,child,k);
                    } else{ //keyLen<childKeyLen
                        //e.g. child="abc"
                        //     abc                     abc
                        //    /  \     =========>     /  \
                        //   e    f    insert "ab"   e    f
                        //
                        //terminate this phase (implicit tree with rule #3)
                        //state.u = currNode;
                        state.v = currNode;
                    }
                }else{//0<delta<len
                    //e.g. child="abc"
                    //     abc                     ab
                    //    /  \     ==========>    /  \
                    //   e    f    insert "abd"  c    d
                    //                          / \
                    //                         e   f
                    //insert the new node: ab
                    int nodepathlen = child.pathlen - (child.getLength()-delta);
                    SuffixNode node = new SuffixNode(sb,
                            child.start,child.start + delta - 1,nodepathlen);
                    node.children = new LinkedList<SuffixNode>();

                    int leafpathlen = (sb.length() - (k + delta)) + nodepathlen;
                    SuffixNode leaf = new SuffixNode(sb,
                            k+delta,sb.length()-1,leafpathlen);

                    //update child node: c
                    child.start += delta;
                    if(sb.charAt(k+delta)<sb.charAt(child.start)){
                        node.children.add(leaf);
                        node.children.add(child);
                    }else{
                        node.children.add(child);
                        node.children.add(leaf);
                    }
                    //update parent
                    currNode.children.set(i, node);

                    //state.u = currNode; //currNode is already registered as state.u, so commented out
                    state.v = node;
                    newleaf = leaf;
                }
                done = true;
                break;
            }
        }
        if(!done){
            int pathlen = sb.length() - k + currNode.pathlen;
            SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
            currNode.children.add(node);
            //state.u = currNode; //currNode is already registered as state.u, so commented out
            state.v = currNode;
            newleaf = node;
        }

        return newleaf;
    }

    //fast scan until state.v is found;
    //return the new leaf if a new one is created right after v;
    //return null otherwise (i.e. when rule #3 is applied)
    private SuffixNode fastscan(State state,SuffixNode currNode,int uvLen,int k){
        if(uvLen==0){
            //state.u = currNode; //currNode is already registered as state.u, so commented out
            //continue with slow scan
            return slowscan(state,currNode,k);
        }

        SuffixNode newleaf = null;
        boolean done = false;
        for(int i=0;i<currNode.children.size();i++){
            SuffixNode child = currNode.children.get(i);

            if(sb.charAt(child.start) == sb.charAt(k)){
                int len = child.getLength();
                if(uvLen==len){
                    //then we find v
                    //uvLen = 0;
                    state.u = child;
                    //state.v = child;
                    k += len;
                    //state.k = k;

                    //continue with slow scan
                    newleaf = slowscan(state,child,k);
                }else if(uvLen<len){
                    //we know v must be an internal node; branching and cut child short
                    //e.g. child="abc",uvLen = 2
                    //     abc                          ab
                    //    /  \    ================>    /  \
                    //   e    f   suffix part: "abd"  c    d
                    //                               / \
                    //                              e   f
                    //insert the new node: ab; child is now c
                    int nodepathlen = child.pathlen - (child.getLength()-uvLen);
                    SuffixNode node = new SuffixNode(sb,
                            child.start,child.start + uvLen - 1,nodepathlen);
                    node.children = new LinkedList<SuffixNode>();

                    int leafpathlen = (sb.length() - (k + uvLen)) + nodepathlen;
                    SuffixNode leaf = new SuffixNode(sb,
                            k+uvLen,sb.length()-1,leafpathlen);

                    //update child node: c
                    child.start += uvLen;
                    if(sb.charAt(k+uvLen)<sb.charAt(child.start)){
                        node.children.add(leaf);
                        node.children.add(child);
                    }else{
                        node.children.add(child);
                        node.children.add(leaf);
                    }
                    //update parent
                    currNode.children.set(i, node);

                    //uvLen = 0;
                    //state.u = currNode; //currNode is already registered as state.u, so commented out
                    state.v = node;
                    newleaf = leaf;
                }else{//uvLen>len
                    //e.g. child="abc", uvLen = 4
                    //     abc
                    //    /  \    ================>
                    //   e    f   suffix part: "abcde"
                    //
                    //
                    //jump to next node
                    uvLen -= len;
                    state.u = child;
                    //state.v = child;
                    k += len;
                    //state.k = k;
                    newleaf = fastscan(state,child,uvLen,k);
                }
                done = true;
                break;
            }
        }
        if(!done){
            //TODO: still need an example to test this condition
            //add a leaf under the currNode
            int pathlen = sb.length() - k + currNode.pathlen;
            SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
            currNode.children.add(node);
            //state.u = currNode; //currNode is already registered as state.u, so commented out
            state.v = currNode;
            newleaf = node;
        }

        return newleaf;
    }

    //for test purpose only
    public void printTree(){
        System.out.format("The suffix tree for S = %s is: %n",this.sb);
        this.print(0, this.root);
    }
    private void print(int level, SuffixNode node){
        for (int i = 0; i < level; i++) {
            System.out.format(" ");
        }
        System.out.format("|");
        for (int i = 0; i < level; i++) {
            System.out.format("-");
        }
        //System.out.format("%s(%d..%d/%d)%n", node.getString(),node.start,node.end,node.pathlen);
        System.out.format("(%d,%d)%n", node.start,node.end);
        for (SuffixNode child : node.children) {
            print(level + 1, child);
        }
    }

    public static void main(String[] args) throws Exception {
        //test suffix-tree
        System.out.println("****************************");
        String text = "xbxb^"; //the last char must be unique!
        Ukkonen stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();

        System.out.println("****************************");
        text = "mississippi^";
        stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();

        System.out.println("****************************");
        text = "GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^";
        stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();

        System.out.println("****************************");
        text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ^";
        stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();

        System.out.println("****************************");
        text = "AAAAAAAAAAAAAAAAAAAAAAAAAA^";
        stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();

        System.out.println("****************************");
        text = "minimize"; //the last char e is different from other chars, so it is ok.
        stree = new Ukkonen();
        stree.buildSuffixTree(text);
        stree.printTree();
    }
}
Test output:
****************************
The suffix tree for S = xbxb^ is:
|(-1,-1)
|-(4,4)
|-(1,1)
|--(4,4)
|--(2,4)
|-(0,1)
|--(4,4)
|--(2,4)
****************************
The suffix tree for S = mississippi^ is:
|(-1,-1)
|-(11,11)
|-(1,1)
|--(11,11)
|--(8,11)
|--(2,4)
|---(8,11)
|---(5,11)
|-(0,11)
|-(8,8)
|--(10,11)
|--(9,11)
|-(2,2)
|--(4,4)
|---(8,11)
|---(5,11)
|--(3,4)
|---(8,11)
|---(5,11)
****************************
The suffix tree for S = GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^ is:
|(-1,-1)
|-(15,15)
|--(16,16)
|---(17,17)
|----(18,18)
|-----(35,35)
|------(36,36)
|-------(37,37)
|--------(38,38)
|---------(39,39)
|----------(40,40)
|-----------(41,41)
|------------(42,42)
|-------------(43,43)
|--------------(44,44)
|---------------(45,45)
|----------------(46,46)
|-----------------(47,47)
|------------------(48,48)
|-------------------(49,49)
|--------------------(50,50)
|---------------------(51,51)
|----------------------(52,53)
|----------------------(53,53)
|---------------------(53,53)
|--------------------(53,53)
|-------------------(53,53)
|------------------(53,53)
|-----------------(53,53)
|----------------(53,53)
|---------------(53,53)
|--------------(53,53)
|-------------(53,53)
|------------(53,53)
|-----------(53,53)
|----------(53,53)
|---------(53,53)
|--------(53,53)
|-------(53,53)
|------(53,53)
|-----(19,53)
|-----(53,53)
|----(19,53)
|----(53,53)
|---(19,53)
|---(53,53)
|--(19,19)
|---(27,27)
|----(32,53)
|----(28,29)
|-----(32,53)
|-----(30,53)
|---(20,20)
|----(25,53)
|----(21,53)
|--(53,53)
|-(12,12)
|--(15,15)
|---(16,53)
|---(26,53)
|--(13,13)
|---(22,53)
|---(14,53)
|-(0,0)
|--(22,22)
|---(32,53)
|---(23,23)
|----(29,29)
|-----(32,53)
|-----(30,53)
|----(24,53)
|--(12,12)
|---(15,15)
|----(16,53)
|----(26,53)
|---(13,13)
|----(22,53)
|----(14,53)
|--(1,1)
|---(12,53)
|---(2,2)
|----(12,53)
|----(3,3)
|-----(12,53)
|-----(4,4)
|------(12,53)
|------(5,5)
|-------(12,53)
|-------(6,6)
|--------(12,53)
|--------(7,7)
|---------(12,53)
|---------(8,8)
|----------(12,53)
|----------(9,9)
|-----------(12,53)
|-----------(10,10)
|------------(12,53)
|------------(11,53)
|-(53,53)
****************************
The suffix tree for S = ABCDEFGHIJKLMNOPQRSTUVWXYZ^ is:
|(-1,-1)
|-(0,26)
|-(1,26)
|-(2,26)
|-(3,26)
|-(4,26)
|-(5,26)
|-(6,26)
|-(7,26)
|-(8,26)
|-(9,26)
|-(10,26)
|-(11,26)
|-(12,26)
|-(13,26)
|-(14,26)
|-(15,26)
|-(16,26)
|-(17,26)
|-(18,26)
|-(19,26)
|-(20,26)
|-(21,26)
|-(22,26)
|-(23,26)
|-(24,26)
|-(25,26)
|-(26,26)
****************************
The suffix tree for S = AAAAAAAAAAAAAAAAAAAAAAAAAA^ is:
|(-1,-1)
|-(0,0)
|--(1,1)
|---(2,2)
|----(3,3)
|-----(4,4)
|------(5,5)
|-------(6,6)
|--------(7,7)
|---------(8,8)
|----------(9,9)
|-----------(10,10)
|------------(11,11)
|-------------(12,12)
|--------------(13,13)
|---------------(14,14)
|----------------(15,15)
|-----------------(16,16)
|------------------(17,17)
|-------------------(18,18)
|--------------------(19,19)
|---------------------(20,20)
|----------------------(21,21)
|-----------------------(22,22)
|------------------------(23,23)
|-------------------------(24,24)
|--------------------------(25,26)
|--------------------------(26,26)
|-------------------------(26,26)
|------------------------(26,26)
|-----------------------(26,26)
|----------------------(26,26)
|---------------------(26,26)
|--------------------(26,26)
|-------------------(26,26)
|------------------(26,26)
|-----------------(26,26)
|----------------(26,26)
|---------------(26,26)
|--------------(26,26)
|-------------(26,26)
|------------(26,26)
|-----------(26,26)
|----------(26,26)
|---------(26,26)
|--------(26,26)
|-------(26,26)
|------(26,26)
|-----(26,26)
|----(26,26)
|---(26,26)
|--(26,26)
|-(26,26)
****************************
The suffix tree for S = minimize is:
|(-1,-1)
|-(7,7)
|-(1,1)
|--(4,7)
|--(2,7)
|--(6,7)
|-(0,1)
|--(2,7)
|--(6,7)
|-(2,7)
|-(6,7)
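The dumps can be read back into suffixes as a quick sanity check. The sketch below is a hypothetical helper, not part of the original program. It assumes the dump is available one node per line in the form that printTree emits: optional leading spaces, a "|", one "-" per tree level, then "(start,end)". It rebuilds the path label of every leaf from the edge coordinates.

```java
import java.util.ArrayList;
import java.util.List;

final class TreeDumpReader {
    /** Returns the path labels of all leaves, in the order they appear in the dump. */
    static List<String> leafPaths(String text, String dump) {
        List<Integer> depths = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int bar = line.indexOf('|');
            int open = line.indexOf('(');
            int comma = line.indexOf(',', open);
            int close = line.indexOf(')', comma);
            int start = Integer.parseInt(line.substring(open + 1, comma));
            int end = Integer.parseInt(line.substring(comma + 1, close));
            depths.add(open - bar - 1);                  // dashes between '|' and '('
            labels.add(start < 0 ? "" : text.substring(start, end + 1)); // root is (-1,-1)
        }
        List<String> result = new ArrayList<>();
        String[] pathAtDepth = new String[depths.size() + 1];
        for (int n = 0; n < depths.size(); n++) {
            int d = depths.get(n);
            pathAtDepth[d] = (d == 0 ? "" : pathAtDepth[d - 1]) + labels.get(n);
            // A node is a leaf if the next line is not one level deeper.
            boolean isLeaf = n + 1 == depths.size() || depths.get(n + 1) <= d;
            if (isLeaf) {
                result.add(pathAtDepth[d]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "|(-1,-1)", "|-(4,4)", "|-(1,1)", "|--(4,4)", "|--(2,4)",
            "|-(0,1)", "|--(4,4)", "|--(2,4)");
        System.out.println(leafPaths("xbxb^", dump)); // prints [^, b^, bxb^, xb^, xbxb^]
    }
}
```

For xbxb^ the five leaves spell exactly the five suffixes of the string, which is what a complete suffix tree should contain.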