Ukkonen's Algorithm for Building a Suffix Tree, with an Implementation



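Ukkonen's algorithm (ukk for short) is an online algorithm. A notable difference from the mcc algorithm is that at each step it builds an implicit suffix tree for only a prefix of S. It then takes the next character S[i+1] and adds all suffixes of S[0...i+1] to the implicit suffix tree produced in the previous phase, which yields a new implicit suffix tree. At the end, a special terminal character turns the implicit suffix tree into a true suffix tree. The main advantage of ukk is therefore that it does not need to know the whole input string in advance: the suffix tree is built incrementally. Like mcc, it stores the tree as a compressed trie to save space. With two techniques, implicit extensions and suffix links, the running time becomes linear.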

*****************************Terminology*****************************
A few terms are introduced first to make the algorithm easier to follow.

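Implicit suffix tree: a suffix may end at a leaf, or it may be hidden inside the tree (ending at an internal node or in the middle of an edge). If the last character of the input string differs from every other character, all suffixes end at leaves and none is hidden inside the tree. This is why the last character in ukk must be a special character.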
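The effect of the terminal character can be checked directly on the strings, without building a tree. The sketch below is only an illustration: the class `HiddenSuffixDemo` and the method `hiddenSuffixes` are names made up for this example, and the check is a brute-force string comparison, not part of ukk. A suffix is "hidden" exactly when it is a proper prefix of a longer suffix.

```java
import java.util.ArrayList;
import java.util.List;

class HiddenSuffixDemo {
    // Returns every suffix of s that is a proper prefix of a longer suffix of s.
    // Such a suffix cannot end at a leaf, so it stays hidden in an implicit tree.
    static List<String> hiddenSuffixes(String s) {
        List<String> hidden = new ArrayList<>();
        for (int j = 0; j < s.length(); j++) {
            String suffix = s.substring(j);
            for (int k = 0; k < j; k++) {          // suffixes starting earlier are longer
                if (s.substring(k).startsWith(suffix)) {
                    hidden.add(suffix);
                    break;
                }
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        System.out.println(hiddenSuffixes("xbxb"));  // prints [xb, b]
        System.out.println(hiddenSuffixes("xbxb^")); // prints []
    }
}
```

Without the terminator, the suffixes xb and b are swallowed by xbxb and bxb; with the unique ^ appended, no suffix is a prefix of another.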

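Phase: in phase i+1, the character S[i+1] is taken into account and all suffixes of S[0...i+1] are added to the implicit suffix tree produced by phase i, which yields a new implicit suffix tree.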

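Extension: within a phase, every suffix has to be added to the implicit suffix tree of the previous phase; adding one suffix is called extension j. For example, when building the suffix tree of mississippi$, phase i+1=4 adds the suffixes missi, issi, ssi, si and i to the implicit suffix tree produced by phase 3. In other words, this phase has 5 extensions, each adding the substring S[j...4] to the tree from phase 3, with j ranging over [0...4].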
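The list of extensions in a phase is just the list of suffixes of the current prefix. A minimal sketch follows; `PhaseExtensions` and `extensionsOfPhase` are hypothetical names used only here, and indices are 0-based as in the text.

```java
import java.util.ArrayList;
import java.util.List;

class PhaseExtensions {
    // Returns the substrings S[j..last] for j = 0..last, in extension order.
    static List<String> extensionsOfPhase(String s, int last) {
        List<String> result = new ArrayList<>();
        for (int j = 0; j <= last; j++) {
            result.add(s.substring(j, last + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Phase 4 of mississippi$ works on the prefix S[0..4] = "missi".
        System.out.println(extensionsOfPhase("mississippi$", 4));
        // prints [missi, issi, ssi, si, i]
    }
}
```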

Suffix link: see the earlier post on the mcc algorithm (note: only internal nodes have a suffix link; leaves do not need one).
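As a brief reminder of what a suffix link connects: an internal node whose path label is xα (x a single character) links to the node whose path label is α. The brute-force sketch below is only an illustration under one assumption, namely that the string ends with a unique terminator, so that internal nodes correspond exactly to substrings followed by at least two different characters. `SuffixLinkDemo`, `isBranching` and `suffixLinks` are names invented for this example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SuffixLinkDemo {
    // True if sub is followed by at least two distinct characters in s,
    // i.e. sub is the path label of an internal node (s must end with a unique terminator).
    static boolean isBranching(String s, String sub) {
        Set<Character> followers = new HashSet<>();
        for (int p = s.indexOf(sub); p >= 0; p = s.indexOf(sub, p + 1)) {
            int next = p + sub.length();
            if (next < s.length()) {
                followers.add(s.charAt(next));
            }
        }
        return followers.size() >= 2;
    }

    // Maps each internal node's path label to the label its suffix link points to
    // ("" stands for root).
    static Map<String, String> suffixLinks(String s) {
        Map<String, String> links = new TreeMap<>();
        for (int a = 0; a < s.length(); a++) {
            for (int b = a + 1; b <= s.length(); b++) {
                String sub = s.substring(a, b);
                if (isBranching(s, sub)) {
                    links.put(sub, sub.substring(1)); // drop the first character
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(suffixLinks("xbxb^")); // prints {b=, xb=b}
    }
}
```

Every link target is either root or itself an internal node, which is why following a link always lands on an existing node.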

*****************************Rules 1, 2 and 3*****************************

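While the implicit suffix tree is being built, every extension is handled by one of three rules (the description below assumes phase i+1):
Rule 1) If S[j...i] ends at a leaf, append S[i+1] directly to the end of S[j...i], that is, to the label of that leaf's edge.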

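Rule 2) If S[j...i] does not end at a leaf, and no continuation of that path in the current implicit suffix tree starts with S[i+1] (the next character c on the path is not S[i+1]), a new node has to be created. This rule also covers the case where a leaf is created directly under root.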

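Rule 3) If, in the situation of rule 2, the next character c equals S[i+1], nothing needs to be done, because the current suffix S[j...i+1] is already present in the implicit suffix tree.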
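To see the three rules in action, the following sketch traces which rule each extension would use. It is deliberately naive and is not the implementation discussed in this post: it uses an uncompressed trie (one node per character), walks from root for every extension, and keeps going after rule 3 instead of ending the phase. The class `NaiveRuleTrace`, its `Node` type and the `trace` method are hypothetical names for this illustration; phases are numbered from 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NaiveRuleTrace {
    // One node per character: an uncompressed suffix trie.
    static class Node {
        final Map<Character, Node> next = new HashMap<>();
    }

    // Returns one string per phase; character j of that string is the rule
    // (1, 2 or 3) applied by extension j.
    static List<String> trace(String s) {
        Node root = new Node();
        List<String> log = new ArrayList<>();
        for (int phase = 0; phase < s.length(); phase++) {
            char c = s.charAt(phase);               // the new character S[i+1]
            StringBuilder rules = new StringBuilder();
            for (int j = 0; j <= phase; j++) {
                // Walk S[j..i] from root; the previous phase guarantees it exists.
                Node cur = root;
                for (int p = j; p < phase; p++) {
                    cur = cur.next.get(s.charAt(p));
                }
                if (cur != root && cur.next.isEmpty()) {
                    cur.next.put(c, new Node());    // rule 1: the path ends at a leaf
                    rules.append('1');
                } else if (!cur.next.containsKey(c)) {
                    cur.next.put(c, new Node());    // rule 2: branch off a new leaf
                    rules.append('2');
                } else {
                    rules.append('3');              // rule 3: already in the tree
                }
            }
            log.add(rules.toString());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace("xbxb^")); // prints [2, 12, 113, 1133, 11222]
    }
}
```

In every phase the rules appear in the order 1...1, 2...2, 3...3: a run of leaf extensions, then new leaves, then suffixes that are already present.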

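*****************************Why ukk can reach O(n)*****************************
Without implicit extensions and suffix links, the complexity of ukk would actually be O(n^3), which is worse than the brute-force approach and clearly unacceptable. Thanks to implicit extensions and the compressed trie structure, nodes that are already leaves can be handled in a simplified way during incremental processing (rule 1). In each extension that falls under rule 2 or rule 3, a suffix link lets the algorithm jump directly to a node instead of searching from root. In addition, once rule 3 applies, the whole phase can end early (the "show-stopper" trick). This follows from the structure of the suffix tree: if S[j...i+1] is already in the tree, then so is every shorter suffix of S[0...i+1]. These two techniques make the amortized complexity of ukk linear.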

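Leaves are created only by rule 2, and a leaf always remains a leaf ("once a leaf, always a leaf"). Any leaf created by rule 2 in the current phase therefore joins the set of leaves in the next phase, where it is handled by rule 1 (an implicit extension) rather than rule 2. Let j_star be the number of the last extension in the current phase that created a leaf by rule 2. In the next phase, rules 2 and 3 only need to be considered starting from extension j_star+1. Because all leaves are covered by implicit extensions, that part of the work is trivial; only the extensions from j_star+1 onward need a fastscan and slowscan, similar to mcc, to locate u and v (and to set the suffix link from the previous v to the current v). Rule 3 is applied at most once per phase.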
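The argument above bounds the number of explicit extensions: each one either creates a leaf (at most n leaves in total) or is the single rule 3 that ends a phase (at most n phases), so there are at most 2n of them. The sketch below counts them by brute force on the string itself, without building a tree. It relies on one observation stated here as an assumption of the sketch: after the phase for prefix S[0...i], suffix S[j...i] ends at a leaf exactly when it occurs only once in S[0...i]. `ExplicitExtensionCount`, `leafCount` and `explicitExtensions` are hypothetical names.

```java
class ExplicitExtensionCount {
    // Number of leaves after the phase for prefix s[0..i]: the suffixes
    // s[j..i] that occur exactly once in the prefix are j = 0..leafCount-1.
    static int leafCount(String s, int i) {
        String prefix = s.substring(0, i + 1);
        int j = 0;
        // s[j..i] is unique iff its first occurrence in the prefix is at position j.
        while (j <= i && prefix.indexOf(s.substring(j, i + 1)) == j) {
            j++;
        }
        return j;
    }

    // Total number of explicit extensions (rule 2, plus the first rule 3 of a phase).
    static int explicitExtensions(String s) {
        int total = 0;
        int prevLeaves = 0;
        for (int i = 0; i < s.length(); i++) {
            int leaves = leafCount(s, i);
            total += leaves - prevLeaves;   // new leaves, each created by rule 2
            if (leaves < i + 1) {
                total++;                    // the phase ended with one rule 3
            }
            prevLeaves = leaves;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(explicitExtensions("xbxb^")); // prints 7
    }
}
```

For xbxb^ (n = 5) the count is 7, within the 2n = 10 bound, whereas handling every extension explicitly would take 1+2+3+4+5 = 15 extensions.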

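In the implementation, note that a suffix link never has to be established across two adjacent phases. If the last extension of the previous phase used rule 3, the v at the end of that phase is an existing internal node and therefore already has a suffix link. If it used rule 2, it must have been extension j=i+1, so the v at the end of that phase must be root, because the last character can only be added as a leaf under root, and root's suffix link is defined to always point to itself. The next phase can therefore start directly from the u at which the previous phase ended, without considering that phase's final v (no suffix link needs to be created for it).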

Implementation of suffix tree construction with Ukkonen's algorithm:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
 
/**
 * 
 * Build Suffix Tree using Ukkonen Algorithm
 *  
 * Copyright (c) 2011 ljs (http://blog.csdn.net/ljsspace/)
 * Licensed under GPL (http://www.opensource.org/licenses/gpl-license.php) 
 * 
 * @author ljs
 * 2011-07-10
 *
 */
public class Ukkonen {
	private class SuffixNode {		
		private StringBuilder sb;
		
	    private List<SuffixNode> children = new LinkedList<SuffixNode>();
	    
	    private SuffixNode link;
	    private int start;
	    private int end;
	    private int pathlen;
	    
	    public SuffixNode(StringBuilder sb,int start,int end,int pathlen){	
	    	this.sb = sb;
	    	this.start = start;
	    	this.end = end;
	    	this.pathlen = pathlen;
	    }
	    public SuffixNode(StringBuilder sb){	    
	    	this.sb = sb;
	    	this.start = -1;
	    	this.end = -1;	    
	    	this.pathlen = 0;
	    }
	    public int getLength(){
	    	if(start == -1) return 0;
	    	else return end - start + 1;
	    }
	    public String getString(){
	    	if(start != -1){
	    		return this.sb.substring(start,end+1);
	    	}else{
	    		return "";
	    	}
	    }
	    public boolean isRoot(){
	    	return start == -1;
	    }
	    public String getCoordinate(){
	    	return "[" + start+".." + end + "/" + this.pathlen + "]";
	    }
	    public String toString(){	    	
	    	return getString() + "(" + getCoordinate() 
	    		+ ",link:" + ((this.link==null)?"N/A":this.link.getCoordinate()) 
	    		+ ",children:" + children.size() +")";
	    }	   
	}
	private class State{
		private SuffixNode u; //parent(v)
		//private SuffixNode w;  
		private SuffixNode v;  
		//private int k; //the global index of text starting from 0 to text.length()
		//private boolean finished;  
	}
	
	private SuffixNode root;
	private StringBuilder sb = new StringBuilder();
	
	public Ukkonen(){
		
	}
	
	//build a suffix-tree for a string of text
	public void  buildSuffixTree(String text) throws Exception{	
		int m = text.length();
		
		if(m==0)
			return;
		
		if(root==null){
			root = new SuffixNode(sb);		
			root.link = root; //link to itself
		}
		
		List<SuffixNode> leaves =  new ArrayList<SuffixNode>();
		
		//add first node
		sb.append(text.charAt(0));
		SuffixNode node = new SuffixNode(sb,0,0,1);
		leaves.add(node);
		root.children.add(node);	
		int j_star = 0; //j_{i-1}
		
		SuffixNode u = root;
		SuffixNode v = root;			
		for(int i=1;i<=m-1;i++){			
			//do phase i
			sb.append(text.charAt(i));
			
			//step 1: do implicit extensions 
			for(SuffixNode leafnode:leaves){
				leafnode.end++;
				leafnode.pathlen++;
			}
			
			//step 2: do explicit extensions until rule #3 is applied			
			State state = new State();	
			
			//for the first explicit extension, we reuse the last phase's u and do slowscan
			//also note: suffix link doesn't span two phases.
			int j=j_star+1;
			SuffixNode s = u;		 
			int k = s.pathlen + j;		
			state.u = s;			
			state.v = s;  
			SuffixNode newleaf = slowscan(state,s,k);
			if(newleaf == null){
				//if rule #3 is applied, then we can terminate this phase
				j_star = j - 1;
				//Note: no need to update state.v because it is not going to be used
				//at the next phase
				u = state.u;
				continue;
			}else{			
				
				j_star = j;
				leaves.add(newleaf);
				
				u = state.u;
				v = state.v;
			}		
			j++;
			
			//for other explicit extensions, we start with fast scan.
			for(;j<=i;j++){
				s = u.link;
				
				int uvLen=v.pathlen - u.pathlen;  		
				if(u.isRoot() && !v.isRoot()){
					uvLen--;
				}
				//starting with index k of the text
				k = s.pathlen + j;		
				
				
				//init state
				state.u = s;			
				state.v = s; //if uvLen = 0 
				
				//execute fast scan
				newleaf = fastscan(state,s,uvLen,k);				
				//establish the suffix link with v		
				v.link = state.v;
				
				if(newleaf == null){
					//if rule #3 is applied, then we can terminate this phase
					j_star = j - 1;
					u = state.u;
					break;
				}else{
					
					j_star = j;
					leaves.add(newleaf);
					
					u = state.u;
					v = state.v;
				}			
			}
		}
	}
	//slow scan from currNode until state.v is found
	//return the new leaf if a new one is created right after v;
	//return null otherwise (i.e. when rule #3 is applied)
	private SuffixNode slowscan(State state,SuffixNode currNode,int k){
		SuffixNode newleaf = null;
		
		boolean done = false;		
		int keyLen = sb.length() - k;
		for(int i=0;i<currNode.children.size();i++){
			SuffixNode child = currNode.children.get(i);
			
			//use min(child.key.length, key.length)			
			int childKeyLen = child.getLength();
			int len = childKeyLen<keyLen?childKeyLen:keyLen;
			int delta = 0;
			for(;delta<len;delta++){
				if(sb.charAt(k+delta) != sb.charAt(child.start+delta)){
					break;
				}
			}
			if(delta==0){//this child doesn't match	any character with the new key			
				//order keys by lexi-order
				if(sb.charAt(k) < sb.charAt(child.start)){
					//e.g. child="e" (currNode="abc")
					//	   abc                     abc
					//    /  \    =========>      / | \
					//   e    f   insert "c"     c  e  f
					int pathlen = sb.length() - k + currNode.pathlen;
					SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
					currNode.children.add(i,node);		
					//state.u = currNode; //currNode is already registered as state.u, so commented out
					state.v = currNode;
					newleaf = node;
					done = true;
					break;					
				}else{ //key.charAt(0)>child.key.charAt(0)
					//don't forget to add the largest new key after iterating all children
					continue;
				}
			}else{//current child's key partially matches with the new key	
				if(delta==len){
					if(keyLen==childKeyLen){						
						//e.g. child="ab"
						//	   ab                    ab
						//    /  \    =========>    /  \
						//   e    f   insert "ab"  e    f
						//terminate this phase  (implicit tree with rule #3)		
						state.u = child;
						state.v = currNode;
					}else if(keyLen>childKeyLen){ 
						//TODO: still need an example to test this condition
						//e.g. child="ab"
						//	   ab                      ab
						//    /  \    ==========>     / | \ 							
						//   e    f   insert "abc"   c e  f		
						//recursion
						state.u = child;
						state.v = child;
						k += childKeyLen;
						//state.k = k;
						newleaf = slowscan(state,child,k);
					}
					else{ //keyLen<childKeyLen
						//e.g. child="abc"
						//	   abc                      abc
						//    /   \      =========>     /  \ 
						//   e     f     insert "ab"   e   f	   
						//					          
						//terminate this phase  (implicit tree with rule #3)
						//state.u = currNode;
						state.v = currNode;
					}
				}else{//0<delta<len 
			
					//e.g. child="abc"
					//	   abc                     ab
					//    /  \     ==========>     / \
					//   e    f   insert "abd"    c   d 
					//                           /  \
					//                          e    f					
					//insert the new node: ab 
					int nodepathlen = child.pathlen 
							- (child.getLength()-delta);
					SuffixNode node = new SuffixNode(sb,
							child.start,child.start + delta - 1,nodepathlen); 
					node.children = new LinkedList<SuffixNode>();
					
					int leafpathlen = (sb.length() - (k + delta)) + nodepathlen;
					SuffixNode leaf = new SuffixNode(sb,
							k+delta,sb.length()-1,leafpathlen);
					
					//update child node: c
					child.start += delta;
					if(sb.charAt(k+delta)<sb.charAt(child.start)){
						node.children.add(leaf);
						node.children.add(child);
					}else{
						node.children.add(child);
						node.children.add(leaf);							
					}
					//update parent
					currNode.children.set(i, node);
					
					//state.u = currNode; //currNode is already registered as state.u, so commented out
					state.v = node;
					newleaf = leaf;			
				}
				done = true;
				break;
			}
		}
		if(!done){
			int pathlen = sb.length() - k + currNode.pathlen;
			SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
			currNode.children.add(node);
			//state.u = currNode; //currNode is already registered as state.u, so commented out
			state.v = currNode;	
			newleaf = node;
		}
		
		return newleaf;
	}
	
	
	//fast scan until state.v is found;
	//return the new leaf if a new one is created right after v;
	//return null otherwise (i.e. when rule #3 is applied)
	private SuffixNode fastscan(State state,SuffixNode currNode,int uvLen,int k){				
		if(uvLen==0){
			//state.u = currNode; //currNode is already registered as state.u, so commented out
			//continue with slow scan
			return slowscan(state,currNode,k);	
		}
		
		SuffixNode newleaf = null;
		boolean done  = false;
		for(int i=0;i<currNode.children.size();i++){
			SuffixNode child = currNode.children.get(i);
			
			if(sb.charAt(child.start) == sb.charAt(k)){				
				int len = child.getLength();
				if(uvLen==len){
					//then we find v			
					//uvLen = 0;					
					state.u = child;	
					//state.v = child;
					k += len;
					//state.k = k;
					
					//continue with slow scan
					newleaf = slowscan(state,child,k);					
				}else if(uvLen<len){
					//we know v must be an internal node; branching	and cut child short								
					//e.g. child="abc",uvLen = 2
					//	   abc                          ab
					//    /  \    ================>     / \
					//   e    f   suffix part: "abd"   c   d 
					//                                /  \
					//                               e    f				
					
					//insert the new node: ab; child is now c 
					int nodepathlen = child.pathlen 
							- (child.getLength()-uvLen);
					SuffixNode node = new SuffixNode(sb,
							child.start,child.start + uvLen - 1,nodepathlen); 
					node.children = new LinkedList<SuffixNode>();
					
					int leafpathlen = (sb.length() - (k + uvLen)) + nodepathlen;
					SuffixNode leaf = new SuffixNode(sb,
							k+uvLen,sb.length()-1,leafpathlen);
					
					//update child node: c
					child.start += uvLen;
					if(sb.charAt(k+uvLen)<sb.charAt(child.start)){
						node.children.add(leaf);
						node.children.add(child);
					}else{
						node.children.add(child);
						node.children.add(leaf);							
					}
			
					//update parent
					currNode.children.set(i, node);
					
					//uvLen = 0;
					//state.u = currNode; //currNode is already registered as state.u, so commented out
					state.v = node;				
					newleaf = leaf;
				}else{//uvLen>len
					//e.g. child="abc", uvLen = 4
					//	   abc                          
					//    /  \    ================>      
					//   e    f   suffix part: "abcde"   
					//                                
					//                  
					//jump to next node
					uvLen -= len;
					state.u = child;
					//state.v = child;
					k += len;
					//state.k = k;
					newleaf = fastscan(state,child,uvLen,k);
				}
				done = true;
				break;
			}
		}		
		if(!done){			
			//TODO: still need an example to test this condition
			//add a leaf under the currNode
			int pathlen = sb.length() - k + currNode.pathlen;
			SuffixNode node = new SuffixNode(sb,k,sb.length()-1,pathlen);
			currNode.children.add(node);
			//state.u = currNode; //currNode is already registered as state.u, so commented out
			state.v = currNode;	
			newleaf = node;
		}
		
		return newleaf;
	}
	
	//for test purpose only
	public void printTree(){
		System.out.format("The suffix tree for S = %s is: %n",this.sb);
		this.print(0, this.root);
	}
	private void print(int level, SuffixNode node){
		for (int i = 0; i < level; i++) {
            System.out.format(" ");
        }
		System.out.format("|");
        for (int i = 0; i < level; i++) {
        	System.out.format("-");
        }
        //System.out.format("%s(%d..%d/%d)%n", node.getString(),node.start,node.end,node.pathlen);
        System.out.format("(%d,%d)%n", node.start,node.end);
        for (SuffixNode child : node.children) {
        	print(level + 1, child);
        }		
	}
	public static void main(String[] args) throws Exception {
		//test suffix-tree
		System.out.println("****************************");		
		String text = "xbxb^"; //the last char must be unique!
		Ukkonen stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();
		
		System.out.println("****************************");		
		text = "mississippi^";
		stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();
		
		System.out.println("****************************");		
		text = "GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^";
		stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();
		
		System.out.println("****************************");		
		text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ^";
		stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();

		System.out.println("****************************");		
		text = "AAAAAAAAAAAAAAAAAAAAAAAAAA^";
		stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();
		
		System.out.println("****************************");		
		text = "minimize";  //the last char e is different from other chars, so it is ok.
		stree = new Ukkonen();
		stree.buildSuffixTree(text);
		stree.printTree();
		
		
		
	}
}

Test output:

****************************
The suffix tree for S = xbxb^ is: 
|(-1,-1)
 |-(4,4)
 |-(1,1)
  |--(4,4)
  |--(2,4)
 |-(0,1)
  |--(4,4)
  |--(2,4)
****************************
The suffix tree for S = mississippi^ is: 
|(-1,-1)
 |-(11,11)
 |-(1,1)
  |--(11,11)
  |--(8,11)
  |--(2,4)
   |---(8,11)
   |---(5,11)
 |-(0,11)
 |-(8,8)
  |--(10,11)
  |--(9,11)
 |-(2,2)
  |--(4,4)
   |---(8,11)
   |---(5,11)
  |--(3,4)
   |---(8,11)
   |---(5,11)
****************************
The suffix tree for S = GGGGGGGGGGGGCGCAAAAGCGAGCAGAGAGAAAAAAAAAAAAAAAAAAAAAA^ is: 
|(-1,-1)
 |-(15,15)
  |--(16,16)
   |---(17,17)
    |----(18,18)
     |-----(35,35)
      |------(36,36)
       |-------(37,37)
        |--------(38,38)
         |---------(39,39)
          |----------(40,40)
           |-----------(41,41)
            |------------(42,42)
             |-------------(43,43)
              |--------------(44,44)
               |---------------(45,45)
                |----------------(46,46)
                 |-----------------(47,47)
                  |------------------(48,48)
                   |-------------------(49,49)
                    |--------------------(50,50)
                     |---------------------(51,51)
                      |----------------------(52,53)
                      |----------------------(53,53)
                     |---------------------(53,53)
                    |--------------------(53,53)
                   |-------------------(53,53)
                  |------------------(53,53)
                 |-----------------(53,53)
                |----------------(53,53)
               |---------------(53,53)
              |--------------(53,53)
             |-------------(53,53)
            |------------(53,53)
           |-----------(53,53)
          |----------(53,53)
         |---------(53,53)
        |--------(53,53)
       |-------(53,53)
      |------(53,53)
     |-----(19,53)
     |-----(53,53)
    |----(19,53)
    |----(53,53)
   |---(19,53)
   |---(53,53)
  |--(19,19)
   |---(27,27)
    |----(32,53)
    |----(28,29)
     |-----(32,53)
     |-----(30,53)
   |---(20,20)
    |----(25,53)
    |----(21,53)
  |--(53,53)
 |-(12,12)
  |--(15,15)
   |---(16,53)
   |---(26,53)
  |--(13,13)
   |---(22,53)
   |---(14,53)
 |-(0,0)
  |--(22,22)
   |---(32,53)
   |---(23,23)
    |----(29,29)
     |-----(32,53)
     |-----(30,53)
    |----(24,53)
  |--(12,12)
   |---(15,15)
    |----(16,53)
    |----(26,53)
   |---(13,13)
    |----(22,53)
    |----(14,53)
  |--(1,1)
   |---(12,53)
   |---(2,2)
    |----(12,53)
    |----(3,3)
     |-----(12,53)
     |-----(4,4)
      |------(12,53)
      |------(5,5)
       |-------(12,53)
       |-------(6,6)
        |--------(12,53)
        |--------(7,7)
         |---------(12,53)
         |---------(8,8)
          |----------(12,53)
          |----------(9,9)
           |-----------(12,53)
           |-----------(10,10)
            |------------(12,53)
            |------------(11,53)
 |-(53,53)
****************************
The suffix tree for S = ABCDEFGHIJKLMNOPQRSTUVWXYZ^ is: 
|(-1,-1)
 |-(0,26)
 |-(1,26)
 |-(2,26)
 |-(3,26)
 |-(4,26)
 |-(5,26)
 |-(6,26)
 |-(7,26)
 |-(8,26)
 |-(9,26)
 |-(10,26)
 |-(11,26)
 |-(12,26)
 |-(13,26)
 |-(14,26)
 |-(15,26)
 |-(16,26)
 |-(17,26)
 |-(18,26)
 |-(19,26)
 |-(20,26)
 |-(21,26)
 |-(22,26)
 |-(23,26)
 |-(24,26)
 |-(25,26)
 |-(26,26)
****************************
The suffix tree for S = AAAAAAAAAAAAAAAAAAAAAAAAAA^ is: 
|(-1,-1)
 |-(0,0)
  |--(1,1)
   |---(2,2)
    |----(3,3)
     |-----(4,4)
      |------(5,5)
       |-------(6,6)
        |--------(7,7)
         |---------(8,8)
          |----------(9,9)
           |-----------(10,10)
            |------------(11,11)
             |-------------(12,12)
              |--------------(13,13)
               |---------------(14,14)
                |----------------(15,15)
                 |-----------------(16,16)
                  |------------------(17,17)
                   |-------------------(18,18)
                    |--------------------(19,19)
                     |---------------------(20,20)
                      |----------------------(21,21)
                       |-----------------------(22,22)
                        |------------------------(23,23)
                         |-------------------------(24,24)
                          |--------------------------(25,26)
                          |--------------------------(26,26)
                         |-------------------------(26,26)
                        |------------------------(26,26)
                       |-----------------------(26,26)
                      |----------------------(26,26)
                     |---------------------(26,26)
                    |--------------------(26,26)
                   |-------------------(26,26)
                  |------------------(26,26)
                 |-----------------(26,26)
                |----------------(26,26)
               |---------------(26,26)
              |--------------(26,26)
             |-------------(26,26)
            |------------(26,26)
           |-----------(26,26)
          |----------(26,26)
         |---------(26,26)
        |--------(26,26)
       |-------(26,26)
      |------(26,26)
     |-----(26,26)
    |----(26,26)
   |---(26,26)
  |--(26,26)
 |-(26,26)
****************************
The suffix tree for S = minimize is: 
|(-1,-1)
 |-(7,7)
 |-(1,1)
  |--(4,7)
  |--(2,7)
  |--(6,7)
 |-(0,1)
  |--(2,7)
  |--(6,7)
 |-(2,7)
 |-(6,7)

posted @ 2011-07-10 22:17  ljsspace