Double-Array Trie分词词典简述

http://www.xuebuyuan.com/1991441.html

一、TRIE树简介（以下简称T树）

TRIE树用于确定词条的快速检索，对于给定的一个字符串a₁,a₂,a₃,…a_n，则采用TRIE

树搜索经过最多n次匹配即可完成一次查找，而与词库中词条的数目无关。它的缺点是空间空闲率高。

二、Double-Array Trie（双数组索引树，以下简称DAT）

1）、DAT简介

DAT是TRIE树的一种变形，它是在保证TRIE树检索速度的前提下，提高空间利用率而提出的一种数据结构。它本质是一个确定的有限状态自动机（DFA），每个节点代表自动机的一个状态，根据变量的不同，进行状态转移，当到达结束状态或者无法转移的时候，完成查询。

2）、DAT结构

DAT是采用两个线性数组（姑且叫它们为base和check数组）进行TRIE树的保存， base和check数组拥有一致的下标，（下标）即DFA中的每一个状态，也即TRIE树中所说的节点，base数组用于确定状态的转移，check数组用于检验转移的正确性。

于是：我们有如下

[定义1]：从状态s输入c到状态t的一个转移必须满足如下条件

base[s] + c == t
check[base[s] + c] == s

3)、DAT匹配

基于[定义1] DAT的匹配过程如下：

假设当前状态为s，输入字符为c。

t = base[s] + c;

if check[t] = s then

next state = t;

else

fail;

endif

3）、DAT构造

基于[定义1] DAT的构造过程如下：

root_index = 1;

Procedure daInsertBranch(String key)
begin
   index = root_index;
   for i = 0 to key.length()
   begin
      character c = key.get(i)
      t = base[index] + c;       1
           [ 。。。此处执行冲突处理。。。]
      check[t] = index;           2
      index = t;
   end
   base[t] *= -1;
end

4)、DAT冲突处理

在执行3的过程中，有可能在1处插入状态t时该位置已经被其他状态 t1所占用，这就产生了冲突。

解决冲突的基本思想是为t以及t的所有兄弟状态重新寻找一个合适的状态，相当于寻找一个合适的数组下标。

//  寻找适当的base值，也相当于为所有子状态寻找合适的下标
Procedure intdaFindBase(character c, int oldbase_index)
begin
   if check[ base[oldbase_index] + c ] != 0 then
   begin
      foreach character a in ALPHABET(字母表)
      begin
        if check[ base[oldbase_index] + a ] != 0 then
              Add a to child_list;
      end
      Add c to child_list;
      base[oldbase_index]++;
while ( not fit each character )
begin
        base[oldbase_index]++;
end
   end
   return base[oldbase_index];
end

// 重新分配
Procedure intdaRelocateBase (int old_index, int new_index)
begin
    //拷贝所有节点到新的位置，并修改被拷贝节点的所有子节点的check值以保证
    //在移动之后仍然是其子节点
    foreach character c in child_list
        begin
            copy cell from old_index to new_index
            begin
               get all childs of old_index;
               check[child] = new_index;
          end
          //释放所有旧的节点
         free old_index cell;
      end
     base[oldbase_index] = newbase;
end

冲突处理位于3）构造中的 2 前面

posted @ 2014-11-03 18:21 Django's blog 阅读(563) 评论(0) 编辑收藏举报

刷新页面返回顶部

Django's blog

Double-Array Trie分词词典简述

公告