Contents

  1. What Is a Trie?
  2. What Does It Take to Implement a Trie?
  3. Triple-Array Trie
  4. Double-Array Trie
  5. Suffix Compression
  6. Key Insertion
  7. Key Deletion
  8. Double-Array Pool Allocation
  9. An Implementation
  10. Download
  11. Other Implementations
  12. References

What Is a Trie?

A trie is a kind of digital search tree. (See [Knuth1972] for details of digital search trees.) [Fredkin1960] introduced the trie terminology, which is abbreviated from "retrieval".

A trie is an efficient indexing method. It is indeed also a kind of deterministic finite automaton (DFA) (see [Cohen1990], for example, for the definition of DFA). Within the tree structure, each node corresponds to a DFA state, and each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, the characters of the key string are taken one by one to determine the next state to go to: the edge labeled with the current character is the one to follow. Notice that each such step consumes one character from the key and descends one level down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character or because the key is exhausted at an internal node, then the key is simply not recognized by the trie.

Notice that the time needed to traverse from the root to a leaf does not depend on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than a B-tree or any comparison-based indexing method in the general case. Its time complexity is comparable with that of hashing techniques.

In addition to efficiency, a trie also provides flexibility in searching for the closest path when the key is misspelled. For example, by skipping a certain character of the key while walking, we can fix an insertion typo. By walking toward all the immediate children of a node without consuming a character from the key, we can fix a deletion typo, or even a substitution typo if we simply drop the key character that has no branch to follow and descend to all the immediate children of the current node.
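
As a rough illustration of such an approximate walk, here is a minimal C sketch over a plain pointer-based trie (not the array-based structures discussed below). The node layout, ALPHABET_SIZE, and the edit budget are assumptions made for the example, not part of any library described here.

#include <stdbool.h>
#include <stddef.h>

#define ALPHABET_SIZE 26                       /* assumed alphabet: 'a'..'z' */

typedef struct TrieNode {
    bool is_leaf;                              /* a key ends at this node     */
    struct TrieNode *child[ALPHABET_SIZE];     /* one outgoing edge per label */
} TrieNode;

/* Walk the trie allowing at most max_edits edits in the key:
 *  - skipping a key character               -> repairs an insertion typo
 *  - descending without consuming one       -> repairs a deletion typo
 *  - dropping the character and descending  -> repairs a substitution typo */
static bool approx_match(const TrieNode *node, const char *key, int max_edits)
{
    if (node == NULL)
        return false;

    if (*key == '\0' && node->is_leaf)
        return true;                           /* key exhausted at a leaf */

    if (*key != '\0') {
        int c = *key - 'a';                    /* try the exact step first */
        if (c >= 0 && c < ALPHABET_SIZE &&
            approx_match(node->child[c], key + 1, max_edits))
            return true;
    }

    if (max_edits == 0)
        return false;

    /* insertion typo: skip the current key character, stay on this node */
    if (*key != '\0' && approx_match(node, key + 1, max_edits - 1))
        return true;

    for (int i = 0; i < ALPHABET_SIZE; i++) {
        /* deletion typo: descend one level without consuming a character */
        if (approx_match(node->child[i], key, max_edits - 1))
            return true;
        /* substitution typo: drop the key character and descend */
        if (*key != '\0' &&
            approx_match(node->child[i], key + 1, max_edits - 1))
            return true;
    }
    return false;
}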

What Does It Take to Implement a Trie?

In general, a DFA is represented by a transition table, in which the rows correspond to the states and the columns correspond to the transition labels. The data kept in each cell is then the next state to go to from a given state when the input equals the label.

This is an efficient method for traversal, because every transition can be computed by two-dimensional array indexing. However, in terms of space usage, it is rather extravagant, because, in the case of a trie, most nodes have only a few branches, leaving the majority of the table cells blank.

A more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.
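
The trade-off between the two representations can be sketched roughly as follows; the type names and sizes are purely illustrative. The table form answers a transition with a single two-dimensional index but wastes most of its cells on a trie, while the list form is compact but needs a linear scan per step.

#define NUM_STATES    1024      /* illustrative sizes */
#define ALPHABET_SIZE 256

/* Full transition table: next[s][c] is the next state, or -1 if there is no
 * edge labeled c out of state s.  Lookup is one 2-D index, but for a trie
 * most cells stay at -1.                                                    */
typedef struct {
    int next[NUM_STATES][ALPHABET_SIZE];
} TableDfa;

/* Per-state linked list of transitions: compact, but finding the edge
 * labeled c requires scanning the list of that state.                      */
typedef struct Transition {
    unsigned char      label;        /* input character c               */
    int                next_state;   /* state reached via this edge     */
    struct Transition *next;         /* next edge out of the same state */
} Transition;

typedef struct {
    Transition *edges[NUM_STATES];   /* head of each state's edge list */
} ListDfa;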

Hence, table compression techniques that still allow fast access have been devised to solve the problem:

  1. [Johnson1975] (also explained in [Aho+1985] pp. 144-146) represented a DFA with four arrays, which can be simplified to three in the case of a trie. The transition table rows are allocated in an overlapping manner, allowing the free cells to be used by other rows.
  2. [Aoe1989] proposed an improvement on the three-array structure by reducing the arrays to two.

Triple-Array Trie

As explained in [Aho+1985] pp. 144-146, DFA compression can be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than a lexical analyzer, such as a plain trie for information retrieval, the default array can be omitted. Thus, a trie can be implemented using three arrays under this scheme.

Structure

The triple-array structure is composed of:

  1. base. Each element in base corresponds to a node of the trie. For a trie node s, base[s] is the starting index within the next and check pool (to be explained later) of the row of node s in the transition table.
  2. next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows of the trie transition table. The vector data, that is, the vector of transitions out of every node, is stored in this array.
  3. check. This array works in parallel with next. It marks the owner of every cell in next. This allows cells next to one another to be allocated to different trie nodes; that is, the sparse transition vectors of different nodes are allowed to overlap.

Definition 1. For a transition from state s to t which takes character c as input, the condition maintained in the triple-array trie is:

check[base[s] + c] = s
next[base[s] + c] = t

Walking

According to definition 1, the walking algorithm for a
given state s and the input character c is:

t := base[s] + c;
if check[t] = s then
    next state := next[t]
else
    fail
endif
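
A direct C rendering of this walking step might look as follows; the struct, the bounds check, and the TRIE_FAIL value are additions assumed for the sketch.

#include <stddef.h>

#define TRIE_FAIL (-1)

/* The three parallel structures of Definition 1; sizes are illustrative. */
typedef struct {
    int   *base;      /* base[s]: start of state s's row in the next/check pool */
    int   *next;      /* next[base[s] + c]: target of the edge labeled c        */
    int   *check;     /* check[base[s] + c] = s iff that cell belongs to s      */
    size_t pool_size; /* number of cells in the next/check pool                 */
} TripleArrayTrie;

/* One walking step: from state s, consume input character c. */
static int triple_walk(const TripleArrayTrie *trie, int s, unsigned char c)
{
    size_t t = (size_t) trie->base[s] + c;
    if (t < trie->pool_size && trie->check[t] == s)
        return trie->next[t];        /* the cell really belongs to s */
    return TRIE_FAIL;                /* no edge labeled c out of s   */
}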

Construction

To insert a transition that takes character c to traverse from a state s to
another state t, the cell next[base[s] + c] must be made available. If it is
already vacant, we are lucky. Otherwise, either the entire transition vector
of the current owner of the cell or that of the state s itself must be
relocated. The estimated cost of each case can determine which one to move.
After finding free slots to place the vector, the transition vector must be
recalculated as follows. Assuming the new place begins at b, the relocation
procedure is:

Procedure Relocate(s : state; b : base_index)
{ Move base for state s to a new place beginning at b }
begin
    foreach input character c for the state s
    { i.e. foreach c such that check[base[s] + c] = s }
    begin
        check[b + c] := s;                  { mark owner }
        next[b + c] := next[base[s] + c];   { copy data }
        check[base[s] + c] := none          { free the cell }
    end;
    base[s] := b
end

Double-Array Trie

The triple-array structure for implementing a trie appears to be well
defined, but it is still not practical to keep in a single file. The
next/check pool can be kept in a single array of integer pairs, but the base
array does not grow in parallel with the pool, and therefore usually has to
be stored separately.

To solve this problem, [Aoe1989] reduced the structure to two parallel
arrays. In the double-array structure, base and next are merged, resulting
in only two parallel arrays, namely, base and check.

Structure

Instead of referencing indirectly through state numbers as in the
triple-array trie, nodes in the double-array trie are linked directly within
the base/check pool.

Definition 2. For a transition from state s to t which takes character c as
input, the condition maintained in the double-array trie is:

check[base[s] + c] = s
base[s] + c = t

Walking

According to definition 2, the walking algorithm for a
given state s and the input character c is:

t := base[s] + c;
if check[t] = s then
    next state := t
else
    fail
endif
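
The same step in C, as a minimal sketch under Definition 2 (the struct and the bounds check are assumptions); note that the next state is the cell index itself.

#include <stddef.h>

#define TRIE_FAIL (-1)

/* The two parallel arrays of Definition 2; size is the number of cells. */
typedef struct {
    int   *base;
    int   *check;
    size_t size;
} DoubleArrayTrie;

/* One walking step: from state s, consume input character c. */
static int da_walk(const DoubleArrayTrie *trie, int s, unsigned char c)
{
    size_t t = (size_t) trie->base[s] + c;
    if (t < trie->size && trie->check[t] == s)
        return (int) t;              /* the target state is the cell index */
    return TRIE_FAIL;
}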

Construction

The construction of the double-array trie is in principle the same as that
of the triple-array trie. The difference is in the base relocation:

Procedure Relocate(s : state; b : base_index)
{ Move base for state s to a new place beginning at b }
begin
    foreach input character c for the state s
    { i.e. foreach c such that check[base[s] + c] = s }
    begin
        check[b + c] := s;                      { mark owner }
        base[b + c] := base[base[s] + c];       { copy data }
        { the node base[s] + c is to be moved to b + c;
          hence, for any i for which check[i] = base[s] + c, update check[i] to b + c }
        foreach input character d for the node base[s] + c
        begin
            check[base[base[s] + c] + d] := b + c
        end;
        check[base[s] + c] := none              { free the cell }
    end;
    base[s] := b
end
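
A C transcription of this procedure could look like the sketch below. NONE and MAX_CHAR are assumptions, free-list bookkeeping (described later in this article) is omitted, and the caller is assumed to have verified that the cells under the new base b are free.

#define NONE     (-1)     /* assumed marker for a free check cell */
#define MAX_CHAR 256      /* assumed upper bound on edge labels   */

typedef struct {
    int *base;
    int *check;
    int  size;
} DoubleArrayTrie;

/* Move the transition vector of state s so that it begins at base index b. */
static void da_relocate(DoubleArrayTrie *da, int s, int b)
{
    int old_base = da->base[s];

    for (int c = 0; c < MAX_CHAR; c++) {
        int old_cell = old_base + c;
        if (old_cell >= da->size || da->check[old_cell] != s)
            continue;                                /* no edge labeled c out of s */

        int new_cell = b + c;
        da->check[new_cell] = s;                     /* mark owner */
        da->base[new_cell]  = da->base[old_cell];    /* copy data  */

        /* The node old_cell moves to new_cell, so every child whose check
         * points at old_cell must now point at new_cell.                   */
        for (int d = 0; d < MAX_CHAR; d++) {
            int grandchild = da->base[old_cell] + d;
            if (grandchild >= 0 && grandchild < da->size &&
                da->check[grandchild] == old_cell)
                da->check[grandchild] = new_cell;
        }
        da->check[old_cell] = NONE;                  /* free the cell */
    }
    da->base[s] = b;
}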

Suffix Compression

[Aoe1989] also suggested a storage compression strategy: non-branching
suffixes are moved out into single-string storage, called tail, so that the
remaining non-branching steps are reduced to a mere string comparison.

With the two separate data structures, the double-array branches and the
suffix-pool tail, the key insertion and deletion algorithms must be modified
accordingly.
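
As a minimal sketch of how a lookup now works, assume that a separate node stores the negated index of its suffix in tail in its base cell (a common convention; the exact layout in a real implementation may differ), and that suffixes here are NUL-terminated rather than ending with the '#' terminator used in the examples below.

#include <stdbool.h>
#include <string.h>

#define ROOT_STATE 1      /* assumed: the root node lives in cell 1 */

typedef struct {
    int   *base;    /* base[s] >= 0: ordinary node; base[s] < 0: separate node,
                       and -base[s] is the start of its suffix in tail         */
    int   *check;
    int    size;
    char  *tail;    /* suffix pool; each suffix is NUL-terminated here         */
} DatTrie;

/* Look up a key: walk the double array until a separate node is reached,
 * then finish with a plain string comparison against the stored suffix.  */
static bool dat_lookup(const DatTrie *trie, const unsigned char *key)
{
    int s = ROOT_STATE;

    for (; *key != '\0'; key++) {
        if (trie->base[s] < 0)               /* reached a separate node */
            return strcmp((const char *) key,
                          trie->tail + (-trie->base[s])) == 0;

        int t = trie->base[s] + *key;
        if (t >= trie->size || trie->check[t] != s)
            return false;                    /* no edge labeled with this character */
        s = t;
    }

    /* Key exhausted inside the branch structure: accept only if s is a
     * separate node whose stored suffix is empty.                        */
    return trie->base[s] < 0 && trie->tail[-trie->base[s]] == '\0';
}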

Key Insertion

To insert a new key, the branching position can be found by walking the trie
with the key, one character at a time, until it gets stuck. The state where
there is no branch to go is the very place to insert a new edge, labeled with
the failing character. However, with the branch-tail structure, the insertion
point can be either in the branch structure or in the tail.

1. When the branching point is in the double-array structure

Suppose that the new key is a string a_1 a_2 ... a_{h-1} a_h a_{h+1} ... a_n,
where a_1 a_2 ... a_{h-1} traverses the trie from the root to a node s_r in
the double-array structure, and there is no edge labeled a_h that goes out of
s_r. The algorithm called A_INSERT in [Aoe1989] does the following:

From s_r, insert an edge labeled a_h to a new node s_t;
Let s_t be a separate node pointing to the string a_{h+1} ... a_n in the tail pool.

2. When the branching point is in the tail pool

Since the path through a tail string has no branch, and therefore corresponds
to exactly one key, suppose that the key corresponding to the tail is

a_1 a_2 ... a_{h-1} a_h ... a_{h+k-1} b_1 ... b_m,

where a_1 a_2 ... a_{h-1} is in the double-array structure, and
a_h ... a_{h+k-1} b_1 ... b_m is in the tail. Suppose that the substring
a_1 a_2 ... a_{h-1} traverses the trie from the root to a node s_r.

And suppose that the new key is of the form

a_1 a_2 ... a_{h-1} a_h ... a_{h+k-1} a_{h+k} ... a_n,

where a_{h+k} <> b_1. The algorithm called B_INSERT in [Aoe1989] does the
following:

From s_r, insert a straight path with a_h ... a_{h+k-1}, ending at a new node s_t;
From s_t, insert an edge labeled b_1 to a new node s_u;
Let s_u be a separate node pointing to the string b_2 ... b_m in the tail pool;
From s_t, insert an edge labeled a_{h+k} to a new node s_v;
Let s_v be a separate node pointing to the string a_{h+k+1} ... a_n in the tail pool.
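
A sketch of the double-array side of this insertion (case 1, A_INSERT) is given below. find_free_base, da_relocate, and tail_add are assumed helpers standing in for the free-slot search, the relocation, and the tail-pool management described in this article; pool growth and error handling are left out. Case 2 would, in addition, move the common prefix a_h ... a_{h+k-1} out of the old tail block and then perform two such branch insertions under s_t.

typedef struct {
    int   *base;        /* base[s] < 0 marks a separate node; -base[s] indexes tail */
    int   *check;       /* check[t] >= 0: cell t is owned by state check[t]         */
    int    size;
    char  *tail;
} DatTrie;

/* Assumed helpers (not defined here). */
int  find_free_base(DatTrie *t, const unsigned char *labels, int nlabels);
void da_relocate(DatTrie *t, int s, int new_base);
int  tail_add(DatTrie *t, const char *suffix);   /* append suffix, return its index */

/* Case 1 (A_INSERT): the walk got stuck at node s_r inside the double-array
 * structure; a_h is the failing character and rest is a_{h+1}...a_n.        */
static void insert_branch(DatTrie *t, int s_r, unsigned char a_h, const char *rest)
{
    int cell = t->base[s_r] + a_h;

    if (cell >= t->size || t->check[cell] >= 0) {
        /* The wanted cell is taken (or out of range): collect the labels of
         * s_r's current edges plus a_h, find a base under which all of them
         * are free, and relocate s_r's transition vector there.             */
        unsigned char labels[256];
        int n = 0;
        for (int c = 0; c < 256; c++) {
            int old = t->base[s_r] + c;
            if ((unsigned char) c == a_h ||
                (old < t->size && t->check[old] == s_r))
                labels[n++] = (unsigned char) c;
        }
        da_relocate(t, s_r, find_free_base(t, labels, n));
        cell = t->base[s_r] + a_h;
    }

    /* New separate node s_t, owned by s_r and pointing into the tail pool. */
    t->check[cell] = s_r;
    t->base[cell]  = -tail_add(t, rest);
}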

Key Deletion

To delete a key from the trie, all we need to do is delete the tail block
occupied by the key, and all double-array nodes belonging exclusively to the
key, without touching any node belonging to other keys.

Consider a trie which accepts a language K = {pool#, prepare#, preview#,
prize#, produce#, producer#, progress#} :

The key "pool#" can be deleted by removing the tail string "ol#" from the
tail pool, and node 3 from the double-array structure. This is the simplest
case.

To remove the key "produce#", it is sufficient to delete node 14 from the
double-array structure. But the resulting trie will not obay the convention
that every node in the double-array structure, except the separate nodes which
point to tail blocks, must belong to more than one key. The path from node 10
on will belong solely to the key "producer#".

But there is no harm in violating this rule. The only drawback is the
reduced compactness of the trie; the traversal, insertion and deletion
algorithms remain intact. Therefore, the convention should be relaxed, for
the sake of simplicity and efficiency of the deletion algorithm. Otherwise,
extra steps would be needed to examine the other keys in the same subtree
("producer#" for the deletion of "produce#") whenever a node could be moved
from the double-array structure back to the tail pool.

Suppose further that having removed "produce#" as such (by removing only
node 14), we also need to remove "producer#" from the trie. What we have to do
is remove string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now
belong solely to the key "producer#") from the double-array structure.

We can thus summarize the algorithm to delete a key
k = a_1 a_2 ... a_{h-1} a_h ... a_n, where a_1 a_2 ... a_{h-1} is in the
double-array structure and a_h ... a_n is in the tail pool, as follows:

Let s_r := the node reached by a_1 a_2 ... a_{h-1};
Delete a_h ... a_n from tail;
s := s_r;
repeat
    p := parent of s;
    Delete node s from the double-array structure;
    s := p
until s = root or outdegree(s) > 0.

Here, outdegree(s) is the number of child nodes of s.
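
The same loop in C, as a sketch; ROOT_STATE, NONE, and the two helpers are assumptions, and the free-list handling of the next section is omitted.

#define ROOT_STATE 1     /* assumed: the root node lives in cell 1 */
#define NONE      (-1)   /* assumed free-cell marker               */

typedef struct {
    int *base;           /* base[s] < 0 on separate nodes: -base[s] indexes tail */
    int *check;          /* check[s] is the parent (owner) of cell s             */
    int  size;
} DatTrie;

/* Assumed helpers for the parts of the structure not shown here. */
void tail_delete(DatTrie *t, int tail_index);
int  da_outdegree(const DatTrie *t, int s);     /* number of children of s */

/* Delete the key whose walk ends at the separate node s_r (reached by
 * a_1...a_{h-1}); the tail block of s_r holds a_h...a_n.                */
static void dat_delete(DatTrie *t, int s_r)
{
    tail_delete(t, -t->base[s_r]);       /* drop a_h...a_n from the tail pool */

    int s = s_r;
    while (s != ROOT_STATE) {
        int p = t->check[s];             /* the parent of s owns its cell        */
        t->check[s] = NONE;              /* delete node s from the double array  */
        s = p;
        if (da_outdegree(t, s) > 0)
            break;                       /* s still serves other keys            */
    }
}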

Double-Array Pool Allocation

When inserting a new branch for a node, it is possible that the array element
for the new branch has already been allocated to another node. In that case,
relocation is needed. The efficiency-critical part then turns out to be the
search for a new place. A brute-force algorithm iterates along the check
array to find an empty cell to place the first branch, and then assures that
there are empty cells for all the other branches as well. The time used is
therefore proportional to the size of the double-array pool and the size of
the alphabet.

Suppose that there are n nodes in the trie, and the alphabet is of size m.
The size of the double-array structure would be n + cm, where c is a
coefficient that depends on the characteristics of the trie. And the time
complexity of the brute-force algorithm would be O(nm + cm^2).

[Aoe1989] proposed keeping a free-space list in the double-array structure
to make the time complexity independent of the size of the trie, and
dependent only on the number of free cells. The check values of the free
cells are redefined to keep a pointer to the next free cell (called G-link):

Definition 3. Let r_1, r_2, ..., r_{cm} be the free cells in the
double-array structure, ordered by position. G-link is defined as follows:

check[0] = -r_1
check[r_i] = -r_{i+1} ;  1 <= i <= cm-1
check[r_{cm}] = -1

By this definition, a negative check means unoccupied, in the same sense as
the "none" check in the ordinary algorithm. This encoding scheme forms a
singly-linked list of free cells. When searching for an empty cell, only the
cm free cells are visited, instead of all n + cm cells as in the brute-force
algorithm.

This, however, can still be improved. Notice that for the cells with
negative check, the corresponding base values are not given any definition.
Therefore, in our implementation, Aoe's G-link is modified into a
doubly-linked list by letting the base of every free cell point to the
previous free cell. This speeds up the insertion and deletion processes. And,
for convenience in referencing the list head and tail, we let the list be
circular: the zeroth node is dedicated to being the entry point of the list,
and the root node of the trie begins at cell number one.

Definition 4. Let r_1, r_2, ..., r_{cm} be the free cells in the
double-array structure, ordered by position. G-link is defined as follows:

check[0] = -r_1
check[r_i] = -r_{i+1} ;  1 <= i <= cm-1
check[r_{cm}] = 0
base[0] = -r_{cm}
base[r_1] = 0
base[r_{i+1}] = -r_i ;  1 <= i <= cm-1

Then, the search for a slot for a node with input symbol set
P = {c_1, c_2, ..., c_p} needs to iterate over only the cells with negative
check:

{ find the least free cell s such that s > c_1 }
s := -check[0];
while s <> 0 and s <= c_1 do
    s := -check[s]
end;
if s = 0 then return FAIL;    { or reserve some additional space }

{ continue searching for the row, given that s matches c_1 }
while s <> 0 do
    i := 2;
    while i <= p and check[s + c_i - c_1] < 0 do
        i := i + 1
    end;
    if i = p + 1 then return s - c_1;    { all required cells are free, so return the base }
    s := -check[s]
end;
return FAIL;    { or reserve some additional space }
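
For reference, the same search written in C might look as follows; the struct fields, the FAIL value, and the assumption that the labels arrive sorted with the smallest first are illustrative choices, and a real implementation would grow the pool instead of failing.

#define FAIL (-1)

typedef struct {
    int *base;
    int *check;   /* check[i] < 0 on free cells: -check[i] is the next free cell;
                     check[0] holds the head of the free list                     */
    int  size;
} DatTrie;

/* Find a base index under which every input symbol c[0..p-1] (sorted, c[0]
 * smallest) lands on a free cell; return FAIL if the free list is exhausted. */
static int find_free_base(const DatTrie *t, const unsigned char *c, int p)
{
    /* find the least free cell s such that s > c[0] */
    int s = -t->check[0];
    while (s != 0 && s <= c[0])
        s = -t->check[s];
    if (s == 0)
        return FAIL;

    /* follow the free list until some free cell can host the symbol c[0] */
    while (s != 0) {
        int i = 1;
        while (i < p) {
            int cell = s + c[i] - c[0];
            if (cell >= t->size || t->check[cell] >= 0)
                break;                   /* that cell is occupied or out of range */
            i++;
        }
        if (i == p)
            return s - c[0];             /* every required cell is free */
        s = -t->check[s];
    }
    return FAIL;
}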

The time complexity of the free-slot search is thus reduced to O(cm^2). The
relocation stage takes O(m^2). The total time complexity is therefore
O(cm^2 + m^2) = O(cm^2).

It is useful to keep the free list ordered by position, so that accesses
through the array become more sequential. This is beneficial when the trie
is stored in a disk file or in virtual memory, because disk caching or page
swapping is then used more efficiently. So, the operation that returns a
freed cell s to the list should maintain this ordering:

t := -check[0];
while t <> 0 and t < s do
    t := -check[t]
end;
{ t now points to the free cell that should follow s }
check[s] := -t;
check[-base[t]] := -s;
base[s] := base[t];
base[t] := -s;

The time complexity of freeing a cell is thus O(cm).
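
The same operation in C, under the link conventions of Definition 4 (a sketch; bounds checks and pool growth are ignored):

typedef struct {
    int *base;    /* -base[i] on free cells: the previous free cell */
    int *check;   /* -check[i] on free cells: the next free cell    */
    int  size;
} DatTrie;

/* Return cell s to the circular, position-ordered free list.  Forward links
 * are kept negated in check, backward links negated in base, and cell 0 is
 * the sentinel that ties the two ends of the list together.                */
static void free_cell(DatTrie *da, int s)
{
    /* find the free cell that should come right after s (0 = sentinel) */
    int next = -da->check[0];
    while (next != 0 && next < s)
        next = -da->check[next];

    int prev = -da->base[next];      /* the cell currently preceding next */

    da->check[prev] = -s;            /* prev -> s    (forward)  */
    da->check[s]    = -next;         /* s    -> next (forward)  */
    da->base[next]  = -s;            /* next -> s    (backward) */
    da->base[s]     = -prev;         /* s    -> prev (backward) */
}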

An Implementation

In my implementation, I designed the API with persistent data in mind. Tries
can be saved to disk and loaded for later use. In newer versions,
non-persistent usage is also possible: you can create a trie in memory,
populate it with data, use it, and free it, without any disk I/O.
Alternatively, you can load a trie from disk and save it back to disk
whenever you want.

The trie data is portable across platforms. The byte order on disk is always
little-endian, and it is read correctly on both little-endian and big-endian
systems.

The trie index is a 32-bit signed integer. This allows 2,147,483,646
(2^31 - 2) total nodes in the trie data, which should be sufficient for most
problem domains. Each data entry can also store an associated 32-bit integer
value, which can be used for any purpose. If you do not need it, just store
a dummy value.

For sparse-data compactness, the trie alphabet set should be contiguous, but
that is usually not the case for general character sets. Therefore, a mapping
between the input characters and the low-level alphabet set of the trie is
placed in between. You define your input character set by listing its
contiguous ranges of character codes in a .abm (alphabet map) file when
creating a trie. Each character is then automatically assigned an internal
code from a contiguous range.
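
The idea of the mapping can be illustrated with the toy sketch below; the ranges and names are made up for the example and are not the actual .abm syntax or the library API.

#include <stdio.h>

/* One contiguous range of raw character codes admitted into the trie. */
typedef struct {
    unsigned int begin;
    unsigned int end;            /* inclusive */
} AlphaRange;

/* Hypothetical alphabet: ASCII letters plus the Thai block. */
static const AlphaRange ranges[] = {
    { 0x0041, 0x005A },          /* 'A'..'Z' */
    { 0x0061, 0x007A },          /* 'a'..'z' */
    { 0x0E01, 0x0E5B },          /* Thai     */
};
enum { NUM_RANGES = sizeof ranges / sizeof ranges[0] };

/* Map a raw character code to a small, contiguous internal code
 * (1, 2, 3, ...), or 0 if the character is outside the alphabet. */
static unsigned int to_internal(unsigned int ch)
{
    unsigned int offset = 1;     /* 0 is reserved for "not mapped" */
    for (int i = 0; i < NUM_RANGES; i++) {
        if (ch >= ranges[i].begin && ch <= ranges[i].end)
            return offset + (ch - ranges[i].begin);
        offset += ranges[i].end - ranges[i].begin + 1;
    }
    return 0;
}

int main(void)
{
    printf("'a' -> %u, 'z' -> %u\n", to_internal('a'), to_internal('z'));
    return 0;
}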

Download

Update: The double-array trie implementation has been simplified and
rewritten from scratch in C, and is now named libdatrie. It is available
under the terms of the GNU Lesser General Public License (LGPL):

SVN: svn co http://linux.thai.net/svn/software/datrie

The old C++ source code is also released under the terms of the GNU Lesser
General Public License (LGPL).

Other Implementations

References

    1. [Knuth1972]
       Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching.
       Addison-Wesley. 1972.
    2. [Fredkin1960]
       Fredkin, E. Trie Memory. Communications of the ACM.
       Vol. 3:9 (Sep 1960). pp. 490-499.
    3. [Cohen1990]
       Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.
    4. [Johnson1975]
       Johnson, S. C. Yacc: Yet Another Compiler-Compiler.
       Bell Laboratories, NJ. Computing Science Technical Report 32. pp. 1-34. 1975.
    5. [Aho+1985]
       Aho, A. V., Sethi, R., Ullman, J. D. Compilers: Principles, Techniques, and Tools.
       Addison-Wesley. 1985.
    6. [Aoe1989]
       Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure.
       IEEE Transactions on Software Engineering.
       Vol. 15, No. 9 (Sep 1989). pp. 1066-1077.
    7. [Virach+1993]
       Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai.
       Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie.
       5th Annual Conference. National Electronics and Computer Technology Center.
       Bangkok. 1993. pp. 197-206. (in Thai)
