原文链接:http://www.codeproject.com/Articles/9680/Persistent-Data-Structures

Introduction

When you hear the word persistence in programming, most often, you think of an application saving its data to some type of storage, such as a database, so that the data can be retrieved later when the application is run again. There is, however, another
meaning for the word persistence when it is used to describe data structures, particularly those used in functional programming languages. In that context, a persistent data structure is a data structure capable of preserving the current version of itself
when modified. In essence, a persistent data structure is immutable.

An example of a class that uses this type of persistence in the .NET Framework is thestring class. Once a
string object is created, it cannot be changed. Any operation that appears to change a string generates a new string instead. Thus, each version of astring
object can be preserved. An advantage for a persistent class like thestring class is that it basically gives you undo functionality built-in. As newer versions of a persistent object are created, older
versions can be pushed onto a stack and popped off when you want to undo an operation. Another advantage is that because persistent data structures cannot change state, they are easier to reason about and are thread safe.

There is an overhead that comes with persistent data structures, however. Each operation that changes a persistent data structure creates a new version of that data structure. This can involve a good deal of copying to create the new version. This cost can
be mitigated to a large degree by reusing as much of the internal structure of the old version in creating a new one. I will explore this idea in making two common data structures persistent: the singly linked list and the binary tree, and describe a third
data structure that combines the two. I will also describe several classes I have created that are persistent versions of some of the classes in theSystem.Collections namespace.

top

Persistent Singly Linked Lists

The singly linked list is one of the most widely used data structures in programming. It consists of a series of nodes linked together one right after the other. Each node has a reference to the node that comes after it, and the last node in the list terminates
with a null reference. To traverse a singly linked list, you begin at the head of the list and move from one node to the next until you have reached the node you are looking for or have reached the last node:

Let's insert a new item into the list. This list is not persistent, meaning that it can be changed in-place without generating a new version. After taking a look at the insertion operation on a non-persistent list, we'll look at the same operation on a persistent
list.

Inserting a new item into a singly linked list involves creating a new node:

We will insert the new node at the fourth position in the list. First, we traverse the list until we've reached that position. Then the node that will precede the new node is unlinked from the next node...

...and relinked to the new node. The new node is, in turn, linked to the remaining nodes in the list:

Inserting a new item into a persistent singly linked list will not alter the existing list but create a new version with the item inserted into it. Instead of copying the entire list and then inserting the item into the copy, a better strategy is to reuse
as much of the old list as possible. Since the nodes themselves are persistent, we don't have to worry about aliasing problems.

To insert a new node at the fourth position, we traverse the list as before only copying each node along the way. Each copied node is linked to the next copied node:

The last copied node is linked to the new node, and the new node is linked to the remaining nodes in the old list:

On an average, about N/2 nodes will be copied in the persistent version for insertions and deletions, where N equals the number of nodes in the list. This isn't terribly efficient but does give us some savings. One persistent data structure where this approach
to singly linked list buys us a lot is the stack. Imagine the above data structure with insertions and deletions restricted to the head of the list. In this case, N nodes can be reused for pushing items onto a stack and N - 1 nodes can be reused for popping
a stack.

top

Persistent Binary Trees

A binary tree is a collection of nodes in which each node contains two links, one to its left child and another to its right child. Each child is itself a node, and either or both of the child nodes can be null, meaning that a node may have zero to two children.
In the binary search tree version, each node usually stores a key/value pair. The tree is searched and ordered according to its keys. The key stored at a node is always greater than the keys stored in its left descendents and always less than the keys stored
in its right descendents. This makes searching for any particular key very fast.

Here is an example of a binary search tree. The keys are listed as numbers; the values have been omitted but are assumed to exist. Notice how each key as you descend to the left is less than the key of its predecessor, and vice versa as you descend to the
right:

Changing the value of a particular node in a non-persistent tree involves starting at the root of the tree and searching for a particular key associated with that value, and then changing the value once the node has been found. Changing a persistent tree,
on the other hand, generates a new version of the tree. We will use the same strategy in implementing a persistent binary tree as we did for the persistent singly linked list, which is to reuse as much of the data structure as possible when making a new version.

Let's change the value stored in the node with the key 7. As the search for the key leads us down the tree, we copy each node along the way. If we descend to the left, we point the previously copied node's left child to the currently copied node. The previous
node's right child continues to point to nodes in the older version. If we descend to the right, we do just the opposite.

This illustrates the "spine" of the search down the tree. The red nodes are the only nodes that need to be copied in making a new version of the tree:

You can see that the majority of the nodes do not need to be copied. Assuming the binary tree is balanced, the number of nodes that need to be copied any time a write operation is performed is at most O(Log N), where Log is base 2. This is much more efficient
than the persistent singly linked list.

Insertions and deletions work the same way, only steps should be taken to keep the tree in balance, such as using an AVL tree. If a binary tree becomes degenerate, we run into the same efficiency problems as we did with the singly linked list.

top

Random Access Lists

An interesting persistent data structure that combines the singly linked list with the binary tree is Chris Okasaki'srandom-access list. This data structure allows
for random access of its items as well as adding and removing items from the beginning of the list. It is structured as a singly linked list of completely balanced binary trees. The advantage of this data structure is that it allows access, insertion, and
removal of the head of the list in O(1) time as well as provides logarithmic performance in randomly accessing its items.

Here is a random-access list with 13 items:

When a node is added to the list, the first two root nodes (if they exist) are checked to see if they both have the same height. If so, the new node is made the parent of the first two nodes; the current head of the list is made the left child of the new
node, and the second root node is made the right child. If the first two root nodes do not have the same height, the new node is simply placed at the beginning of the list and linked to the next tree in the list.

To remove the head of the list, the root node at the beginning of the list is removed, with its left child becoming the new head and its right child becoming the root of the second tree in the list. The new head of the list is right linked with the next
root node in the list:

The algorithm for finding a node at a specific index is in two parts: in the first part, we find the tree in the list that contains the node we're looking for. In the second part, we descend into the tree to find the node itself. The following algorithm
is used to find a node in the list at a specific index:

  1. Let I be the index of the node we're looking for. Set
    T
    to the head of the list where T will be our reference to the root node of the current tree in the list we're examining.
  2. If I is equal to 0, we've found the node we're looking for; terminate algorithm. Else ifI is greater than or equal to the number of nodes in
    T, subtract the number of nodes inT from
    I
    and set T to the root of the next tree in the list and repeat step 2. Else ifI is less than the number of nodes in
    T, go to step 3.
  3. Set S to the number of nodes in T divided by 2 (the fractional part of the division is ignored. For example, if the number of nodes in the current subtree is 3,S will be 1).
  4. If I is less than S, subtract 1 from
    I
    and set T to T's left child. Else subtract (S + 1) fromI and set
    T to T's right child.
  5. If I is equal to 0, we've found the node we're looking for; terminate algorithm. Else go to step 3.

This illustrates using the algorithm to find the 10th item in the list:

Keep in mind that all operations that change a random-access list do not change the existing list but rather generate a new version representing the change. As much of the old list is reused in creating a new version.

top

Immutable Collections

Included with this article are a number of persistent collection classes I have created. These classes are in a namespace calledImmutableCollections. I have created persistent versions of some of the collection classes in theSystem.Collections
namespace. I will describe each one and some of the challenges in making them persistent. There are several collection classes that are currently missing; I need to add a queue, for example. Hopefully, I will get to those in time. Also, even though I've taken
steps to make these classes efficient, they cannot compete with theSystem.Collections classes in terms of speed, but they really aren't meant to. They are meant to provide the advantages of immutability while providing reasonable performance.

top

Stack

This one was easy. Simply create a persistent singly linked list and limit insertions and deletions to the head of the list. Since this class is persistent, popping a stack returns a new version of the stack with the next item in the old stack as the new
top. In the System.Collections.Stack version, popping the stack returns the top of the stack. The question for the persistent version was how to make the top of the stack available since it cannot be returned when the stack is popped. I chose
to create a Top property that represents the top of the stack.

top

SortedList

The SortedList uses AVL tree algorithms to keep the tree in balance. I found it useful to create anIAvlNode interface. Two classes implement this interface, the
AvlNode class and the NullAvlNode class. The NullAvlNode class implements the null object design pattern. This simplified many of the algorithms.

top

ArrayList

This is the class that proved most challenging. Like the SortedList, it uses a persistent AVL tree as its data structure. However, unlike theSortedList, items are accessed by index (or by position) rather than by key. I have to
admit that the algorithms for accessing and inserting items in a binary tree by index weren't intuitive to me, so I turned to Knuth. Specifically, I used Algorithms B and C in section 6.2.3 in volume 3 of The Art of Computer Programming.

I made an assumption about the ArrayList in order to improve performance. I assumed that theAdd method is by far the most used method. However, adding items to theArrayList one right after the other causes a lot of
tree rotations to keep the tree in balance. To solve this, I created a template tree that is already completely balanced. Since this template tree is immutable, it can exist at the class level and be shared amongst all of the instances of the class.

When an instance of the ArrayList class is created, it takes a small subtree of the template tree. As items are added, the nodes in the template tree are replaced with new nodes. Since the tree is completely balanced, no rebalancing is necessary.
If the subtree gets filled up, another subtree of equal height is taken from the template tree and joined to the existing tree. Insertions and deletions are handled normally with rebalancing performed if necessary. Again, the assumption is that adding items
to the ArrayList occurs much more frequently than inserting or deleting items.

top

Array

The Array class uses the random access list structure to provide a persistent array with logarithmic performance. Unlike a random access list, it has a fixed size.

top

RandomAccessList

This class does not have a parallel in the System.Collections namespace, but it was one of the first persistent classes I wrote, and I decided to include it here. It's a straightforward implementation of Chris Okasaki'srandom-access
list
described above. This data structure was designed to be used in functional languages where lists have three basic operations:Cons,
Head, and Tail. Cons adds an item to the head of the list,Head is the first item in the list, and
Tail represents all of the items in the list except for theHead.

top

Conclusion

Persistent data structures help simplify programming by eliminating a whole class of bugs associated with side-effects and synchronization issues. They are not a cure-all but are a useful tool for helping a programmer deal with complexity. I have explored
ways of making data structures persistent and have provided a small .NET library of persistent data structures. I hope you have enjoyed the article, and as always, I welcome feedback.

top

History

02/23/2005 - First version.

Persistent Data Structures的更多相关文章

  1. Persistent and Transient Data Structures in Clojure

    此文已由作者张佃鹏授权网易云社区发布. 欢迎访问网易云社区,了解更多网易技术产品运营经验. 最近在项目中用到了Transient数据结构,使用该数据结构对程序执行效率会有一定的提高.刚刚接触Trans ...

  2. The Swiss Army Knife of Data Structures … in C#

    "I worked up a full implementation as well but I decided that it was too complicated to post in ...

  3. Important Abstractions and Data Structures

    For Developers‎ > ‎Coding Style‎ > ‎ Important Abstractions and Data Structures 目录 1 TaskRunne ...

  4. [轉]Linux Data Structures

    Table of Contents, Show Frames, No Frames Chapter 15 Linux Data Structures This appendix lists the m ...

  5. A library of generic data structures

    A library of generic data structures including a list, array, hashtable, deque etc.. https://github. ...

  6. 剪短的python数据结构和算法的书《Data Structures and Algorithms Using Python》

    按书上练习完,就可以知道日常的用处啦 #!/usr/bin/env python # -*- coding: utf-8 -*- # learn <<Problem Solving wit ...

  7. Go Data Structures: Interfaces

    refer:http://research.swtch.com/interfaces Go Data Structures: Interfaces Posted on Tuesday, Decembe ...

  8. Choose Concurrency-Friendly Data Structures

    What is a high-performance data structure? To answer that question, we're used to applying normal co ...

  9. 无锁数据结构(Lock-Free Data Structures)

    一个星期前,我写了关于SQL Server里闩锁(Latches)和自旋锁(Spinlocks)的文章.2个同步原语(synchronization primitives)是用来保护SQL Serve ...

随机推荐

  1. 十二. 一步步破解JEB 2.0demo版二

    编写脚本批量还愿JEB 加密字符串 解密完后效果如下: 脚本源码: https://github.com/bingghost/JebPlugins 思路: 下面的该封装的基本都封装了,过程如下: 1. ...

  2. [转]MySQL 最基本的SQL语法/语句

    MySQL 最基本的SQL语法/语句,使用mysql的朋友可以参考下.   DDL-数据定义语言(Create,Alter,Drop,DECLARE) DML-数据操纵语言(Select,Delete ...

  3. JVM内存管理------垃圾搜集器参数精解

    本文是GC相关的最后一篇,这次LZ只是罗列一下hotspot JVM中垃圾搜集器相关的重点参数,以及各个参数的解释.废话不多说,这就开始. 垃圾搜集器文章传送门 JVM内存管理------JAVA语言 ...

  4. Struts2与Struts1的区别

    Struts2是基于WebWork的一个全新框架.不过有了Struts1的基础,学Struts2更方便.Struts2主要改进是取代了Struts1的Servlet和Action.Struts2的核心 ...

  5. 【XLL 框架库函数】 TempErr/TempErr12

    创建一个包含了 Excel 工作表错误的临时 XLOPER/XLOPER12 原型 LPXLOPER TempErr(WORD err); LPXLOPER12 TempErr12(BOOL err) ...

  6. Java的几种常用设计模式

    何为设计模式? 就是对一些常见问题进行归纳总结,并针对具体问题给出一套通用的解决办法(强调的是解决问题的思想): 在开发中,只要遇到这类问题,就可以直接使用这些设计模式解决问题. ---------- ...

  7. 局部加权回归、欠拟合、过拟合(Locally Weighted Linear Regression、Underfitting、Overfitting)

    欠拟合.过拟合 如下图中三个拟合模型.第一个是一个线性模型,对训练数据拟合不够好,损失函数取值较大.如图中第二个模型,如果我们在线性模型上加一个新特征项,拟合结果就会好一些.图中第三个是一个包含5阶多 ...

  8. 命令大全/cmd/bash

    端口占用及强杀 cmd命令 netstat -aon|findstr "8080" #查看占用pid tasklist|findstr "2448" #查看被哪 ...

  9. hdu 5073

    题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=5073 思路:一开始忘了排序,wa了好几发...选择区间长度为N - K的连续的数, 然后其余的K个数都 ...

  10. NOI 2010 海拔 ——平面图转对偶图

    [题目分析] 可以知道,所有的海拔是0或1 最小割转最短路,就可以啦 SPFA被卡,只能换DIJ [代码] #include <cstdio> #include <cstring&g ...