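An Analysis of the Dictionary Implementation in Python 2 [Translation]

This article describes how dictionaries are implemented in Python 2: how hash collisions are resolved, what the underlying C structures look like, and how items are inserted and removed. It is translated from python-dictionary-implementation, with some of the translator's own notes added.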

Using dictionaries

A dictionary is indexed by key, so it can be thought of as an associative array.

Let's add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}

>>> d['c'] = 3

>>> d

{'a': 1, 'b': 2, 'c': 3}

Values are accessed by key as shown here. Accessing a key that does not exist raises a KeyError exception:

>>> d['a']

1

>>> d['b']

2

>>> d['c']

3

>>> d['d']

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

KeyError: 'd'

Hash Tables

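Python dictionaries are implemented as hash tables. A hash table is essentially an array, and a hash function applied to the key yields the index into that array.

The hash function should spread keys evenly across the array; a good hash function minimizes the number of hash collisions.

Hash collision: different keys that produce the same result when passed through the hash function.

The following example applies hash to int and string objects: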

>>> map(hash, (0, 1, 2, 3))

[0, 1, 2, 3]

>>> map(hash, ("namea", "nameb", "namec", "named"))

[-1658398457, -1658398460, -1658398459, -1658398462]

Taking strings as the example, here is a brief pseudocode outline of how the hash function works internally:

arguments: string object

returns: hash

function string_hash:

    if hash cached:

        return it

    set len to string's length

    initialize var p pointing to 1st char of string object

    set x to value pointed by p left shifted by 7 bits

    while len > 0:

        set var x to (1000003 * x) xor value pointed by p

        increment pointer p

        decrement len

    set x to x xor length of string object

    cache x as the hash so we don't need to calculate it again

    return x as the hash
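The pseudocode above can be turned into a small runnable sketch. The function name string_hash_py2 and the bits parameter are hypothetical names chosen for illustration; the sketch assumes 8-bit characters, no hash randomization, and it leaves out the caching step. It is an approximation of the idea, not a copy of the CPython source.

```python
def string_hash_py2(s, bits=64):
    """Sketch of the Python 2 string hash described above (caching omitted)."""
    if not s:
        return 0  # an empty string hashes to 0
    mask = (1 << bits) - 1  # emulate a fixed-width C long
    x = (ord(s[0]) << 7) & mask
    for ch in s:
        # multiply, xor in the next character, and wrap to the word size
        x = ((1000003 * x) ^ ord(ch)) & mask
    x ^= len(s)
    # reinterpret the unsigned value as a signed C long
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    # CPython reserves -1 as an error marker, so -1 becomes -2
    if x == -1:
        x = -2
    return x


print(string_hash_py2("a"))  # 12416037344
print(string_hash_py2("b"))  # 12544037731
```

For single-character keys the result matches the 64-bit values used later in this article, for example 12416037344 for 'a'.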

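On a 64-bit machine, calling hash('a') invokes string_hash() and returns a result such as 12416037344.

If the array that stores the key/value pairs has size x (always a power of two), then x-1 can be used as a mask to compute the slot index. With an array of size 8, the index for 'a' is hash('a') & 7 = 0. Similarly, the index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3. 'z' and 'b' end up at the same index, which is the hash collision mentioned earlier.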
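The masking step can be checked directly. The sketch below uses the 64-bit Python 2 hash values that appear in the insertion walkthrough later in this article; the variable names are only illustrative.

```python
TABLE_SIZE = 8
MASK = TABLE_SIZE - 1  # 0b111, valid because the size is a power of two

# 64-bit Python 2 hash values quoted in this article
example_hashes = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}

# keep only the low bits of each hash to get the slot index
slots = {key: h & MASK for key, h in example_hashes.items()}
print(slots)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}
```

'b' and 'z' both map to index 3, so one of them has to be placed elsewhere.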

Translator's note: the same experiment was repeated with 64-bit Python 3.6:

print(hash('a') & 7)

print(hash('b') & 7)

print(hash('c') & 7)

print(hash('z') & 7)

# first result

2

2

4

3

###################

print(hash('a') & 7)

print(hash('b') & 7)

print(hash('c') & 7)

print(hash('z') & 7)

# second result

6

0

4

6

# and so on

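The hash results, and therefore the collisions, differ between runs on the same data. This is because Python 3 randomizes string hashes: a random seed is chosen when the interpreter starts (unless PYTHONHASHSEED is set), so the values are stable within one process but change from one run to the next. Generally speaking, collisions are less likely when the keys are consecutive values and more likely otherwise.

A linked list could be used to store the key/value pairs that share the same hash, but that would increase lookup time, which would no longer be O(1) on average. The next section briefly describes how Python resolves hash collisions instead.

Resolving hash collisions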

Open addressing

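Open addressing is a collision resolution method that uses probing. In the case of 'z' in the example, index 3 is already occupied, so the table has to be probed for another slot that is not in use. Adding and looking up a key/value pair both take O(1) time on average.

A quadratic probing sequence is used to find a free slot. The code is the following: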

j = (5*j) + 1 + perturb;

perturb >>= PERTURB_SHIFT;

use j % 2**i as the next table index;

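Recurring on 5*j+1 quickly magnifies small differences in the bits that did not affect the initial index. The variable perturb brings the other bits of the hash value into play.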
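The recurrence can be explored with a short generator. This is a sketch under a few assumptions: PERTURB_SHIFT is 5, the hash is treated as an unsigned 64-bit value, and the function name probe_indices is made up for this example.

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_indices(hash_value, table_size):
    """Yield the slot indices visited for hash_value (table_size is a power of two)."""
    mask = table_size - 1
    perturb = hash_value % (1 << 64)  # treat the hash as an unsigned 64-bit value
    j = hash_value & mask
    yield j
    while True:
        # only the low bits of j matter for the index, so j can be masked right away
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
        yield j


# 'z' in a table of size 8, using the hash value quoted in this article
print(list(islice(probe_indices(15616046971, 8), 4)))  # [3, 3, 3, 5]

# once perturb has dropped to 0, the 5*j+1 recurrence visits every slot
print(list(islice(probe_indices(0, 8), 8)))  # [0, 1, 6, 7, 4, 5, 2, 3]
```

The first call shows why 'z' ends up in slot 5: the probes keep returning to the occupied slot 3 until enough higher bits of the hash have been mixed in.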

Just out of curiosity, here is the probing sequence for a table of size 32 and j = 3:

3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

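The source code of the implementation is in dictobject.c, and a detailed explanation of the probing sequence can be found at the top of that file. The next section briefly describes the C structures behind a dictionary.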

Dictionary C structures

The following C structure describes a dictionary entry: the key, the value, and the hash of the key. PyObject is the base type of all Python objects.

typedef struct {

    Py_ssize_t me_hash;

    PyObject *me_key;

    PyObject *me_value;

} PyDictEntry;

The following structure represents a dictionary object:

typedef struct _dictobject PyDictObject;

struct _dictobject {

    PyObject_HEAD

    Py_ssize_t ma_fill;

    Py_ssize_t ma_used;

    Py_ssize_t ma_mask;

    PyDictEntry *ma_table;

    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);

    PyDictEntry ma_smalltable[PyDict_MINSIZE];

};

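ma_fill is the number of active slots plus the number of dummy slots.
ma_used is the number of active (in use) slots.
ma_mask is the size of the array minus 1; it is used to compute the slot index.
ma_table is the array of entries.
ma_smalltable is the initial array, of size 8.

dummy: when a key/value pair is removed, the slot it occupied is marked as dummy.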

Dictionary initialization

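When a dictionary is created, the function PyDict_New() is called. Some of the Python source code has been left out here and the rest is replaced with pseudocode.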

returns new dictionary object

function PyDict_New:

    allocate new dictionary object

    clear dictionary's table

    set dictionary's number of used slots + dummy slots (ma_fill) to 0

    set dictionary's number of active slots (ma_used) to 0

    set dictionary's mask (ma_mask) to dictionary size - 1 = 7

    set dictionary's lookup function to lookdict_string

    return allocated dictionary object

Add items

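When a new key/value pair is added, PyDict_SetItem() is called. It takes a pointer to the dictionary object plus the key and the value. It checks whether the key is a string and then either reuses the cached hash or computes it. insertdict() is called to add the new pair, and the dictionary is resized when ma_fill (the number of used slots plus the number of slots marked dummy) exceeds 2/3 of the table size. The 2/3 threshold ensures that the probing sequence can find a free slot quickly enough.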

arguments: dictionary, key, value

returns: 0 if OK or -1

function PyDict_SetItem:

    if key's hash cached:

        use hash

    else:

        calculate hash

    call insertdict with dictionary object, key, hash and value

    if key/value pair added successfully and capacity over 2/3:

        call dictresize to resize dictionary's table

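insertdict() uses lookdict_string() to find a slot that can be used. This is the same function that is used to look up a key. lookdict_string() computes the slot index from the hash and the mask. If the key is not found at index = hash & mask, it keeps probing in a loop, using the sequence described above, until it finds a free slot. When the probing reaches a slot whose key is null, it returns the first dummy slot it passed during the lookup, if there was one. This gives priority to reusing previously deleted slots.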
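The dummy handling can be sketched in a few lines. This is a simplified model, not the C code: entries are (hash, key, value) tuples, None stands for a never-used slot, DUMMY is a hypothetical sentinel object, and find_slot is a made-up name. It assumes the table always has at least one never-used slot, which the 2/3 rule guarantees.

```python
PERTURB_SHIFT = 5
DUMMY = object()  # hypothetical marker for a deleted slot


def find_slot(table, key, hash_value):
    """Return the index holding key, or the index where key should be inserted."""
    mask = len(table) - 1
    i = hash_value & mask
    perturb = hash_value
    first_dummy = None
    while True:
        entry = table[i]
        if entry is None:
            # key is absent: prefer a dummy slot seen on the way, if any
            return i if first_dummy is None else first_dummy
        if entry is DUMMY:
            if first_dummy is None:
                first_dummy = i
        elif entry[0] == hash_value and entry[1] == key:
            return i
        i = (5 * i + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT


HASH_B, HASH_Z = 12544037731, 15616046971  # values quoted in this article
table = [None] * 8
table[3] = (HASH_B, "b", 2)
table[5] = (HASH_Z, "z", 26)

table[3] = DUMMY  # simulate deleting 'b'
print(find_slot(table, "z", HASH_Z))  # 5: the probe walks past the dummy slot
print(find_slot(table, "b", HASH_B))  # 3: the dummy slot is reused
```

Marking the slot as dummy instead of clearing it is what keeps 'z' reachable: a cleared slot 3 would end the probe before it gets to slot 5.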

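Let's look at a concrete example.

We want to add the following key/value pairs to a dictionary: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}.

A dictionary structure is allocated with an internal table of size 8.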

* PyDict_SetItem: key = 'a', value = 1

    hash = hash('a') = 12416037344

    insertdict

        lookdict_string

            slot index = hash & mask = 12416037344 & 7 = 0

            slot 0 is not used so return it

        init entry at index 0 with key, value and hash

        ma_used = 1, ma_fill = 1

* PyDict_SetItem: key = 'b', value = 2

    hash = hash('b') = 12544037731

    insertdict

        lookdict_string

            slot index = hash & mask = 12544037731 & 7 = 3

            slot 3 is not used so return it

        init entry at index 3 with key, value and hash

        ma_used = 2, ma_fill = 2

* PyDict_SetItem: key = 'z', value = 26

    hash = hash('z') = 15616046971

    insertdict

        lookdict_string

            slot index = hash & mask = 15616046971 & 7 = 3

            slot 3 is used so probe for a different slot: 5 is free

        init entry at index 5 with key, value and hash

        ma_used = 3, ma_fill = 3

* PyDict_SetItem: key = 'y', value = 25

    hash = hash('y') = 15488046584

    insertdict

        lookdict_string

            slot index = hash & mask = 15488046584 & 7 = 0

            slot 0 is used so probe for a different slot: 1 is free

        init entry at index 1 with key, value and hash

        ma_used = 4, ma_fill = 4

* PyDict_SetItem: key = 'c', value = 3

    hash = hash('c') = 12672038114

    insertdict

        lookdict_string

            slot index = hash & mask = 12672038114 & 7 = 2

            slot 2 is free so return it

        init entry at index 2 with key, value and hash

        ma_used = 5, ma_fill = 5

* PyDict_SetItem: key = 'x', value = 24

    hash = hash('x') = 15360046201

    insertdict

        lookdict_string

            slot index = hash & mask = 15360046201 & 7 = 1

            slot 1 is used so probe for a different slot: 7 is free

        init entry at index 7 with key, value and hash

        ma_used = 6, ma_fill = 6
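The walkthrough above can be replayed with a small simulation. It is only a sketch: the insert function is a hypothetical helper, deletions are ignored, and the comparison fill * 3 >= size * 2 is an assumption about how the 2/3 rule is expressed.

```python
PERTURB_SHIFT = 5


def insert(table, key, hash_value, value):
    """Place (hash, key, value) in the first free probed slot and return its index."""
    mask = len(table) - 1
    i = hash_value & mask
    perturb = hash_value
    while table[i] is not None and table[i][1] != key:
        i = (5 * i + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
    table[i] = (hash_value, key, value)
    return i


# keys, 64-bit Python 2 hashes and values from the walkthrough
pairs = [
    ("a", 12416037344, 1),
    ("b", 12544037731, 2),
    ("z", 15616046971, 26),
    ("y", 15488046584, 25),
    ("c", 12672038114, 3),
    ("x", 15360046201, 24),
]

table = [None] * 8
layout = {key: insert(table, key, h, value) for key, h, value in pairs}
print(layout)  # {'a': 0, 'b': 3, 'z': 5, 'y': 1, 'c': 2, 'x': 7}

ma_fill = sum(entry is not None for entry in table)
needs_resize = ma_fill * 3 >= len(table) * 2  # assumed form of the 2/3 check
print(ma_fill, needs_resize)  # 6 True
```

The resulting slot indices agree with the ones listed in the walkthrough.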

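At this point 6 of the 8 slots are in use, which is more than 2/3 of the capacity of the array. dictresize() is called to allocate a larger array. It also copies the existing entries into the newly allocated array.

In this example dictresize() is called with minused=24, because the rule is 4 * ma_used. Once ma_used exceeds 50000, the rule becomes 2 * ma_used.

The factor of 4 is used because it reduces the number of resize steps and makes the array sparser.

The new hash table has to be larger than 24 slots. The size is found by shifting the current size left by one bit until the result is greater than 24 (8 -> 16 -> 32).

After the resize, a new table of size 32 has been allocated and the entries of the old table have been inserted into it. The slot of each entry is computed by ANDing its hash with the new mask, 31.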
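The size calculation and the re-insertion can be sketched as follows. The function names are hypothetical, and the loop condition (keep doubling while the size is not yet greater than minused) is taken from the description above rather than copied from the C source.

```python
PYDICT_MINSIZE = 8


def new_table_size(minused):
    """Double the size, starting from 8, until it is greater than minused."""
    size = PYDICT_MINSIZE
    while size <= minused:
        size <<= 1
    return size


def resize_target(ma_used):
    """Apply the 4 * ma_used rule (2 * ma_used above 50000 entries)."""
    factor = 2 if ma_used > 50000 else 4
    return new_table_size(factor * ma_used)


hashes = {
    "a": 12416037344,
    "b": 12544037731,
    "z": 15616046971,
    "y": 15488046584,
    "c": 12672038114,
    "x": 15360046201,
}

size = resize_target(len(hashes))
new_slots = {key: h & (size - 1) for key, h in hashes.items()}
print(size)  # 32
print(new_slots)  # {'a': 0, 'b': 3, 'z': 27, 'y': 24, 'c': 2, 'x': 25}
```

With the larger mask the six keys no longer collide, so no probing is needed when they are re-inserted.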

Removing items

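PyDict_DelItem() is called to remove an entry. The hash of the key is computed and passed to the lookup function, and the slot that is found is then marked as dummy.

Suppose we remove the key 'c' from the dictionary; its slot is then marked as dummy.

Note that deleting an item does not trigger a resize of the array, even when the number of used slots is far smaller than the total number of slots. A resize is only considered when a key/value pair is added, based on ma_fill (the number of used slots plus the number of dummy slots).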
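The effect of a deletion on the counters can be shown with a toy model. The class SlotTable and its methods are hypothetical, probing is left out, and the slot indices are the ones obtained by masking the article's hash values with 31.

```python
DUMMY = object()  # hypothetical marker for a deleted slot


class SlotTable:
    """Toy model that only tracks slots, ma_fill and ma_used (no probing)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.ma_fill = 0  # active + dummy slots
        self.ma_used = 0  # active slots only

    def store(self, index, key, value):
        slot = self.slots[index]
        if slot is None:
            self.ma_fill += 1  # a never-used slot becomes occupied
        if slot is None or slot is DUMMY:
            self.ma_used += 1
        self.slots[index] = (key, value)

    def delete(self, index):
        slot = self.slots[index]
        if slot is None or slot is DUMMY:
            raise KeyError(index)
        self.slots[index] = DUMMY  # the slot stays counted in ma_fill
        self.ma_used -= 1


d = SlotTable(32)
for index, key, value in [(0, "a", 1), (3, "b", 2), (27, "z", 26),
                          (24, "y", 25), (2, "c", 3), (25, "x", 24)]:
    d.store(index, key, value)

d.delete(2)  # remove 'c'
print(d.ma_used, d.ma_fill, len(d.slots))  # 5 6 32
```

ma_used drops to 5 while ma_fill stays at 6 and the table keeps its 32 slots, which is why a dictionary that has shrunk a lot does not release its table on deletion.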