I stumbled across a survey article on hashing. Although it was written in 1996 and some of its benchmarks look trivial by today's standards, it is still well worth reading.

Link: http://www.drdobbs.com/database/hashing-rehashed/184409859

Full text:

Hashing algorithms occupy a unique place in the hearts of programmers. Discovered early on in computer science, they are among the most approachable algorithms and certainly the most enjoyable to tinker with. Moreover, the endless search for the Holy Grail, a perfect hash function for a given data set, still consumes considerable ink in each year's batch of computer-science journals. For developers who program for a living, however, recondite refinements to unusual hashing functions are of little use. In general, when you need a hash table, you want a quick, convenient hashing function. And unless you know where to get one, you will almost certainly fall prey to the fallacy that you can write and optimize one quickly for your particular application.

  In this article, I will discuss a good, general-purpose hashing algorithm and then analyze some of the standard wisdom on how hashing can be made more efficient. Finally, I'll examine the surprising effect of hardware advances on hashing--in particular, the widening gap between CPU performance and memory-access latency.

Basics

  Hash tables are just like most tables (the old-fashioned word for "arrays"). A table becomes a hash table when nonsequential access to a given element is achieved through the use of a hash function. A hash function is generally handed a data element, which it converts to an integer that is used as an index into the table. For example, if you had a hash table into which you wanted to store dates, you might take the Julian date (a positive integer assigned to every date since January 1, 4713 bc) and divide it by the size of the hash table. The remainder, termed the "hash key," would be your index into the hash table.
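As a minimal sketch of this idea (the table size and the Julian day number below are illustrative, not from the article):

```c
#define TABLE_SIZE 101   /* number of slots; a prime works best, as discussed later */

/* Map a Julian day number to a table index: the remainder after
 * dividing by the table size is the hash key. */
unsigned hash_date(unsigned long julian_day)
{
    return (unsigned)(julian_day % TABLE_SIZE);
}
```

With a 101-slot table, for instance, Julian day 2,450,000 lands in slot 43.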

  In a closed hash table, the hash key is the index of the element into which you store the date and the data associated with that date. If the slot in the table is occupied by another date (this is termed a "collision"), you can either generate another hash key (called "nonlinear rehashing") or step through the table until you find an available slot (linear rehashing).
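A minimal sketch of linear rehashing in a closed table, assuming a table of Julian dates in which zero marks an empty slot (this is illustrative code, not from the article):

```c
#include <stddef.h>

#define TABLE_SIZE 101                     /* prime number of slots */

static unsigned long table[TABLE_SIZE];    /* 0 marks an empty slot */

/* Insert with linear rehashing: on a collision, step through the
 * table (wrapping around) until an empty slot is found.
 * Returns the slot used, or -1 if the table is full. */
int insert_linear(unsigned long julian_day)
{
    size_t start = julian_day % TABLE_SIZE;
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++) {
        size_t slot = (start + i) % TABLE_SIZE;
        if (table[slot] == 0 || table[slot] == julian_day) {
            table[slot] = julian_day;
            return (int)slot;
        }
    }
    return -1;                             /* every slot occupied */
}
```

Inserting 43 and then 144 (both of which hash to slot 43 in a 101-slot table) shows the collision being resolved by stepping to slot 44.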

  In an open hash table (by far the most common), each element in the table is the head of a linked list of data items. In this model, a collision is handled by adding the colliding element to the linked list that begins at the slot in the table that both keys reference. This approach has the advantage that the table does not implicitly limit the number of elements it can hold, whereas the closed table cannot accept more data items than it has elements.
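A sketch of insertion into an open (chained) table; the stand-in hash function here is illustrative, and in practice you would substitute HashPJW() from Example 1:

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 101   /* prime number of slots */

struct wnode {
    char         *word;
    unsigned      count;
    struct wnode *next;
};

static struct wnode *heads[TABLE_SIZE];   /* each slot heads a linked list */

/* Stand-in hash for illustration; in practice use HashPJW() (Example 1). */
static unsigned str_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Add a word: colliding words simply join the list at the shared slot. */
void insert_chained(const char *word)
{
    unsigned slot = str_hash(word) % TABLE_SIZE;
    struct wnode *n;

    for (n = heads[slot]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0) {
            n->count++;                    /* already present: bump count */
            return;
        }
    n = malloc(sizeof *n);
    n->word = malloc(strlen(word) + 1);
    strcpy(n->word, word);
    n->count = 1;
    n->next = heads[slot];                 /* prepend to the chain */
    heads[slot] = n;
}

/* How many times has word been inserted? Zero if never seen. */
unsigned lookup_count(const char *word)
{
    const struct wnode *n;
    for (n = heads[str_hash(word) % TABLE_SIZE]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return n->count;
    return 0;
}
```

Because each slot is merely a list head, the table never refuses an insertion; a collision only lengthens a chain.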

Hash Functions 

  Hashing is useful anywhere a wide range of possible values needs to be stored in a small amount of memory and be retrieved with simple, near-random access. As such, hash tables occur frequently in database engines and especially in language processors such as compilers and assemblers, where they elegantly serve as symbol tables. In such applications, the table is a principal data structure. Because all accesses to the table must be done through the hash function, the function must fulfill two requirements: It must be fast and it must do a good job distributing keys throughout the table. The latter requirement minimizes collisions and prevents data items with similar values from hashing to just one part of the table.

  In most programs, the hash function occupies an insignificant amount of the execution time. In fact, it is difficult to write a useful program in which the hash function occupies as much as 1 percent of the execution time; if your profiler tells you otherwise, you need to change hash functions. I present here two excellent, general-purpose hash functions. HashPJW() (see Example 1), the better known of the two, is based on work by Peter J. Weinberger of AT&T Bell Labs. ElfHash() (Example 2) is a variant of HashPJW() used in UNIX object files in the ELF format. Both functions accept a pointer to a string and return an integer; the final hash key is the remainder of dividing the returned integer by the size of the table. The portable version of Weinberger's algorithm in Example 1 is attributable to Allen Holub. Unfortunately, Holub's version contains an error that was picked up in subsequent texts (including the first printing of Practical Algorithms for Programmers, by John Rex and myself). The bug does not prevent the code from working; it simply gives suboptimal hashes. The version in Example 1 is correct. Example 2 is taken from the "System V Application Binary Interface" specification. This algorithm is occasionally misprinted (see the Programming in Standard C manual for UnixWare, for instance) in a form that gives very bad results. The version presented here is correct.

Example 1: HashPJW(). An adaptation of Peter Weinberger's generic hashing algorithm.

/*--- HashPJW ---------------------------------------------------
 * An adaptation of Peter Weinberger's (PJW) generic hashing
 * algorithm based on Allen Holub's version. Accepts a pointer
 * to a datum to be hashed and returns an unsigned integer.
 *-------------------------------------------------------------*/
#include <limits.h>

#define BITS_IN_int     ( sizeof(int) * CHAR_BIT )
#define THREE_QUARTERS  ((int) ((BITS_IN_int * 3) / 4))
#define ONE_EIGHTH      ((int) (BITS_IN_int / 8))
#define HIGH_BITS       ( ~((unsigned int)(~0) >> ONE_EIGHTH ))

unsigned int HashPJW ( const char * datum )
{
    unsigned int hash_value, i;

    for ( hash_value = 0; *datum; ++datum )
    {
        hash_value = ( hash_value << ONE_EIGHTH ) + *datum;
        if (( i = hash_value & HIGH_BITS ) != 0 )
            hash_value = ( hash_value ^ ( i >> THREE_QUARTERS )) & ~HIGH_BITS;
    }
    return ( hash_value );
}

  Example 2: ElfHash(). The published hash algorithm used in the UNIX ELF format for object files.

/*--- ElfHash ---------------------------------------------------
* The published hash algorithm used in the UNIX ELF format
* for object files. Accepts a pointer to a string to be hashed
* and returns an unsigned long.
*-------------------------------------------------------------*/
unsigned long ElfHash ( const unsigned char *name )
{
    unsigned long h = 0, g;

    while ( *name )
    {
        h = ( h << 4 ) + *name++;
        if (( g = h & 0xF0000000 ) != 0 )
            h ^= g >> 24;
        h &= ~g;        /* note: executes on every iteration */
    }
    return h;
}


  Both functions attain their speed from cheap shift and mask operations: apart from a single addition, no arithmetic is performed for each byte, and the input data is dereferenced just once per byte.

  Because a program spends so little time hashing, you should not tinker with an algorithm to optimize its performance. You should focus instead on the quality of its hashes. To do this, you will need a small program that will hash a data set and print statistics on the quality of the hash. The wordlist.exe program (a self-extracting source-code file) is available electronically (see code link at the top of the page) and will test your functions. The file also contains some useful test data sets. The program reads a text file and stores every unique word and a count of its occurrences into a hash table. A variety of command-line switches (all documented) allow you to specify the size of the hash table, print table statistics, and even print all the unique words and their counts.
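The statistics such a test program reports can be derived directly from per-slot chain lengths; a sketch (this is illustrative code, not the actual wordlist.exe source, which is only available from the download link):

```c
#include <stddef.h>

/* Summary statistics for an open hash table, computed from an array
 * holding the chain length at each slot. */
struct hash_stats {
    double   occupancy;   /* fraction of slots with at least one entry */
    double   avg_chain;   /* average length of the non-empty chains */
    unsigned max_chain;   /* longest chain in the table */
};

struct hash_stats compute_stats(const unsigned *chain_len, size_t slots)
{
    struct hash_stats s = { 0.0, 0.0, 0 };
    size_t i, used = 0, items = 0;

    for (i = 0; i < slots; i++) {
        if (chain_len[i] > 0)
            used++;
        items += chain_len[i];
        if (chain_len[i] > s.max_chain)
            s.max_chain = chain_len[i];
    }
    s.occupancy = slots ? (double)used / slots : 0.0;
    s.avg_chain = used ? (double)items / used : 0.0;
    return s;
}
```

These are exactly the figures quoted in the examples below: percentage of slots used, average chain length, and maximum chain length.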

  This test program shows that the two algorithms mentioned previously do a superb job distributing hash values throughout the table. In 16-bit environments, HashPJW() is marginally better at hashing text while ElfHash() is significantly better at hashing numbers. In 32-bit environments, the algorithms are identical.

Performance Concerns

  Because these algorithms are already fast, there is little point in optimizing them for speed; their hashes, however, deserve attention if they are of poor quality. Even that attention should be viewed in context. Even hash-table-intensive programs (compilers do not figure in this group) will rarely be improved by more than 4 or 5 percent by migrating from a so-so implementation to a perfect hash algorithm. So if going from poor to best under the most favorable circumstances yields only a 5 percent improvement, common sense suggests that tuning hash functions should be one of the very last optimizations you perform. This is doubly true because hash tuning requires considerable empirical testing: You do not want to improve performance on one data set while degrading it significantly on another.

  Again, unless you have exceptional needs, use one of the aforementioned functions. Several aspects of your hash table can be tuned easily to improve performance, however.

  The first of these is to make sure the table has a prime number of slots. As stated previously, the final hash index is the remainder of dividing the hash value by the table size. The widest dispersal of remainders occurs when the hash value and the table size are relatively prime (they share no common factors), which is most easily assured by making the table size prime. For example, hashing the Bible, which consists of 42,829 unique words, into an open hash table with 30,241 elements (a prime number) shows that 76.6 percent of the slots were used and that the average chain was 1.85 words long (with a maximum of 6). The same file run into a hash table of 30,240 elements (evenly divisible by the integers 2 through 9) fills only 60.7 percent of the slots, and the average chain is 2.33 words long (maximum: 10). This is a rather substantial improvement for adding only a single element to make the table size prime.
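Choosing a prime table size can be automated with simple trial division, which is plenty fast for a one-off sizing decision; a sketch:

```c
#include <stddef.h>

/* Trial-division primality test; fine for one-time table sizing. */
static int is_prime(size_t n)
{
    size_t d;
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (d = 3; d * d <= n; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime greater than or equal to n. */
size_t next_prime(size_t n)
{
    while (!is_prime(n))
        n++;
    return n;
}
```

Applied to the example above, next_prime(30240) yields 30241, the table size the article recommends.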

  A second easy improvement is increasing the size of the hash table. This is one of the classic memory-for-speed trade-offs and is effective as long as the data set has more elements than your hash table. (I'm still talking about open hash tables here; closed hash tables are discussed shortly.) Going back to the example of the Bible, when it was hashed (with HashPJW()) into a table of 7561 elements, 99.7 percent of the slots were occupied and the average chain was 5.68 words long; at 15,121 elements, occupancy was 93.9 percent and the average chain was 3.01 words. The values for a table of 30,241 elements were shown previously. In terms of performance for this example, quadrupling the table size nearly halved the chain length. Consistent with the (overall) small role played by hash operations, the program's timed performance improved by only 1.7 percent. However, after all other optimizations are performed, this difference could be meaningful.

  Finally, be careful with the construction of the linked lists emanating from the table. Significant savings can be attained by not allocating memory each time a data element is added, but instead drawing from a pool of nodes allocated as a single block. Likewise, the lists should be ordered, either by ascending value or by recency of access. Compiler symbol tables, for example, often place the most recently accessed node at the head of the list, on the theory that references to a variable tend to be grouped together. This holds especially in languages such as C and Pascal, which have local variables.
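A sketch of both techniques follows: a node pool allocated as a single block, plus move-to-front reordering on a successful search. The names and structures here are illustrative, not from the article:

```c
#include <stdlib.h>

struct lnode {
    int           key;
    struct lnode *next;
};

/* A node pool: one allocation serves many nodes, avoiding a malloc()
 * call for every element added to the table. */
struct node_pool {
    struct lnode *block;
    size_t        next, capacity;
};

int pool_init(struct node_pool *p, size_t capacity)
{
    p->block = malloc(capacity * sizeof *p->block);
    p->next = 0;
    p->capacity = capacity;
    return p->block != NULL;
}

struct lnode *pool_alloc(struct node_pool *p)
{
    return p->next < p->capacity ? &p->block[p->next++] : NULL;
}

/* Search a chain and, on a hit, move the found node to the head of
 * the list, so clustered references to it are satisfied quickly. */
struct lnode *find_mtf(struct lnode **head, int key)
{
    struct lnode **pp;
    for (pp = head; *pp != NULL; pp = &(*pp)->next)
        if ((*pp)->key == key) {
            struct lnode *n = *pp;
            *pp = n->next;        /* unlink from current position */
            n->next = *head;      /* relink at the head */
            *head = n;
            return n;
        }
    return NULL;
}
```

After a successful find_mtf(), the found node sits at the head of the chain, so an immediately repeated lookup terminates on the first comparison.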

Closed Hash Tables

As mentioned previously, the two primary methods of resolving hashing collisions in closed tables are linear rehashing and nonlinear rehashing. The idea behind nonlinear rehashing is that it helps disperse colliding values throughout the table.
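One common form of nonlinear rehashing is double hashing, in which a second hash of the key supplies the probe step. The article does not specify a particular nonlinear scheme, so this sketch is only one possibility:

```c
#include <stddef.h>

#define TABLE_SIZE 31   /* prime, so any nonzero step visits every slot */

/* Position of the i-th probe under double hashing: the first hash
 * picks the starting slot, the second supplies the step size. */
size_t probe(unsigned long key, size_t i)
{
    size_t h1 = key % TABLE_SIZE;                /* starting slot */
    size_t h2 = 1 + (key % (TABLE_SIZE - 1));    /* step, never zero */
    return (h1 + i * h2) % TABLE_SIZE;
}
```

Because the step depends on the key, two keys that collide on their first probe usually diverge on their second, which is precisely the dispersal nonlinear rehashing aims for.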

Closed hash tables are particularly effective when the maximum size of the incoming data set is known and a table with many more elements can be allocated. It has been proven that once a closed table becomes more than 50 percent full, performance deteriorates significantly (see Data Structures and Program Design in C, by Robert Kruse, Bruce Leung, and Clovis Tondo, Prentice Hall, 1991). Closed tables are commonly used for rapid prototyping (they're easy to code quickly) and where fast access (no linked lists) is a priority and memory is easily available.

Traditional computer science has found that, given equal load (the ratio of occupied slots to table size), tables tend to perform better by using nonlinear rehashing than by stepping through the table. This view has generally been accepted without much debate because the math behind it is fairly compelling.

However, hardware advances (of all things!) are beginning to change the math in favor of linear rehashing. Specifically, the reasons are memory latency and CPU cache construction.

Memory Latency

  High-end CPUs today commonly operate at speeds of 250 MHz or more; at 250 MHz, each clock cycle takes 4 nanoseconds. By comparison, the fastest widely available dynamic RAM runs at 60 nanoseconds. In effect, this means that a single memory fetch is approximately 15 times slower than a CPU instruction. From this large disparity, you can see that your programs should avoid memory fetches as much as possible.

  Most CPUs today have data caches on board, and the more your programs use this memory, rather than general DRAM, the better they will perform. As with all caches, their benefit accrues only if the data you need to access is already in the cache (when it is, you have a cache "hit"). CPU caches tend to hold chunks of data brought in over previous operations, and these chunks tend to include data past the byte or bytes used for the one instruction. As a result, cache hits will increase if you access adjoining bytes in subsequent instructions. Hence, you will have far more cache hits if you use linear rehashing and simply query the adjoining table element than if you use nonlinear rehashing.

  How significant is this difference? The program memtime.c (available electronically, see the top of the page) shows you the effects of memory latency. It allocates a programmer-defined block and performs as many reads as there are bytes. In one pass, it does the reads randomly; in the second, it does the reads sequentially. Table 1 presents the results for 10 million reads into a 10-MB table.

Table 1: The results for 10 million reads into a 10-MB table.

Processor/Memory                 Sequential    Random
100-MHz Pentium/60-ns memory     1 second      7 seconds
66-MHz Cyrix 486/70-ns memory    2 seconds     9 seconds

  Predictably, the ratio becomes even more unfavorable when the program runs on operating systems with paged memory, such as UNIX. When the previous program was run on a 486/66-MHz PC with 70-ns memory running UnixWare, the sequential reads took 3 seconds and the random reads took 35 seconds.

  The point is clear: Memory accesses that occur together should be located together.

  You could argue that saving a handful of seconds across 10 million memory accesses is hardly worth the effort. In many cases this will be true. Nonetheless, coding for this optimization may often be a simple matter of choosing between equal efforts (as in the case of rehashing approaches). Moreover, every indication suggests that CPU speed will continue to increase, while DRAM access speeds have remained stable for several years. As time passes, the benefits of considering memory latency will increase.

  In the specific case of rehashing, the linear approach has an additional benefit: A new hash key does not have to be computed. So, for closed hash tables, keep them sparsely populated and use linear rehashing.

  Memory latency is bound to become more significant in other areas of programming. Bear this in mind as you code.
