A Hash Table in C (Part 3): The Legendary Blizzard Version

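I have already written two sets of study notes on implementing hash tables in C, but the implementation with the most legendary reputation online is probably the hashTable in the file archive manager of Blizzard's Warcraft. It resolves collisions by linear probing. Every insert and lookup computes three hashes: the first selects the slot where probing starts, and the other two are used for verification, which greatly reduces the chance of a false match.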

I found the following code online, though I cannot vouch for how authentic its source is:

StringHash.h

#pragma once

#include <string>

using namespace std;   // kept from the original listing; better avoided in a header

#define MAXTABLELEN 1024    // default size of the hash index table

//////////////////////////////////////////////////////////////////////////
// One entry of the hash index table
typedef struct _HASHTABLE
{
    long nHashA;    // first check hash
    long nHashB;    // second check hash
    bool bExists;   // true once the slot is occupied
} HASHTABLE, *PHASHTABLE;

class StringHash
{
public:
    StringHash(const long nTableLength = MAXTABLELEN);
    ~StringHash(void);
private:
    unsigned long cryptTable[0x500];   // lookup table used by HashString
    unsigned long m_tablelength;       // length of the hash index table
    HASHTABLE *m_HashIndexTable;
private:
    void InitCryptTable();             // fill cryptTable
    unsigned long HashString(const string &lpszString, unsigned long dwHashType); // compute a hash value
public:
    bool Hash(string url);               // insert url into the table
    unsigned long Hashed(string url);    // check whether url has already been hashed
};

StringHash.cpp

#include "StdAfx.h"
#include "StringHash.h"

#include <cctype>    // toupper

StringHash::StringHash(const long nTableLength /*= MAXTABLELEN*/)
{
    InitCryptTable();
    m_tablelength = nTableLength;
    // Initialize the hash index table: every slot starts out empty
    m_HashIndexTable = new HASHTABLE[nTableLength];
    for ( long i = 0; i < nTableLength; i++ )
    {
        m_HashIndexTable[i].nHashA = -1;
        m_HashIndexTable[i].nHashB = -1;
        m_HashIndexTable[i].bExists = false;
    }
}

StringHash::~StringHash(void)
{
    // Release the table
    if ( NULL != m_HashIndexTable )
    {
        delete []m_HashIndexTable;
        m_HashIndexTable = NULL;
        m_tablelength = 0;
    }
}

/************************************************************************/
/* Function: InitCryptTable                                             */
/* Purpose : fill cryptTable, the lookup table used by HashString       */
/* Returns : nothing                                                    */
/************************************************************************/
void StringHash::InitCryptTable()
{
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

    for( index1 = 0; index1 < 0x100; index1++ )
    {
        // Each pass fills one entry in each of the five 0x100-entry slices
        for( index2 = index1, i = 0; i < 5; i++, index2 += 0x100 )
        {
            unsigned long temp1, temp2;
            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;    // high 16 bits of the entry
            seed = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);            // low 16 bits of the entry
            cryptTable[index2] = ( temp1 | temp2 );
        }
    }
}

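/************************************************************************/
/* Function: HashString                                                 */
/* Purpose : compute the hash of a string; dwHashType selects which     */
/*           0x100-entry slice of cryptTable is used                    */
/* Returns : the hash value                                             */
/************************************************************************/
unsigned long StringHash::HashString(const string& lpszString, unsigned long dwHashType)
{
    // Read the bytes as unsigned so toupper() never sees a negative value
    const unsigned char *key = reinterpret_cast<const unsigned char *>(lpszString.c_str());
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
    int ch;

    // Note: the arithmetic assumes a 32-bit unsigned long (as with MSVC)
    while(*key != 0)
    {
        ch = toupper(*key++);   // hashing is case-insensitive

        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}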

/************************************************************************/
/* Function: Hashed                                                     */
/* Purpose : check whether a string has already been hashed             */
/* Returns : the slot index if found; otherwise (unsigned long)-1       */
/************************************************************************/
unsigned long StringHash::Hashed(string lpszString)
{
    const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    // The chance that two different strings collide on all three hashes is vanishingly small
    unsigned long nHash = HashString(lpszString, HASH_OFFSET);
    unsigned long nHashA = HashString(lpszString, HASH_A);
    unsigned long nHashB = HashString(lpszString, HASH_B);
    unsigned long nHashStart = nHash % m_tablelength,
    nHashPos = nHashStart;

    while ( m_HashIndexTable[nHashPos].bExists)
    {
        if (m_HashIndexTable[nHashPos].nHashA == nHashA && m_HashIndexTable[nHashPos].nHashB == nHashB)
            return nHashPos;
        else
            nHashPos = (nHashPos + 1) % m_tablelength;   // linear probing: try the next slot

        if (nHashPos == nHashStart)   // wrapped around: the whole table has been searched
            break;
    }

    // Not found. The return type is unsigned, so callers must compare against (unsigned long)-1
    return (unsigned long)-1;
}

/************************************************************************/
/* Function: Hash                                                       */
/* Purpose : hash a string and record it in the table                   */
/* Returns : true on success; false if the table is full                */
/************************************************************************/
bool StringHash::Hash(string lpszString)
{
    const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    unsigned long nHash = HashString(lpszString, HASH_OFFSET);
    unsigned long nHashA = HashString(lpszString, HASH_A);
    unsigned long nHashB = HashString(lpszString, HASH_B);
    unsigned long nHashStart = nHash % m_tablelength,
        nHashPos = nHashStart;

    while ( m_HashIndexTable[nHashPos].bExists)
    {
        nHashPos = (nHashPos + 1) % m_tablelength;
        if (nHashPos == nHashStart) // came full circle
        {
            // No free slot is left in the table, so the string cannot be stored
            return false;
        }
    }
    // Only the two check hashes are stored; the string itself is not kept
    m_HashIndexTable[nHashPos].bExists = true;
    m_HashIndexTable[nHashPos].nHashA = nHashA;
    m_HashIndexTable[nHashPos].nHashB = nHashB;

    return true;
}

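As for the principles behind it, I have found nothing clearer than Inside MPQ, so I translated its second section with my clumsy English; the original text is reproduced below (corrections from experts are welcome):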

Principles

Most of the advancements throughout the history of computers have been because of particular problems which required solving. In this chapter, we'll take a look at some of these problems and their solutions as they pertain to the MPQ format.

Hashes

Problem: You have a very large array of strings. You have another string and need to know if it is already in the list. You would probably begin by comparing each string in the list with the other string, but when put into application, you would find that this method is far too slow for practical use. Something else must be done. But how can you know if the string exists without comparing it to all the other strings?

Solution: Hashes. Hashes are smaller data types (i.e. numbers) that represent other, larger, data types (usually strings). In this scenario, you could store hashes in the array with the strings. Then you could compute the hash of the other string and compare it to the stored hashes. If a hash in the array matches the new hash, the strings can be compared to verify the match. This method, called indexing, could speed things up by about 100 times, depending on the size of the array and the average length of the strings.

unsigned long HashString(char *lpszString)
{   
    unsigned long ulHash = 0xf1e2d3c4;        
    while (*lpszString != 0)    
    {        
        ulHash <<= 1;       
        ulHash += *lpszString++;      
    }   
    return ulHash;
}

The previous code function demonstrates a very simple hashing algorithm. The function sums the characters in the string, shifting the hash value left one bit before each character is added in. Using this algorithm, the string "arr\units.dat" would hash to 0x5A858026, and "unit\neutral\acritter.grp" would hash to 0x694CD020. Now, this is, admittedly, a very simple algorithm, and it isn't very useful, because it would generate a relatively predictable output, and a lot of collisions in the lower range of numbers. Collisions are what happen when more than one string hash to the same value.
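As a quick check of the paragraph above, here is a minimal sketch of the same shift-and-add hash. It assumes 32-bit arithmetic, hence `std::uint32_t` (an `unsigned long` is 64 bits on many non-Windows platforms and would give different values). The name `SimpleHash` is mine, not from Inside MPQ.

```cpp
#include <cstdint>
#include <string>

// Shift-and-add hash from the article, pinned to 32 bits so the result
// does not depend on how wide unsigned long is on the platform.
// SimpleHash is a hypothetical name chosen for this sketch.
std::uint32_t SimpleHash(const std::string& s)
{
    std::uint32_t h = 0xf1e2d3c4u;
    for (unsigned char c : s) {
        h <<= 1;   // shift the running value left one bit
        h += c;    // then add the next character
    }
    return h;
}
```

With 32-bit arithmetic, `SimpleHash("arr\\units.dat")` reproduces the 0x5A858026 quoted above. Collisions are easy to find: the two-character strings "ac" and "ba" hash to the same value, because 2*'a' + 'c' and 2*'b' + 'a' are both 293.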

The MPQ format, on the other hand, uses a very complicated hash algorithm (shown below) to generate totally unpredictable hash values. In fact, the hashing algorithm is so effective that it is called a one-way hash. A one-way hash is an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible. Using this particular algorithm, the filename "arr\units.dat" would hash to 0xF4E6C69D, and "unit\neutral\acritter.grp" would hash to 0xA26067F3.

unsigned long HashString(char *lpszFileName, unsigned long dwHashType)
{   
    unsigned char *key = (unsigned char *)lpszFileName;   
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;   
    int ch;

    while(*key != 0)       
    {      
        ch = toupper(*key++);   
        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);       
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;       
    }   
    return seed1;  
}
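The function above reads from `cryptTable`, which the excerpt does not define. The sketch below puts the table setup (the same recurrence as `InitCryptTable` in the listing earlier) and the hash side by side. It is only a sketch: the `mpq_sketch` namespace and the use of `std::uint32_t` are my assumptions, made so that the arithmetic stays 32 bits wide on every platform.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

namespace mpq_sketch {   // hypothetical namespace, used only for this sketch

std::uint32_t cryptTable[0x500];

// Fill the 0x500-entry table; must be called once before HashString.
void InitCryptTable()
{
    std::uint32_t seed = 0x00100001u;
    for (std::uint32_t index1 = 0; index1 < 0x100; ++index1) {
        std::uint32_t index2 = index1;
        for (int i = 0; i < 5; ++i, index2 += 0x100) {
            seed = (seed * 125 + 3) % 0x2AAAABu;
            std::uint32_t high = (seed & 0xFFFFu) << 16;
            seed = (seed * 125 + 3) % 0x2AAAABu;
            std::uint32_t low = seed & 0xFFFFu;
            cryptTable[index2] = high | low;
        }
    }
}

// hashType picks one 0x100-entry slice of the table
// (0 = table offset, 1 = check hash A, 2 = check hash B).
std::uint32_t HashString(const std::string& name, std::uint32_t hashType)
{
    std::uint32_t seed1 = 0x7FED7FEDu, seed2 = 0xEEEEEEEEu;
    for (unsigned char c : name) {
        // Upper-casing first makes the hash case-insensitive.
        std::uint32_t ch = static_cast<std::uint32_t>(std::toupper(c));
        seed1 = cryptTable[(hashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}

} // namespace mpq_sketch
```

After `InitCryptTable()` runs, the first table entry is 0x55C636E2, and because of the `toupper` call, "arr\units.dat" and "ARR\UNITS.DAT" hash to the same value for any hash type.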

Hash Tables

Problem: You tried using an index like in the previous sample, but your program absolutely demands break-neck speeds, and indexing just isn't fast enough. About the only thing you could do to make it faster is to not check all of the hashes in the array. Or, even better, if you could only make one comparison in order to be sure the string doesn't exist anywhere in the array. Sound too good to be true? It's not.

Solution: A hash table. A hash table is a special type of array in which the offset of the desired string is the hash of that string. What I mean is this. Say that you make that string array use a separate array of fixed size (let's say 1024 entries, to make it an even power of 2) for the hash table. You want to see if the new string is in that table. To get the string's place in the hash table, you compute the hash of that string, then modulo (division remainder) that hash value by the size of that table. Thus, if you used the simple hash algorithm in the previous section, "arr\units.dat" would hash to 0x5A858026, making its offset 0x26 (0x5A858026 divided by 0x400 is 0x16A160, with a remainder of 0x26). The string at this location (if there was one) would then be compared to the string to add. If the string at 0x26 doesn't match or just plain doesn't exist, then the string to add doesn't exist in the array. The following code illustrates this:

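int GetHashTablePos(char *lpszString, SOMESTRUCTURE *lpTable, int nTableSize)
{
    // Keep the hash unsigned: stored in an int, a hash of 0x80000000 or more
    // turns negative and the modulo could then yield a negative index.
    unsigned long nHash = HashString(lpszString);
    int nHashPos = (int)(nHash % (unsigned long)nTableSize);
    if (lpTable[nHashPos].bExists && !strcmp(lpTable[nHashPos].pString, lpszString))
        return nHashPos;
    else
        return -1; //Error value
}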

Now, there is one glaring flaw in that explanation. What do you think happens when a collision occurs (two different strings hash to the same value)? Obviously, they can't occupy the same entry in the hash table. Normally, this is solved by each entry in the hash table being a pointer to a linked list, and the linked list would hold all the entries that hash to that same value.
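For comparison, the conventional linked-list approach mentioned above can be sketched as follows. This is not MPQ code: `ChainedStringSet` is a hypothetical class, and it reuses the simple shift-and-add hash from the previous section (pinned to 32 bits) only because its collisions are easy to produce.

```cpp
#include <cstddef>
#include <cstdint>
#include <forward_list>
#include <string>
#include <vector>

// Separate chaining: each bucket is a linked list holding every string
// whose hash maps to that bucket. bucketCount must be greater than zero.
class ChainedStringSet {
public:
    explicit ChainedStringSet(std::size_t bucketCount) : buckets_(bucketCount) {}

    // Returns false if the string was already present.
    bool Insert(const std::string& s)
    {
        if (Contains(s)) return false;
        buckets_[Slot(s)].push_front(s);
        return true;
    }

    // Walks only the one list the string could be in.
    bool Contains(const std::string& s) const
    {
        for (const std::string& stored : buckets_[Slot(s)])
            if (stored == s) return true;
        return false;
    }

    // Bucket index: simple shift-and-add hash modulo the bucket count.
    std::size_t Slot(const std::string& s) const
    {
        std::uint32_t h = 0xf1e2d3c4u;
        for (unsigned char c : s) { h <<= 1; h += c; }
        return h % buckets_.size();
    }

private:
    std::vector<std::forward_list<std::string>> buckets_;
};
```

The strings "ac" and "ba" land in the same bucket here, yet both can be stored and found, because the full strings are kept and compared inside the list.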

MPQs use a hash table of filenames to keep track of the files inside, but the format of this table is somewhat different from the way hash tables are normally done. First of all, instead of using a hash as an offset, and storing the actual filename for verification, MPQs do not store the filename at all, but rather use three different hashes: one for the hash table offset, two for verification. These two verification hashes are used in place of the actual filename. Of course, this leaves the possibility that two different filenames would hash to the same three hashes, but the chances of this happening are, on average, 1:18889465931478580854784, which should be safe enough for just about anyone.

The other way that an MPQ's hash table differs from the conventional implementation is that instead of using a linked list for each entry, when a collision occurs, the entry will be shifted to the next slot, and the process repeated until a free space is found. Take a look at the following illustrational code, which is basically the way a file is located for reading in an MPQ:

int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize)
{
    const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    // Unsigned throughout, so the modulo below cannot produce a negative index
    unsigned long nHash  = HashString(lpszString, HASH_OFFSET);
    unsigned long nHashA = HashString(lpszString, HASH_A);
    unsigned long nHashB = HashString(lpszString, HASH_B);
    int nHashStart = (int)(nHash % (unsigned long)nTableSize), nHashPos = nHashStart;

    while (lpTable[nHashPos].bExists)
    {
        if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
            return nHashPos;

        nHashPos = (nHashPos + 1) % nTableSize;   // no match here: try the next slot
        if (nHashPos == nHashStart)               // wrapped around: whole table searched
            break;
    }
    return -1; //Error value
}

However convoluted that code may look, the theory behind it isn't difficult. It basically follows this process when looking to read a file:

1. Compute the three hashes (offset hash and two check hashes) and store them in variables.
2. Move to the entry of the offset hash.
3. Is the entry unused? If so, stop the search and return 'file not found'.
4. Do the two check hashes match the check hashes of the file we're looking for? If so, stop the search and return the current entry.
5. Move to the next entry in the list, wrapping around to the beginning if we were on the last entry.
6. Is the entry we just moved to the same as the offset hash (did we look through the whole hash table?)? If so, stop the search and return 'file not found'.
7. Go back to step 3.
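The lookup steps above, together with the matching insert, can be sketched independently of how the three hashes are computed. Everything named here (`NameHashes`, `Slot`, `ProbedTable`) is hypothetical and exists only for illustration; the caller is assumed to have computed the offset hash and the two check hashes already.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The three values computed for one name (step 1).
struct NameHashes {
    std::uint32_t offset;  // picks the starting slot
    std::uint32_t a;       // first check hash
    std::uint32_t b;       // second check hash
};

struct Slot {
    std::uint32_t a;
    std::uint32_t b;
    bool exists;
};

class ProbedTable {
public:
    // size must be greater than zero; vector value-initializes every
    // slot, so exists starts out false everywhere.
    explicit ProbedTable(std::size_t size) : slots_(size) {}

    // Steps 2 to 7: returns the slot index, or -1 for "not found".
    long Find(const NameHashes& h) const
    {
        const std::size_t start = h.offset % slots_.size();
        std::size_t pos = start;
        while (slots_[pos].exists) {                       // step 3
            if (slots_[pos].a == h.a && slots_[pos].b == h.b)
                return static_cast<long>(pos);             // step 4
            pos = (pos + 1) % slots_.size();               // step 5
            if (pos == start) break;                       // step 6
        }
        return -1;
    }

    // Takes the first free slot at or after the starting slot;
    // returns -1 when every slot is already in use.
    long Insert(const NameHashes& h)
    {
        const std::size_t start = h.offset % slots_.size();
        std::size_t pos = start;
        while (slots_[pos].exists) {
            pos = (pos + 1) % slots_.size();
            if (pos == start) return -1;
        }
        slots_[pos].a = h.a;
        slots_[pos].b = h.b;
        slots_[pos].exists = true;
        return static_cast<long>(pos);
    }

private:
    std::vector<Slot> slots_;
};
```

In a four-slot table, two names whose offset hashes both map to slot 3 end up in slots 3 and 0: the second insert wraps around to the beginning, and a later lookup follows the same path to find it. Once all four slots are taken, `Insert` returns -1.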

If you were paying attention, you might have noticed from my explanation and sample code that the MPQ's hash table has to hold all the file entries in the MPQ. But what do you think happens when every hash-table entry gets filled? The answer might surprise you with its obviousness: you can't add any more files. Several people have asked me why there is a limit (called the file limit) on the number of files that can be put in an MPQ, and if there is any way around this limit. Well, you already have the answer to the first question. As for the second: no, you cannot get around the file limit. For that matter, hash tables cannot even be resized without remaking the entire MPQ from scratch. This is because the location of each entry in the hash table may well change due to the resizing, and we would not be able to derive the new position because the position is the hash of the file name, and we may not know the file name.

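That makes three posts in a row on hash table implementations in C, all collected from around the web and organized as study notes for future reference. I cannot claim to understand every detail, but I now have a rough idea of how a hash table is implemented. Plenty of open-source projects have their own implementations as well, such as Lua, Redis, and Apache. With limited time and energy, I will dig that hole now and fill it in later. Either way, all of this smacks a little of Kong Yiji, haha!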
