A Hash Table Implementation in C (Part 3): The Legendary Blizzard Version

Posted on May 26, 2013

Reposted from http://www.cnblogs.com/xiekeli/archive/2012/01/17/2324433.html

--------------------------------------------------------------------

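I have already written two study notes on implementing hash tables in C, but the most legendary implementation circulating online seems to be the hash table in the MPQ file-packing manager that Blizzard uses for Warcraft. It resolves collisions with linear probing (open addressing). Every string is hashed three times during insertion and lookup: the first hash selects the slot where the search starts, and the other two are stored for verification, which greatly reduces the chance of a false match.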

I found the following code online, though I cannot vouch for its provenance:

StringHash.h

#pragma once

#include <StdAfx.h>
#include <string>

using namespace std;

#define MAXTABLELEN 1024    // default size of the hash index table
//////////////////////////////////////////////////////////////////////////
// Hash index table entry
typedef struct _HASHTABLE
{
    long nHashA;    // first check hash
    long nHashB;    // second check hash
    bool bExists;   // true once the slot is occupied
} HASHTABLE, *PHASHTABLE;

class StringHash
{
public:
    StringHash(const long nTableLength = MAXTABLELEN);
    ~StringHash(void);
private:
    unsigned long cryptTable[0x500];
    unsigned long m_tablelength;    // length of the hash index table
    HASHTABLE *m_HashIndexTable;
private:
    void InitCryptTable();    // fill cryptTable, the lookup table used by HashString
    unsigned long HashString(const string &lpszString, unsigned long dwHashType); // compute a hash value
public:
    bool Hash(string url);               // insert url into the table
    unsigned long Hashed(string url);    // check whether url has already been hashed
};

StringHash.cpp

#include "StdAfx.h"
#include "StringHash.h"
#include <cctype>    // toupper

StringHash::StringHash(const long nTableLength /*= MAXTABLELEN*/)
{
    InitCryptTable();
    m_tablelength = nTableLength;
    // Initialize the hash table: every slot starts out empty
    m_HashIndexTable = new HASHTABLE[nTableLength];
    for ( int i = 0; i < nTableLength; i++ )
    {
        m_HashIndexTable[i].nHashA = -1;
        m_HashIndexTable[i].nHashB = -1;
        m_HashIndexTable[i].bExists = false;
    }
}

StringHash::~StringHash(void)
{
    // Release the table
    if ( NULL != m_HashIndexTable )
    {
        delete []m_HashIndexTable;
        m_HashIndexTable = NULL;
        m_tablelength = 0;
    }
}

/************************************************************************/
/* Function: InitCryptTable                                             */
/* Purpose : fill cryptTable, the lookup table used by HashString       */
/* Returns : nothing                                                    */
/************************************************************************/
void StringHash::InitCryptTable()  
{   
    unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;  

    for( index1 = 0; index1 < 0x100; index1++ )  
    {   
        for( index2 = index1, i = 0; i < 5; i++, index2 += 0x100 )  
        {   
            unsigned long temp1, temp2;  
            seed = (seed * 125 + 3) % 0x2AAAAB;  
            temp1 = (seed & 0xFFFF) << 0x10;  
            seed = (seed * 125 + 3) % 0x2AAAAB;  
            temp2 = (seed & 0xFFFF);  
            cryptTable[index2] = ( temp1 | temp2 );   
        }   
    }   
}  

/************************************************************************/
/* Function: HashString                                                 */
/* Purpose : compute the hash of a string for the given hash type       */
/* Returns : the hash value                                             */
/************************************************************************/
unsigned long StringHash::HashString(const string& lpszString, unsigned long dwHashType)  
{   
    unsigned char *key = (unsigned char *)(const_cast<char*>(lpszString.c_str()));  
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;  
    int ch;  

    while(*key != 0)  
    {   
        ch = toupper(*key++);  

        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);  
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;   
    }  
    return seed1;   
}  

/************************************************************************/
/* Function: Hashed                                                     */
/* Purpose : check whether a string has already been hashed             */
/* Returns : the slot index if found; otherwise -1, which the unsigned  */
/*           return type turns into (unsigned long)-1                   */
/************************************************************************/
unsigned long StringHash::Hashed(string lpszString)
{
    const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    // The chance that two different strings collide on all three hashes is vanishingly small
    unsigned long nHash = HashString(lpszString, HASH_OFFSET);
    unsigned long nHashA = HashString(lpszString, HASH_A);
    unsigned long nHashB = HashString(lpszString, HASH_B);
    unsigned long nHashStart = nHash % m_tablelength;
    unsigned long nHashPos = nHashStart;

    while ( m_HashIndexTable[nHashPos].bExists)
    {
        if (m_HashIndexTable[nHashPos].nHashA == nHashA && m_HashIndexTable[nHashPos].nHashB == nHashB)
            return nHashPos;
        else
            nHashPos = (nHashPos + 1) % m_tablelength;

        if (nHashPos == nHashStart)
            break;
    }

    return (unsigned long)-1; // not found
}

/************************************************************************/
/* Function: Hash                                                       */
/* Purpose : insert a string into the table (no duplicate check)        */
/* Returns : true on success, false if the table is full                */
/************************************************************************/
bool StringHash::Hash(string lpszString)
{
    const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    unsigned long nHash = HashString(lpszString, HASH_OFFSET);
    unsigned long nHashA = HashString(lpszString, HASH_A);
    unsigned long nHashB = HashString(lpszString, HASH_B);
    unsigned long nHashStart = nHash % m_tablelength;
    unsigned long nHashPos = nHashStart;

    while ( m_HashIndexTable[nHashPos].bExists)
    {
        nHashPos = (nHashPos + 1) % m_tablelength;
        if (nHashPos == nHashStart) // wrapped all the way around
        {
            // No free slot is left in the table, so the string cannot be stored
            return false;
        }
    }
    m_HashIndexTable[nHashPos].bExists = true;
    m_HashIndexTable[nHashPos].nHashA = nHashA;
    m_HashIndexTable[nHashPos].nHashB = nHashB;

    return true;
}

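As for how it works, I don't think anything explains it more clearly than Inside MPQ, so I worked through the second chapter of that text with my clumsy English; the original is reproduced below (corrections are welcome):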

Principles

Most of the advancements throughout the history of computers have been because of particular problems which required solving. In this chapter, we'll take a look at some of these problems and their solutions as they pertain to the MPQ format.

Hashes

Problem: You have a very large array of strings. You have another string and need to know if it is already in the list. You would probably begin by comparing each string in the list with the other string, but when put into application, you would find that this method is far too slow for practical use. Something else must be done. But how can you know if the string exists without comparing it to all the other strings?

Solution: Hashes. Hashes are smaller data types (i.e. numbers) that represent other, larger, data types (usually strings). In this scenario, you could store hashes in the array with the strings. Then you could compute the hash of the other string and compare it to the stored hashes. If a hash in the array matches the new hash, the strings can be compared to verify the match. This method, called indexing, could speed things up by about 100 times, depending on the size of the array and the average length of the strings.

unsigned long HashString(char *lpszString)
{
    unsigned long ulHash = 0xf1e2d3c4;
    while (*lpszString != 0)
    {
        ulHash <<= 1;
        ulHash += *lpszString++;
    }
    return ulHash;
}

The previous code function demonstrates a very simple hashing algorithm. The function sums the characters in the string, shifting the hash value left one bit before each character is added in. Using this algorithm, the string "arr\units.dat" would hash to 0x5A858026, and "unit\neutral\acritter.grp" would hash to 0x694CD020. Now, this is, admittedly, a very simple algorithm, and it isn't very useful, because it would generate a relatively predictable output, and a lot of collisions in the lower range of numbers. A collision is what happens when more than one string hashes to the same value.
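
To make the indexing idea concrete, here is a minimal sketch. It assumes a fixed 32-bit hash (std::uint32_t) so that the quoted values come out the same on every platform; IndexedString, AddString and FindString are names made up for this illustration and do not come from the MPQ text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// The shift-and-add hash from the text, with a fixed 32-bit width.
// Bytes are treated as unsigned, which is identical for ASCII input.
std::uint32_t SimpleHash(const char *s)
{
    assert(s != nullptr);
    std::uint32_t h = 0xf1e2d3c4u;
    while (*s != 0) {
        h <<= 1;
        h += static_cast<unsigned char>(*s++);
    }
    return h;
}

// Hypothetical index entry: each string is stored next to its hash.
struct IndexedString {
    std::uint32_t hash;
    std::string   text;
};

// Append a string together with its precomputed hash.
void AddString(std::vector<IndexedString> &index, const char *s)
{
    index.push_back(IndexedString{SimpleHash(s), s});
}

// Return the position of s in the index, or -1 if it is absent.
// The cheap integer comparison filters out almost every entry; the
// full string comparison only runs when the hashes already agree.
int FindString(const std::vector<IndexedString> &index, const char *s)
{
    const std::uint32_t h = SimpleHash(s);
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (index[i].hash == h && index[i].text == s)
            return static_cast<int>(i);
    }
    return -1;
}
```

With the 32-bit type, SimpleHash("arr\\units.dat") gives 0x5A858026 and SimpleHash("unit\\neutral\\acritter.grp") gives 0x694CD020, the values quoted above. The search still visits every entry, which is the limitation the hash table section addresses.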

The MPQ format, on the other hand, uses a very complicated hash algorithm (shown below) to generate totally unpredictable hash values. In fact, the hashing algorithm is so effective that it is called a one-way hash. A one-way hash is an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible. Using this particular algorithm, the filename "arr\units.dat" would hash to 0xF4E6C69D, and "unit\neutral\acritter.grp" would hash to 0xA26067F3.

unsigned long HashString(char *lpszFileName, unsigned long dwHashType)
{
    unsigned char *key = (unsigned char *)lpszFileName;
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++);
        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
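
The function above reads from a cryptTable that the excerpt does not define. The following is a self-contained sketch that pairs it with the table setup from StringHash.cpp earlier in this post. It assumes fixed 32-bit arithmetic (std::uint32_t), because with a 64-bit unsigned long the sums no longer wrap at 32 bits and the results can differ. The global g_cryptTable and the range check on the hash type are additions for illustration, and the sketch makes no claim about which hash type produces the values quoted in the text.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>

// 5 hash types x 256 byte values = 0x500 entries.
static std::uint32_t g_cryptTable[0x500];

// Fill the table with the same recurrence as InitCryptTable above.
// seed always stays below 0x2AAAAB, so seed * 125 + 3 cannot overflow.
void InitCryptTable()
{
    std::uint32_t seed = 0x00100001u;
    for (std::uint32_t index1 = 0; index1 < 0x100; ++index1) {
        std::uint32_t index2 = index1;
        for (int i = 0; i < 5; ++i, index2 += 0x100) {
            seed = (seed * 125 + 3) % 0x2AAAABu;
            const std::uint32_t hi = (seed & 0xFFFFu) << 16;
            seed = (seed * 125 + 3) % 0x2AAAABu;
            const std::uint32_t lo = seed & 0xFFFFu;
            g_cryptTable[index2] = hi | lo;
        }
    }
}

// Hash a name with one of the hash types. InitCryptTable must be
// called once before the first use.
std::uint32_t HashString(const char *name, std::uint32_t hashType)
{
    assert(name != nullptr && hashType < 5);  // keeps the index below 0x500
    const unsigned char *key = reinterpret_cast<const unsigned char *>(name);
    std::uint32_t seed1 = 0x7FED7FEDu, seed2 = 0xEEEEEEEEu;
    while (*key != 0) {
        const std::uint32_t ch = static_cast<std::uint32_t>(std::toupper(*key++));
        seed1 = g_cryptTable[(hashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
```

Because every character goes through toupper, names that differ only in case hash identically, and an empty string returns the initial seed1 unchanged.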

Hash Tables

Problem: You tried using an index like in the previous sample, but your program absolutely demands break-neck speeds, and indexing just isn't fast enough. About the only thing you could do to make it faster is to not check all of the hashes in the array. Or, even better, if you could only make one comparison in order to be sure the string doesn't exist anywhere in the array. Sound too good to be true? It's not.

Solution: A hash table. A hash table is a special type of array in which the offset of the desired string is the hash of that string. What I mean is this. Say that you make that string array use a separate array of fixed size (let's say 1024 entries, to make it an even power of 2) for the hash table. You want to see if the new string is in that table. To get the string's place in the hash table, you compute the hash of that string, then modulo (division remainder) that hash value by the size of that table. Thus, if you used the simple hash algorithm in the previous section, "arr\units.dat" would hash to 0x5A858026, making its offset 0x26 (0x5A858026 divided by 0x400 is 0x16A160, with a remainder of 0x26). The string at this location (if there was one) would then be compared to the string to add. If the string at 0x26 doesn't match or just plain doesn't exist, then the string to add doesn't exist in the array. The following code illustrates this:

int GetHashTablePos(char *lpszString, SOMESTRUCTURE *lpTable, int nTableSize)
{
    // Keep the hash unsigned: stored in an int it could make the modulo negative.
    unsigned long nHash = HashString(lpszString);
    int nHashPos = (int)(nHash % (unsigned long)nTableSize);

    if (lpTable[nHashPos].bExists && !strcmp(lpTable[nHashPos].pString, lpszString))
        return nHashPos;
    else
        return -1; //Error value
}
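
The slot arithmetic is easy to check on its own. Below is a small sketch, assuming the hash is computed elsewhere and passed in as a 32-bit value. Slot is a hypothetical stand-in for SOMESTRUCTURE, and SlotFor and LookupSlot are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for SOMESTRUCTURE: one slot of the table.
struct Slot {
    bool        bExists;
    const char *pString;
};

// Slot index for a hash: the remainder after dividing by the table size.
int SlotFor(std::uint32_t hash, int tableSize)
{
    assert(tableSize > 0);
    return static_cast<int>(hash % static_cast<std::uint32_t>(tableSize));
}

// Same check as GetHashTablePos: a single comparison decides whether
// the string is present. Collisions are not handled here.
int LookupSlot(std::uint32_t hash, const char *s, const Slot *table, int tableSize)
{
    const int pos = SlotFor(hash, tableSize);
    if (table[pos].bExists && std::strcmp(table[pos].pString, s) == 0)
        return pos;
    return -1;  // empty slot, or a different string lives there
}
```

SlotFor(0x5A858026, 1024) is 0x26, matching the worked example. Since 1024 is a power of two, hash & (1024 - 1) gives the same result as the modulo.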

Now, there is one glaring flaw in that explanation. What do you think happens when a collision occurs (two different strings hash to the same value)? Obviously, they can't occupy the same entry in the hash table. Normally, this is solved by each entry in the hash table being a pointer to a linked list, and the linked list would hold all the entries that hash to that same value.
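
For comparison, the conventional linked-list approach might look like the sketch below. ChainedTable and LengthHash are hypothetical names; the hash function is passed in, and the deliberately weak LengthHash (string length) is only there to make collisions easy to provoke.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <vector>

// Hypothetical chained table: every bucket owns a list holding all the
// strings whose hash maps to that bucket.
class ChainedTable {
public:
    typedef std::uint32_t (*HashFn)(const std::string &);

    ChainedTable(std::size_t bucketCount, HashFn hash)
        : buckets_(bucketCount), hash_(hash)
    {
        assert(bucketCount > 0 && hash != nullptr);
    }

    // Add s unless it is already present; returns true if it was added.
    bool Insert(const std::string &s)
    {
        std::list<std::string> &chain = buckets_[BucketOf(s)];
        for (const std::string &entry : chain)
            if (entry == s)
                return false;
        chain.push_back(s);
        return true;
    }

    // Walk only the chain of the bucket that s hashes to.
    bool Contains(const std::string &s) const
    {
        const std::list<std::string> &chain = buckets_[BucketOf(s)];
        for (const std::string &entry : chain)
            if (entry == s)
                return true;
        return false;
    }

    // Number of entries sharing a bucket with s.
    std::size_t ChainLength(const std::string &s) const
    {
        return buckets_[BucketOf(s)].size();
    }

private:
    std::size_t BucketOf(const std::string &s) const
    {
        return hash_(s) % buckets_.size();
    }

    std::vector<std::list<std::string> > buckets_;
    HashFn hash_;
};

// Deliberately weak hash for the demo: strings of equal length collide.
std::uint32_t LengthHash(const std::string &s)
{
    return static_cast<std::uint32_t>(s.size());
}
```

With ChainedTable t(8, LengthHash), inserting "ab" and "cd" puts both in the same bucket, and a lookup for "xy" walks that two-element chain before reporting that it is absent.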

MPQs use a hash table of filenames to keep track of the files inside, but the format of this table is somewhat different from the way hash tables are normally done. First of all, instead of using a hash as an offset, and storing the actual filename for verification, MPQs do not store the filename at all, but rather use three different hashes: one for the hash table offset, two for verification. These two verification hashes are used in place of the actual filename. Of course, this leaves the possibility that two different filenames would hash to the same three hashes, but the chances of this happening are, on average, 1:18889465931478580854784, which should be safe enough for just about anyone.
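
The quoted odds are exactly 2^74. The text does not show how the figure was derived; one plausible reading, offered here as an assumption, is that a false match needs both 32-bit check hashes to agree and the offset hash to land on the same slot of a 1024-entry table:

```latex
\[
\underbrace{2^{32}}_{\text{check hash A}} \times
\underbrace{2^{32}}_{\text{check hash B}} \times
\underbrace{2^{10}}_{\text{1024 slots}}
= 2^{74} = 18\,889\,465\,931\,478\,580\,854\,784
\]
```

Under that reading, a larger table would make an accidental match even less likely.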

The other way that an MPQ's hash table differs from the conventional implementation is that instead of using a linked list for each entry, when a collision occurs, the entry will be shifted to the next slot, and the process repeated until a free space is found. Take a look at the following illustrative code, which is basically the way a file is located for reading in an MPQ:

int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize)
{
    const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
    // The offset hash stays unsigned so the modulo below cannot go negative.
    unsigned long nHash = HashString(lpszString, HASH_OFFSET);
    int nHashA = HashString(lpszString, HASH_A),
        nHashB = HashString(lpszString, HASH_B),
        nHashStart = (int)(nHash % (unsigned long)nTableSize),
        nHashPos = nHashStart;

    while (lpTable[nHashPos].bExists)
    {
        if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
            return nHashPos;
        else
            nHashPos = (nHashPos + 1) % nTableSize;

        if (nHashPos == nHashStart)
            break;
    }
    return -1; //Error value
}

However convoluted that code may look, the theory behind it isn't difficult. It basically follows this process when looking to read a file:

1. Compute the three hashes (offset hash and two check hashes) and store them in variables.
2. Move to the entry of the offset hash.
3. Is the entry unused? If so, stop the search and return 'file not found'.
4. Do the two check hashes match the check hashes of the file we're looking for? If so, stop the search and return the current entry.
5. Move to the next entry in the list, wrapping around to the beginning if we were on the last entry.
6. Is the entry we just moved to the same as the offset hash (have we looked through the whole hash table)? If so, stop the search and return 'file not found'.
7. Go back to step 3.
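
The process listed above, together with the matching insertion walk, can be sketched without any hashing code by passing the three hashes in directly. NameHashes, Entry, FindEntry and InsertEntry are hypothetical names for this illustration; the real MPQ table layout is not shown in this chapter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The three hashes of one filename (hypothetical helper type).
struct NameHashes {
    std::uint32_t offset;  // picks the starting slot
    std::uint32_t a;       // first check hash
    std::uint32_t b;       // second check hash
};

// One table entry: only the check hashes are stored, never the name.
struct Entry {
    std::uint32_t a = 0;
    std::uint32_t b = 0;
    bool          exists = false;
};

// Returns the slot holding the entry, or -1 for 'file not found'.
int FindEntry(const std::vector<Entry> &table, const NameHashes &h)
{
    assert(!table.empty());
    const std::size_t start = h.offset % table.size();  // entry of the offset hash
    std::size_t pos = start;
    while (table[pos].exists) {                          // an unused entry ends the search
        if (table[pos].a == h.a && table[pos].b == h.b)  // both check hashes match
            return static_cast<int>(pos);
        pos = (pos + 1) % table.size();                  // next entry, wrapping around
        if (pos == start)                                // looked through the whole table
            break;
    }
    return -1;
}

// Insertion uses the same walk: take the first unused slot at or after
// the starting slot. Returns the slot used, or -1 if the table is full.
int InsertEntry(std::vector<Entry> &table, const NameHashes &h)
{
    assert(!table.empty());
    const std::size_t start = h.offset % table.size();
    std::size_t pos = start;
    while (table[pos].exists) {
        pos = (pos + 1) % table.size();
        if (pos == start)
            return -1;
    }
    table[pos].a = h.a;
    table[pos].b = h.b;
    table[pos].exists = true;
    return static_cast<int>(pos);
}
```

In a 4-slot table, offsets 3 and 7 both start at slot 3, so the second insertion wraps around to slot 0, and a later lookup for it visits slot 3 first and then finds it at slot 0. Like the blog's StringHash class, this sketch has no way to remove an entry.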