The consistent hashing algorithm of jedis ShardedJedisPool (Part 3): MurmurHash

 


The C++ version of MurmurHash

//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.

    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    // Initialize the hash to a 'random' value

    unsigned int h = seed ^ len;

    // Mix 4 bytes at a time into the hash

    const unsigned char * data = (const unsigned char *)key;

    while(len >= 4)
    {
        unsigned int k = *(unsigned int *)data;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    // Handle the last few bytes of the input array

    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };

    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.

    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}

The Java version in jedis

    /**
     * Hashes the bytes in a buffer from the current position to the limit.
     *
     * @param buf
     *            The bytes to hash.
     * @param seed
     *            The seed for the hash.
     * @return The 32 bit murmur hash of the bytes in the buffer.
     */
    public static int hash(ByteBuffer buf, int seed) {
        // save byte order for later restoration
        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        int m = 0x5bd1e995;
        int r = 24;

        int h = seed ^ buf.remaining();

        int k;
        while (buf.remaining() >= 4) {
            k = buf.getInt();

            k *= m;
            k ^= k >>> r;
            k *= m;

            h *= m;
            h ^= k;
        }

        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(4).order(
                    ByteOrder.LITTLE_ENDIAN);
            // for big-endian version, use this first:
            // finish.position(4-buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getInt();
            h *= m;
        }

        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;

        buf.order(byteOrder);
        return h;
    }

    // 64-bit version

    public static long hash64A(ByteBuffer buf, int seed) {
        ByteOrder byteOrder = buf.order();
        buf.order(ByteOrder.LITTLE_ENDIAN);

        long m = 0xc6a4a7935bd1e995L;
        int r = 47;

        long h = seed ^ (buf.remaining() * m);

        long k;
        while (buf.remaining() >= 8) {
            k = buf.getLong();

            k *= m;
            k ^= k >>> r;
            k *= m;

            h ^= k;
            h *= m;
        }

        if (buf.remaining() > 0) {
            ByteBuffer finish = ByteBuffer.allocate(8).order(
                    ByteOrder.LITTLE_ENDIAN);
            // for big-endian version, do this first:
            // finish.position(8-buf.remaining());
            finish.put(buf).rewind();
            h ^= finish.getLong();
            h *= m;
        }

        h ^= h >>> r;
        h *= m;
        h ^= h >>> r;

        buf.order(byteOrder);
        return h;
    }
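Tying this back to the title: a sharded pool places each shard at many points on a hash ring and routes a key to the first shard point at or after the key's hash. Below is a minimal sketch of that idea built on the 64-bit hash above. It assumes the two methods are available in a class named MurmurHash; the 160 virtual nodes per shard and the seed 0x1234ABCD are illustrative choices for this sketch, not jedis's actual Sharded internals.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring; not the jedis Sharded implementation.
public class HashRingSketch {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public HashRingSketch(String[] shards) {
        for (String shard : shards) {
            // Several virtual nodes per shard smooth the key distribution;
            // 160 is an assumed weight for this sketch.
            for (int n = 0; n < 160; n++) {
                ring.put(hash64(shard + "-" + n), shard);
            }
        }
    }

    public String getShard(String key) {
        long h = hash64(key);
        // Walk clockwise: take the first virtual node at or after the key's hash,
        // wrapping around to the start of the ring if necessary.
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash64(String s) {
        // Delegates to the hash64A method shown above (assumed to live in a
        // class named MurmurHash); the seed value is arbitrary.
        return MurmurHash.hash64A(ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8)), 0x1234ABCD);
    }
}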

 

 

MurmurHash

 

By Austin Appleby

 

MurmurHash (http://murmurhash.googlepages.com/), written by Austin Appleby, is a multiplicative hash function with a mixing step. It aims to be a fast, statistically strong hashing algorithm.

"The name, if you're wondering, comes from the simplest sequence of operations which will thoroughly mix the bits of a value - "x *= m; x = rotate_left(x,r);" - multiply and rotate. Repeat that about 15 times using 'good' values of m and r, and x will end up pseudo-randomized. Unfortunately multiply+rotate has a few major weaknesses when used in a hash function, so I used multiply+shift+xor. I liked the name Murmur better than Musxmusx, so I kept it."

MurmurHash was designed with four main goals in mind:

    1. Distribution – Passing Chi-Square distribution tests for all keysets and bucket sizes.
    2. Avalanche Behaviour – Maximum bias of 0.5%.
    3. Collision Resistance – No collisions possible for 4-byte keys, no small (1- to 7-bit) differentials.
    4. Simple and Fast – As few instructions as possible, as fast as possible while remaining statistically strong.      

 

----------------------------------------Discussion---------------------------------------
 

(Note, this was written for MurmurHash 1.0 and is only loosely applicable to 2.0) 

 

I'm going to put random brain dump stuff here.

 

The constants for MurmurHash were found by searching for parameters that fit the following conditions -

1. The mixing step 

   x *= m; 
   x ^= x >> r1;

   should achieve nearly complete avalanche after two iterations.


2. The mixing step

   x *= m;
   x ^= x >> r1;
   x *= m;
   x ^= x >> r2;
   x *= m;
   x ^= x >> r3;

   should achieve nearly complete avalanche.


3. The distribution of the hash as a whole should produce a minimal chi-square value on both easy (dictionary) and hard (sparse) keysets.


A bit of testing indicated that r1 pretty much had to be 16. Once I'd found a constant that passed condition 1, exhaustive search determined that r2 and r3 were best set at 10 and 17. From there I searched iteratively - find a value of m that improves on condition 3, test it against conditions 1 and 2 (searching for new r2 and r3 values), keep the new m if it's acceptable, rinse and repeat.

The current value of m (0xc6a4a793) produces an avalanche bias of 0.15% for condition 2 - I've found constants that produce values as low as 0.09%, but they don't fare as well on the chi-square test.
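As a rough sketch of how such a bias figure can be measured (an illustration, not Appleby's actual test harness): flip each input bit across many random inputs and record how far each output bit's flip rate strays from the ideal 50%. The three-round mix below uses the constants quoted above (m = 0xc6a4a793, shifts 16, 10, 17); the trial count is arbitrary.

import java.util.Random;

public class AvalancheSketch {
    // The mix from condition 2, with r1 = 16, r2 = 10, r3 = 17 as quoted in the text.
    static int mix(int x) {
        final int m = 0xc6a4a793;
        x *= m; x ^= x >>> 16;
        x *= m; x ^= x >>> 10;
        x *= m; x ^= x >>> 17;
        return x;
    }

    public static void main(String[] args) {
        final int trials = 100_000;          // assumed sample size for this sketch
        long[][] flips = new long[32][32];   // flips[inputBit][outputBit]
        Random rnd = new Random(42);

        for (int t = 0; t < trials; t++) {
            int x = rnd.nextInt();
            int h = mix(x);
            for (int in = 0; in < 32; in++) {
                int d = h ^ mix(x ^ (1 << in));   // output differential for a 1-bit input flip
                for (int out = 0; out < 32; out++) {
                    if ((d & (1 << out)) != 0) flips[in][out]++;
                }
            }
        }

        // Bias = how far each flip probability is from the ideal 0.5.
        double worst = 0;
        for (int in = 0; in < 32; in++) {
            for (int out = 0; out < 32; out++) {
                worst = Math.max(worst, Math.abs(flips[in][out] / (double) trials - 0.5));
            }
        }
        System.out.printf("worst avalanche bias: %.2f%%%n", worst * 100);
    }
}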

 


Andres Valloud mentioned that the avalanche condition isn't sufficient for proving a hash is random, and he's clearly right - 

unsigned int PassesAvalancheButIsAwful ( const void * blob, int len )
{
    return (MD5(blob,len) & 1) ? 0xFFFFFFFF : 0;
}

will pass the avalanche test even though it can only produce two possible values (it of course would fail the chi-square test).

Similarly, for a given hash table size you can create a hash function that passes the chi-square test with flying colors but fails the avalanche test -

unsigned int PassesChiSquaredButIsAwful ( const void * blob, int len )
{
    return MD5(blob,len) % table_size;
}

I suspect you can construct one that would pass for all table sizes but still fail avalanche, but I'm not certain how to go about that.

 Anyhow, it seems that the two tests together do a good job of weeding out poor hash functions - chi-square to catch "random" but non-uniform distributions, avalanche to catch good distributions but poor mixing.

 

You could also throw the bit independence criterion (BIC) into the mix, which is similar to avalanche but adds an extra dimension - for each 1-bit input differential, compute the output differential and see how often each possible pair of output bits flips - the values 00, 01, 10, and 11 should appear equally.

I'm not certain how applicable this is to hash functions though - Murmur actually has some significant weaknesses in the BIC test, yet they don't appear to affect the quality of its output. Flipping the final mix shift values from (10,17) to (17,10) seems to fix this, at a cost of a few slightly biased (2%) bits in the avalanche result. Doing so doesn't improve any of the actual test results though, so I don't think it's worth worrying about.
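A rough sketch of that BIC check, reusing the same illustrative three-round mix as in the avalanche sketch (again an assumption, not the original harness): for each input bit and each pair of output bits, count how often the output differential shows each of the four patterns.

import java.util.Random;

public class BicSketch {
    // Same illustrative mix as in the avalanche sketch above.
    static int mix(int x) {
        final int m = 0xc6a4a793;
        x *= m; x ^= x >>> 16;
        x *= m; x ^= x >>> 10;
        x *= m; x ^= x >>> 17;
        return x;
    }

    public static void main(String[] args) {
        final int trials = 10_000;   // assumed sample size; BIC is much more expensive than avalanche
        // counts[inputBit][outBitA][outBitB][pattern], pattern = 00, 01, 10 or 11
        long[][][][] counts = new long[32][32][32][4];
        Random rnd = new Random(1);

        for (int t = 0; t < trials; t++) {
            int x = rnd.nextInt();
            int h = mix(x);
            for (int in = 0; in < 32; in++) {
                int d = h ^ mix(x ^ (1 << in));   // output differential
                for (int a = 0; a < 32; a++) {
                    for (int b = a + 1; b < 32; b++) {
                        int pattern = (((d >>> a) & 1) << 1) | ((d >>> b) & 1);
                        counts[in][a][b][pattern]++;
                    }
                }
            }
        }

        // Each pattern should show up about a quarter of the time for every
        // (input bit, output-bit pair); large deviations indicate a BIC weakness.
        double worst = 0;
        for (int in = 0; in < 32; in++)
            for (int a = 0; a < 32; a++)
                for (int b = a + 1; b < 32; b++)
                    for (int p = 0; p < 4; p++)
                        worst = Math.max(worst, Math.abs(counts[in][a][b][p] / (double) trials - 0.25));
        System.out.printf("worst BIC deviation: %.2f%%%n", worst * 100);
    }
}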
