jedis ShardedJedisPool的 HASH一致性算法(二)为什么采用质数31

hash 是用于分配值到一或多个存储里,并且为了之后快速检索.如下图,需要插入一个新的item

基于hash值,必须在一定时间(当然是越快越好)告知值在哪个块中保存

这个是非常慢的hash检索逻辑

1
2
3
4
5
for(i = 0; i < totalkeys; i++)
     {
     if (CheckForEquality(key[i], "Samuel"))
          return i;
     }

这种顺序查询的算法成本是 (O(n)), 随着KEY越多,性能越差。并且假如类型比较复杂,那检索成本就越难以接受 。

所以我们有2个问题

  1. 复杂的值难以比较
  2. 顺序查询性能太差

解决复杂值比较

The easy way out for this issue is to derive  a way to decompose complex values into a key or a hash that is easy to work with. The easiest way to achieve this of-course is to generate UNIQUE numbers from your value. The number has to be UNIQUE since we want to distinguish one value from another. This UNIQUE side of things is where primes come in handy.

质数

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

为什么使用31?

Ok, so now you have your unique identifier generated for each value. Now how do you allocate this value to a bucket or a container you have.

Lets say your container is a fixed array of 16 items. The easiest way to decide where to stick the value or the generated unique number, from now on referred to as the hash key, is to stick the key in the same location as its value.

So if your hash is 3 you stick your key in the location number 3. But of-course what would happen if the hash number is bigger than your container size. For this reason the initial or the very first hashing algorithms used to divide the generated hash key and stick the key in the location pointed to by the remainder. (This again necessitated that the no of storage locations in a hash was a prime number so as to generate unique remainder values).

So assuming your hash has 11 items, the hash key 12, would be stuck into 12%11 remainder 1 – 1st  location.

为什么使用31?

在阵列里,使用不同的key得到不同的分布。比如同样的 “abc” 使用31 就是 a*31+b*31+c*31 ,而是用29就是 a*29+b*29+c*29。因为质数不同,所以生成的key也不同,研究者发现使用31这个质数会达到更好的key分配,更少的值冲突。不过没有人知道为什么…… 33派也有人这么说33,-_-#

/*
     * This is the popular `times 33' hash algorithm which is used by
     * perl and also appears in Berkeley DB. This is one of the best
     * known hash functions for strings because it is both computed
     * very fast and distributes very well.
     *
     * The originator may be Dan Bernstein but the code in Berkeley DB
     * cites Chris Torek as the source. The best citation I have found
     * is "Chris Torek, Hash function for text in C, Usenet message
     * <27038@mimsy.umd.edu> in comp.lang.c , October, 1990." in Rich
     * Salz's USENIX 1992 paper about INN which can be found at
     * <http://citeseer.nj.nec.com/salz92internetnews.html>.
     *
     * The magic of number 33, i.e. why it works better than many other
     * constants, prime or not, has never been adequately explained by
     * anyone. So I try an explanation: if one experimentally tests all
     * multipliers between 1 and 256 (as I did while writing a low-level
     * data structure library some time ago) one detects that even
     * numbers are not useable at all. The remaining 128 odd numbers
     * (except for the number 1) work more or less all equally well.
     * They all distribute in an acceptable way and this way fill a hash
     * table with an average percent of approx. 86%.
     *
     * If one compares the chi^2 values of the variants (see
     * Bob Jenkins ``Hashing Frequently Asked Questions'' at
     * http://burtleburtle.net/bob/hash/hashfaq.html for a description
     * of chi^2), the number 33 not even has the best value. But the
     * number 33 and a few other equally good numbers like 17, 31, 63,
     * 127 and 129 have nevertheless a great advantage to the remaining
     * numbers in the large set of possible multipliers: their multiply
     * operation can be replaced by a faster operation based on just one
     * shift plus either a single addition or subtraction operation. And
     * because a hash function has to both distribute good _and_ has to
     * be very fast to compute, those few numbers should be preferred.
     *
     *                  -- Ralf S. Engelschall <rse@engelschall.com>
     */

 

冲突

有可能生成相同的key。这种情况下,单个hash存储变成一个链表或其他存储结构来保存重复的key(如果使用相同hash地址的数据过多,那么这些数据所组成的hash链就更长,从而降低了查询效率!所以在选择系数的时候要选择尽量长的系数并且让乘法尽量不要溢出的系数,因为如果计算出来的hash地址越大,所谓的“冲突”就越少,查找起来效率也会提高),所以单个hash存储 被称为bucket(桶).

更好的hash算法

上述的算法还是过于简单,有更快和更复杂的hash算法可供我们采用,例如

Paul Hseih’s Hash

Bob Jenkin’s Hash and his Dr Dobbs Article on the same

One from Google

posted @ 2014-03-21 16:22  tree.liang  阅读(564)  评论(0编辑  收藏  举报