
The Rules for GetHashCode - Monday 21 June, 2004, 11:29 PM

While we're on the subject of hash codes, The Technologist also expressed surprise at one of the features of GetHashCode (his emphasis):

"I actually learnt something that I didn't foresee: A hashcode can be duplicated (it is non unique) AND it may change during the lifetime of an object, no guarantee is made."

People often express surprise that hash codes are non-unique. But when you think about it, it's inevitable, as they're only 32 bits long. How could a 64-bit integer possibly have a different hash code for every possible value? There's just not enough space in the hash code to do that. In general, any data structure longer than 32 bits (or rather, with more than 32 bits of information in it) will have multiple values that are different according to Equals, but which hash to the same hash code.
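A minimal sketch makes the pigeonhole argument concrete: draw random 64-bit values, and stop as soon as two distinct values share a hash code. Nothing here depends on Int64's particular hash algorithm - the birthday paradox alone means a 32-bit collision typically turns up within a hundred thousand or so random values:

using System;
using System.Collections;

class FindDuplicateInt64Hashes
{
    static void Main()
    {
        Random random = new Random();
        byte[] buffer = new byte[8];
        // Maps each hash code seen so far to the long that produced it.
        Hashtable seen = new Hashtable();
        int attempts = 0;
        while (true)
        {
            random.NextBytes(buffer);
            long value = BitConverter.ToInt64(buffer, 0);
            attempts++;
            int hashCode = value.GetHashCode();
            if (seen.ContainsKey(hashCode))
            {
                long earlier = (long) seen[hashCode];
                if (earlier != value)
                {
                    Console.WriteLine("{0} and {1} both hash to {2} after {3} attempts",
                        earlier, value, hashCode, attempts);
                    break;
                }
            }
            else
            {
                seen.Add(hashCode, value);
            }
        }
    }
}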

He also observes:

"Object's implement of GetHashCode was the one that didn't generate repeatable values (for the lifetime of a given object)"

This is the one that seems (to me) more surprising. How could that be? Surely I can create any number of objects, meaning that I will eventually have used all the possible hash codes. And 'eventually' turns out not to take all that long. On my laptop, the CLR can create and GC about 20 million objects a second. To create more than 2^32 objects takes less than 4 minutes. It would therefore have to start recycling hash codes at this point - there are only 2^32 possible different hash codes, after all. Indeed it's pretty easy to show this happening:

using System;
using System.Collections;

class FindDuplicateObjectHashes
{
    static void Main(string[] args)
    {
        // Remember every hash code seen so far, so we can spot the first repeat.
        Hashtable hashCodesSeen = new Hashtable();
        while (true)
        {
            object o = new object();
            int hashCode = o.GetHashCode();
            if (hashCodesSeen.ContainsKey(hashCode))
            {
                Console.WriteLine("Hashcode seen twice: " + hashCode);
                break;
            }
            hashCodesSeen.Add(hashCode, null);
        }
    }
}

When I run this, it spots a reused hash code more or less instantly. (1740 every time, on my system.)

Nevertheless - The Technologist is correct here - the hash code does turn out to be distinct for as long as the object in question is alive. The CLR only recycles a hash code once the object that was using it has gone away.

This surprised me the first time I saw it. Surely the CLR has to do a whole load of work to track which hash codes are in use. This is a peculiar thing for it to do when the documentation for GetHashCode doesn't even offer to supply this facility. Far from it - it says:

"The default implementation of GetHashCode does not guarantee uniqueness or consistency; therefore, it must not be used as a unique object identifier for hashing purposes."

So it seems oddly generous for the CLR to be keeping track of every hash code of every live object in order to maintain a feature the documentation explicitly tells us not to expect.

Documentation Rant Begins

Mind you, you've never been able to trust the documentation for Object.GetHashCode. It has changed several times in the history of the CLR, but every time I've looked at it, it has contained some kind of self-contradiction. For example, the one up on the web right now says "Derived classes must override GetHashCode with an implementation that returns a unique hash code", which is, as already discussed, impossible for any class with more than 32 bits of information. It's also incompatible with the rule further down the same page saying "If two objects of the same type represent the same value, the hash function must return the same constant value for either object."

The current documentation contains another error which has been there for some time. It says "The hash function must return exactly the same value regardless of any changes that are made to the object." Just to clarify things, it later on says "GetHashCode must always return the same value for a given instance of the object". That seems pretty unequivocal. But think about it for a moment, in the context of the previous rule about equality of value requiring equality of hash codes. How can an object whose value can be changed meet both of these rules? If an object allows all of its properties to be modified, the only possible GetHashCode implementation that meets both of these requirements is one that returns a constant. (But that's useless. And it also violates the other rule "For the best performance, a hash function must generate a random distribution for all input.")

Just in case it's not clear, here's why an implementation that always returns the same number for any instance is the only valid one for wholly mutable objects under the three rules specified in the "Notes to Implementors" section. To save you going there, the rules are:

  1. If two objects of the same type represent the same value, the hash function must return the same constant value for either object.
  2. For the best performance, a hash function must generate a random distribution for all input.
  3. The hash function must return exactly the same value regardless of any changes that are made to the object.

Suppose I have two objects which are of different values, and I read their hash codes. Now I change one of them to be equal to the other one, and I read its hash code again. Because the objects now have the same value, they are required by the first rule to have the same hash code. But if hash codes are not allowed to change regardless of any changes that are made to the object (that's the third rule), then the two objects must have had the same hash codes before I modified the object - otherwise there's no way they could have had the same hash code afterwards too, given that they're not allowed to change.

Taking this further, given that all the properties in my object are mutable, I can make any instance equal to any other instance by simply changing its properties. In every case, the hash codes must, according to the combined effects of rules 1 and 3, be the same before and after. This means that all instances must return the same hash code, regardless of their value.
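To see the contradiction in running code, here is a hypothetical wholly mutable type with the conventional value-based Equals and GetHashCode - exactly the kind of implementation rule 1 pushes you towards. Making one instance equal to another changes its hash code, which rule 3 forbids:

using System;

// A hypothetical wholly mutable type with value-based equality.
class MutablePoint
{
    public int X;
    public int Y;

    public override bool Equals(object obj)
    {
        MutablePoint other = obj as MutablePoint;
        return other != null && other.X == X && other.Y == Y;
    }

    // Rule 1 pushes us to hash the fields so that equal values hash equally.
    public override int GetHashCode()
    {
        return X ^ (Y << 16);
    }
}

class RuleContradiction
{
    static void Main()
    {
        MutablePoint a = new MutablePoint();
        a.X = 1; a.Y = 2;
        MutablePoint b = new MutablePoint();
        b.X = 3; b.Y = 4;

        int before = a.GetHashCode();
        a.X = b.X;          // make a equal in value to b
        a.Y = b.Y;
        int after = a.GetHashCode();

        Console.WriteLine(a.Equals(b));     // True  - rule 1 is satisfied...
        Console.WriteLine(before == after); // False - ...but rule 3 is broken
    }
}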

This is clearly nonsense, and obviously an error in the docs. Indeed, a Microsoft guy acknowledged this some time ago - this post from Brad Abrams on the DevelopMentor DOTNET list back in November 2001 describes the problem. The docs have been changed slightly since then, but these changes have failed to address it.

So how did this problem come about? The real problem here, believe it or not, is the Hashtable. It requires that the hash code of any object used as a key remain constant. If you change a key object in such a way as to change its hash value, the relevant item has a tendency to become inaccessible in the Hashtable. Unfortunately, it seems there was at some stage a misguided attempt to 'fix' this problem by making it the responsibility of whoever implements GetHashCode. This is why the rules saying that hash codes are never allowed to change for any given instance were added. But the problem with this is that it means mutable objects cannot fulfil the rules unless they return a constant. (Objects with some immutable state can do slightly better - they can base their hash on the immutable data. But if they are a mixture of mutable and immutable state, the rules as stated require the mutable parts to be ignored. This may significantly impair the effectiveness of the hash function in practice.)
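Here's that failure mode in miniature, using another hypothetical mutable key type. Once the key is mutated, the Hashtable computes a different hash, probes the wrong bucket, and the entry is effectively lost:

using System;
using System.Collections;

// A hypothetical mutable key whose hash code depends on its state.
class MutableKey
{
    public int Value;

    public override bool Equals(object obj)
    {
        MutableKey other = obj as MutableKey;
        return other != null && other.Value == Value;
    }

    public override int GetHashCode()
    {
        return Value;
    }
}

class LostHashtableEntry
{
    static void Main()
    {
        Hashtable table = new Hashtable();
        MutableKey key = new MutableKey();
        key.Value = 42;
        table[key] = "payload";
        Console.WriteLine(table.ContainsKey(key));  // True

        // Change the key: its hash code changes, so the Hashtable now
        // probes the wrong bucket and the entry becomes unreachable.
        key.Value = 43;
        Console.WriteLine(table.ContainsKey(key));  // False

        // Restoring the original value makes the entry visible again.
        key.Value = 42;
        Console.WriteLine(table.ContainsKey(key));  // True
    }
}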

The right thing to do would be to document this hash constancy requirement as a feature of the Hashtable rather than trying to push it into the GetHashCode requirements. In fact that's precisely what the documentation for Hashtable does! It says "Key objects must be immutable as long as they are used as keys in the Hashtable." So it looks like the persistent impossible advice in GetHashCode is simply historical accident.

Documentation Rant Ends

So, back to this curious feature of Object's GetHashCode implementation. How and why does it provide this uniqueness characteristic which its documentation tells us that it doesn't provide? If Rotor is anything to go by, this is achieved by returning the sync block index for the object. Sync blocks are created on demand, which means that there is a cost involved the first time you ask an object for its hash code - it has to allocate a sync block. So this cost of tracking IDs is only incurred on objects you actually call GetHashCode on. (And then only if they use the base implementation in Object.)
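You can poke at this from C# without going near Rotor's source. RuntimeHelpers.GetHashCode invokes the default Object implementation even for a type that overrides GetHashCode, so it exposes the identity-based value directly. A small sketch (String.Copy is just a convenient way to get two distinct instances with equal contents):

using System;
using System.Runtime.CompilerServices;

class IdentityHashDemo
{
    static void Main()
    {
        // Two distinct string instances with identical contents.
        string a = "hello";
        string b = String.Copy(a);

        // String overrides GetHashCode to hash the contents,
        // so equal strings produce equal hash codes.
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());   // True

        // RuntimeHelpers.GetHashCode bypasses the override and returns
        // Object's identity-based hash, so the two instances (almost
        // certainly) differ.
        Console.WriteLine(RuntimeHelpers.GetHashCode(a) ==
                          RuntimeHelpers.GetHashCode(b));        // False
    }
}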

As for why it does this, that's not entirely clear to me. Possibly it's because bits of the CLR rely on this uniqueness. For example, as mentioned in an earlier entry, the Thread class guarantees that each instance's hash code remains unique as long as the thread is alive. But Thread doesn't actually override GetHashCode, so the only way it can offer this is either through internal CLR magic, or if the base implementation provides the guarantee.

Of course on 64-bit systems, the CLR can't reasonably make this guarantee for all objects. A 64-bit system might have sufficient memory to allow more than 2^32 object instances to remain reachable. (This won't happen on a 32-bit system because there isn't enough address space to hold that many objects - you'd get an out of memory exception if you tried to create that many objects and keep them all alive.) But for threads you're probably safe - even a 64-bit system will probably keel over if you try to create over 4 billion threads.


Copyright © 2002-2007, Interact Software Ltd. Content by Ian Griffiths. Please direct all Web site inquiries to webmaster@interact-sw.co.uk