Hashing
1. Introduction
Hashing is a technique used for performing insertions, deletions, and finds in constant average time.
Hash function
Each key is mapped into some number in the range 0 to TableSize -1 and placed in the appropriate cell. And this mapping is called a hash function since there are a finite number of cells and a virtually inexhaustible supply of keys, this is clearly impossible, and thus we seek a hash function that distributes the keys evenly among the cells.
The only remaining problems :
- deal with choosing a function
- deciding what to do when two keys hash to the same value
2. Hash Function
if the input keys are integers, then simply returning key mod TableSize is generally a reasonable strategy, unless Key happens to have some undesirable properties. In this case, the choice of hash function needs to be carefully considered. For instance, if the table size the is 10 and the keys all end in zero, then the standard hash function is a bad choice. For reasons we shall see later, and to avoid situations like the one above, it is often a good idea to ensure that the table size is prime.
Usually, the Keys are strings; in this case, the hash function needs to be chosen carefully. The best way use Horner's rule. If the keys are very long, the hash function will take too long to compute. A common practice in this case is not to use all the characters. Some programmers implement their hash function by using only the characters in the odd spaces, with the idea that the time saved computing the hash function will make up for a slightly less evenly distributed function.
The main programming detail left is collision resolution. If, when an element is inserted, it hashes to the same value as an already inserted element, then we have a collision and need to resolve it. There are several methods for dealing with this. We will discuss tow of the simplest: separate chaining and open addressing.
3. Separate Chaining
The first strategy, commonly known as separate chaining, is to keep a list of all elements that hash to the same value.
- To perform a search, we use the hash function to determine which list to traverse. we then search the appropriate list.
- To perform a insert, we check the appropriate list to see whether the element is already in place. If the element turns out to be new, it can be inserted at the front of the list, since it is convenient and also because frequently it happens that recently inserted elements are the most likely to be accessed in the near future.
4. Hash Tables without Linked Lists
Separate chaining hashing has the disadvantage of using linked lists. This could slow the algorithm down a bit because of the time required to allocate new cells, and also essentially requires the implementation of a second data structure.
There are three common collision resolution strategies alternatively:
- Linear Probing
- Quadratic Probing
- Double Hashing
5. Rehashing
If the table gets too full, the running time for the operations will start taking too long and insertions might fail for open addressing hashing with quadratic resolution. This can happen if there are too many removals intermixed with insertions. A solution, then, is to build another talbe that is about twice as big and scan down the entire original hash table, computing the new hash value for each element and inserting it in the new table.
This entire operation is called rehashing.
Rehashing can be implemented in several ways with quadratic probing:
- rehash as soon as the table is half full
- rehash only when an insertion fails
- rehash when the table reaches a certain load factor, Since performance does degrade as the load factor increase, the third strategy, implemented with a good cutoff, could be best.