Advanced Java: Dissecting HashMap

I. Review of the Basics

1. Some implementations of the Map interface:

HashMap

LinkedHashMap

TreeMap

ConcurrentHashMap

WeakHashMap

IdentityHashMap

HashMap is not thread-safe; ConcurrentHashMap is.

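2. equals and hashCode in Java

Every class in Java inherits from Object, and Object defines two fundamental methods: equals and hashCode.

equals: tests whether two objects are equal. Primitive types (byte, short, char, int, long, float, double, boolean) have no equals() method; they are compared by value with ==. For reference types, == compares addresses, and the default Object.equals() does the same. Classes such as String, Integer, and Double override equals() to compare content instead. For example, here is the equals() method that String overrides: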

public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = value.length;
            if (n == anotherString.value.length) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = 0;
                while (n-- != 0) {
                    if (v1[i] != v2[i])
                        return false;
                    i++;
                }
                return true;
            }
        }
        return false;
    }

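As the code shows, it first compares the references: if they are the same, the two strings are certainly equal. Otherwise it compares the lengths and then the contents, character by character.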
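
The difference between reference comparison and content comparison can be seen in a small sketch. The class name StringEqualsDemo and its two helper methods are made up for illustration only.

```java
public class StringEqualsDemo {
    // Reference comparison: true only when both variables point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison, as implemented by String.equals().
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "hashmap";
        String copy = new String("hashmap"); // new always creates a distinct object
        System.out.println(sameReference(literal, copy)); // false
        System.out.println(sameContent(literal, copy));   // true
    }
}
```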

hashCode: returns the object's hash code value.

String, Boolean, and Integer also override hashCode(). Here is String's version:

public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

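As you can see, for String, hashCode() is computed from the string's content.

hashCode is used for lookup, while equals is used to decide whether two objects are equal.

Remember:

1) Calling hashCode on the same object repeatedly should produce the same value, and two objects that are equal according to equals() must return the same hash code;

2) Two objects with the same hash code are not necessarily equal (different objects can hash to the same code).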
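
Rule 2 is easy to check with two short strings. The sketch below recomputes the polynomial hash from the String.hashCode() source quoted earlier; the class HashCollisionDemo and the method polyHash are hypothetical names used only for this example.

```java
public class HashCollisionDemo {
    // Recomputes String's polynomial hash: h = 31 * h + c for each character.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("Aa"));                      // 2112 (65 * 31 + 97)
        System.out.println(polyHash("BB"));                      // 2112 (66 * 31 + 66)
        System.out.println("Aa".hashCode() == "BB".hashCode());  // true
        System.out.println("Aa".equals("BB"));                   // false
    }
}
```

Equal hash codes with unequal content is exactly the case in which a hash-based collection has to fall back on equals().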

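3. How do HashSet and HashMap avoid inserting duplicates?

Neither HashSet (for elements) nor HashMap (for keys) allows duplicates. So when a new element arrives, how does the collection decide whether it duplicates one that is already present?

Comparing against the existing elements one by one would be very slow once the collection is large. How can this be made faster?

We already know that equal objects must have equal hash codes. So we first compute the object's hash code, which is cheap and leads to a bucket in O(1). If no stored element has the same hash code, the object is certainly not a duplicate and can be inserted right away. If some element does have the same hash code, it still may not be the same object (the match may just be a collision), so equals() is called as well; only when equals() also returns true is the new object a duplicate.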

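To summarize, a hash-based collection compares two objects in these steps:

1) Check whether the two objects' hash codes are equal.

If they are not equal, the objects are considered not equal. Done.
If they are equal, go to step 2).
(This step exists only for efficiency. In theory it could be skipped, but without it real-world performance would drop sharply.)

2) Check whether the two objects are equal according to equals().

If not, the objects are considered not equal.
If so, the objects are considered equal (equals() is what ultimately decides equality).

Why two rules? Wouldn't the first be enough? No: as noted above, two objects with equal hash codes may still be unequal according to equals(), so rule 2 is needed to guarantee that only non-duplicate elements are added.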
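
The two rules above are what a user-defined class has to support before it behaves correctly in a HashSet. The following is a minimal sketch, assuming a made-up value class Point; it is one reasonable way to write the pair of methods, not the only one.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class PointSetDemo {
    // Hypothetical value class: two points are equal when their coordinates match.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        // Must agree with equals(): equal points produce equal hash codes.
        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    // Adds every coordinate pair to a HashSet and returns the number of distinct points.
    static int countDistinct(int[][] coords) {
        Set<Point> set = new HashSet<>();
        for (int[] c : coords) {
            set.add(new Point(c[0], c[1]));
        }
        return set.size();
    }

    public static void main(String[] args) {
        int[][] coords = { {1, 2}, {1, 2}, {3, 4} };
        System.out.println(countDistinct(coords)); // 2
    }
}
```

If hashCode() were left at the Object default, the two (1, 2) points would usually land in different buckets and equals() would never get the chance to reject the duplicate.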

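II. Writing Our Own HashMap

Let's try writing a hash map ourselves by extending AbstractMap. The main work is implementing put(), get(), and entrySet(). For the data structure we can use an array indexed by hash position, where each stored element is a <key, value> entry that implements Map.Entry.

To compute the position, I simply take the hash code modulo the array size. The array size is usually chosen to be a prime number, which tends to reduce collisions.

Different keys can still land on the same array index, so collisions are resolved by separate chaining: each array slot holds a linked list, and a colliding entry is appended to the end of that list.

[Figure: an array of buckets, each bucket pointing to a linked list of entries]

A complete implementation would also implement the Cloneable and Serializable interfaces, as java.util.HashMap does; the simplified version below leaves them out.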

import java.util.AbstractMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.ListIterator;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class myHashMap<K, V> extends AbstractMap<K, V> {
    // static: an entry does not need a reference to the enclosing map
    static class MapEntry<K, V> implements Map.Entry<K, V> {
        private final K key;
        private V value;
        public MapEntry(K k, V v) {
            key = k;
            value = v;
        }
        public K getKey() {return key;}
        public V getValue() {return value;}
        public V setValue(V v) {
            V prevValue = value;
            value = v;
            return prevValue;
        }
        public int hashCode() {
            return (key == null ? 0 : key.hashCode()) ^ (value == null? 0 : value.hashCode());
        }
        public boolean equals(Object o) {
            // anything that is not a Map.Entry cannot be equal
            if (!(o instanceof Map.Entry)) return false;
            Map.Entry<?, ?> me = (Map.Entry<?, ?>) o;
            return Objects.equals(key, me.getKey()) && Objects.equals(value, me.getValue());
        }
    }
    static final int SIZE = 997;
    @SuppressWarnings("unchecked")
    LinkedList<MapEntry<K, V>>[] buckets = new LinkedList[SIZE];

    // Bucket index in [0, SIZE). Math.abs(h) % SIZE would be negative for
    // h == Integer.MIN_VALUE; floorMod is not. A null key hashes to 0.
    private static int indexFor(Object key) {
        return Math.floorMod(Objects.hashCode(key), SIZE);
    }

    /*
    If the key is not present, append a new entry and return null;
    otherwise replace the old entry and return the old value.
     */
    public V put(K key, V value) {
        V oldValue = null;
        int index = indexFor(key);
        if (buckets[index] == null) buckets[index] = new LinkedList<>();
        LinkedList<MapEntry<K, V>> bucket = buckets[index];
        MapEntry<K, V> newPair = new MapEntry<K, V>(key, value);
        ListIterator<MapEntry<K, V>> iter = bucket.listIterator();
        boolean find = false;
        while (iter.hasNext()) {
            MapEntry<K, V> curPair = iter.next();
            if (Objects.equals(curPair.getKey(), key)) {
                find = true;
                oldValue = curPair.getValue();
                iter.set(newPair);
                break;
            }
        }
        if (!find) bucket.add(newPair);
        return oldValue;
    }
    public V get(Object key) {
        LinkedList<MapEntry<K, V>> bucket = buckets[indexFor(key)];
        if (bucket == null) return null;   // nothing has been stored in this bucket yet
        for (MapEntry<K, V> curPair : bucket) {
            if (Objects.equals(curPair.getKey(), key)) return curPair.getValue();
        }
        return null;
    }

    // Returns a snapshot, not a live view: removing from this set does not change the map.
    public Set<Map.Entry<K, V>> entrySet() {
        Set<Map.Entry<K, V>> set = new HashSet<>();
        for (LinkedList<MapEntry<K, V>> bucket : buckets) {
            if (bucket == null) continue;
            for (Map.Entry<K, V> pair : bucket) set.add(pair);
        }
        return set;
    }

    public static void main(String[] args) {
        myHashMap<String, String> myHashMap = new myHashMap<>();
        myHashMap.put("shaw", "SDE");
        System.out.println(myHashMap.get("shaw"));
    }
}

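III. Comparing My HashMap with the JDK's HashMap

1. What my version lacks:

First, the data structure differs: HashMap is implemented as an array plus linked lists plus red-black trees (the red-black tree part was added in JDK 1.8). Second, the hashing scheme differs. I also did not consider bucket capacity: my array has a fixed size and never grows.

2. Reading the JDK 1.8 source

Data structure:

It also stores its data in an array of hash buckets, declared as transient Node<K,V>[] table: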

static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;    // used to locate the index in the array
        final K key;
        V value;
        Node<K,V> next;   // next node in the linked list

        Node(int hash, K key, V value, Node<K,V> next) { ... }
        public final K getKey(){ ... }
        public final V getValue() { ... }
        public final String toString() { ... }
        public final int hashCode() { ... }
        public final V setValue(V newValue) { ... }
        public final boolean equals(Object o) { ... }
}

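Handling collisions is the central problem of a hash map. When the array is large, entries are spread thinly, so lookups are fast but memory is wasted; when the array is small, collisions become frequent and lookups slow down. So how do we balance time against space?

This is the essence of HashMap: a good hash algorithm together with a resize mechanism.

To resize, the map has to know when to do so. HashMap keeps several fields for this: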

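     int threshold;             // maximum number of key-value pairs held before resizing
     final float loadFactor;    // load factor
     int modCount;              // number of structural modifications
     int size;                  // number of key-value pairs currently stored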

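The constructor takes the corresponding parameters: public HashMap(int initialCapacity, float loadFactor)

Let's see what threshold and loadFactor are for:

threshold is the maximum number of Nodes (key-value pairs) the HashMap can hold before it resizes: threshold = length * loadFactor, where length is the length of the table array. With the defaults (length 16, loadFactor 0.75) the threshold is 12.

In other words, once the array length is fixed, a larger load factor allows more key-value pairs to be stored.

When the number of key-value pairs exceeds the threshold, the map resizes, and the new capacity is twice the old one.

When a bucket's linked list grows too long (more than 8 nodes by default), the list is converted to a red-black tree, provided the table has at least 64 buckets (MIN_TREEIFY_CAPACITY); with a smaller table, the map resizes instead.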
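
The threshold arithmetic can be traced with a small simulation. This is a simplified model, not the JDK code: the class ThresholdDemo and its methods are invented for the example, and it ignores MAXIMUM_CAPACITY and resizes triggered by treeification.

```java
public class ThresholdDemo {
    // threshold = capacity * loadFactor, truncated to an int
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Simulates the table capacity after inserting `entries` distinct keys,
    // doubling whenever the size exceeds the current threshold.
    static int capacityAfter(int entries, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        for (int size = 1; size <= entries; size++) {
            if (size > threshold(capacity, loadFactor)) {
                capacity <<= 1; // double the table
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f));          // 12
        System.out.println(capacityAfter(12, 16, 0.75f));  // 16: the 12th entry still fits
        System.out.println(capacityAfter(13, 16, 0.75f));  // 32: the 13th entry triggers a resize
        System.out.println(capacityAfter(100, 16, 0.75f)); // 256
    }
}
```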

Hash algorithm implementation:

1) Locating the array index:

Here is how the JDK does it:

    static final int hash(Object key) {
        int h;
        // h = key.hashCode()
        // h ^ (h >>> 16) XORs the high 16 bits into the low 16 bits
        // >>> is the unsigned right shift operator
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16); 
    }

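So locating a bucket takes three steps: take the key's hashCode, XOR its high bits into its low bits, and reduce the result modulo the table length.

In JDK 1.8 the modulo step lives in put() (more precisely in putVal(), which put() calls): p = tab[i = (n - 1) & hash], where (n - 1) & hash plays the role of the modulo. Because the table length n is always a power of two, (n - 1) & hash equals hash % n for non-negative hash values (and Math.floorMod(hash, n) in general), while & is cheaper than %. The source code is hard not to admire!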
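
Both claims can be checked directly. In the sketch below, spread mirrors the body of HashMap.hash() for a non-null key and indexFor mirrors the (n - 1) & hash expression; the class and method names are chosen for this example and are not part of the JDK API.

```java
public class IndexDemo {
    // Same spreading step as HashMap.hash(): fold the high 16 bits into the low 16.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length n is a power of two.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int[] samples = { 0, 1, 17, -1, 123456789, Integer.MIN_VALUE };
        for (int h : samples) {
            // For a power-of-two n the mask agrees with the non-negative remainder.
            System.out.println(h + " -> " + indexFor(h, n)
                    + " (floorMod: " + Math.floorMod(h, n) + ")");
        }
        // Two hash codes that differ only in their high bits collide without spreading.
        int a = 0x10000, b = 0x20000;
        System.out.println(indexFor(a, n) == indexFor(b, n));                 // true
        System.out.println(indexFor(spread(a), n) == indexFor(spread(b), n)); // false
    }
}
```

The last two lines show why the high bits are folded in: with a small table only the low bits select the bucket, so hash codes that differ only higher up would otherwise always collide.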

HashMap's resize mechanism:

Let's look at the source of HashMap's resize() (JDK 1.8):

/**
     * Initializes or doubles table size.  If null, allocates in
     * accord with initial capacity target held in field threshold.
     * Otherwise, because we are using power-of-two expansion, the
     * elements from each bin must either stay at same index, or move
     * with a power of two offset in the new table.
     *
     * @return the table
     */
    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        if (oldCap > 0) {
            if (oldCap >= MAXIMUM_CAPACITY) {
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            newCap = oldThr;
        else {               // zero initial threshold signifies using defaults
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

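HashMap grows by doubling, so the capacity is always a power of two. The index is computed as (n - 1) & hash; when n doubles (a left shift by one bit), the mask n - 1 gains one more bit at the high end, so one additional bit of the hash takes part in the index:

[Figure: the mask n - 1 before and after resizing, with one extra high bit]

Because only one bit is added, each entry changes in a simple way after the resize: if the newly included bit of its hash is 0, the index stays the same; if it is 1, the index becomes "old index + oldCap".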
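
The "stay or move by oldCap" rule can be verified without touching HashMap itself. This sketch compares the shortcut used in resize(), the (e.hash & oldCap) test, against recomputing the index with the doubled table length; ResizeSplitDemo and its methods are hypothetical names for this example.

```java
public class ResizeSplitDemo {
    // Bucket index for a power-of-two table length n.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    // Predicts the new index the way resize() does: test the single bit oldCap.
    static int splitIndex(int hash, int oldCap) {
        int oldIndex = indexFor(hash, oldCap);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        // All four hashes share bucket 5 in a table of length 16.
        int[] hashes = { 5, 21, 37, 53 };
        for (int h : hashes) {
            System.out.println(h + ": old index " + indexFor(h, oldCap)
                    + ", new index " + splitIndex(h, oldCap));
        }
        // new indexes: 5 -> 5, 21 -> 21, 37 -> 5, 53 -> 21
    }
}
```

This is why resize() can split each bucket into a "lo" list and a "hi" list in one pass instead of rehashing every key.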

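Summary:

Internally, HashMap is built on an array, and each element of the array is called a bucket. When the number of stored key-value pairs exceeds the threshold determined by the load factor and the array capacity, the array is resized to twice its previous capacity, and the entries in the table are redistributed: each one either stays at its index or moves by oldCap.

Because two different keys can collide when hashed, HashMap includes several optimizations to limit the impact. When hashing, it XORs the high bits with the low bits to reduce the chance of collisions. When different nodes land in the same bucket, the bucket stores them in a singly linked list. In the Java 8 implementation, if the number of nodes in one bucket exceeds the threshold (TREEIFY_THRESHOLD = 8), the list is converted to a red-black tree; when the count falls back to the lower threshold (UNTREEIFY_THRESHOLD = 6), the tree is converted back to a singly linked list.

The analysis of resize() shows that resizing is an expensive operation. When using HashMap, estimate the size of the map in advance and pass a suitable initial capacity to the constructor, so that the map does not resize repeatedly.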
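
One way to apply this advice is to derive the constructor argument from the expected number of entries. The sketch below assumes the default load factor of 0.75; capacityFor is a made-up helper, not a JDK method. HashMap rounds the requested capacity up to the next power of two on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    // Smallest initialCapacity for which `expected` entries stay within the
    // threshold, assuming the default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        int expected = 1000;
        // capacityFor(1000) is 1334; HashMap rounds that up to a table of 2048
        // buckets (threshold 1536), so the loop below never triggers a resize.
        Map<Integer, String> map = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size()); // 1000
    }
}
```

Passing the expected count itself (new HashMap<>(1000)) gives a table of 1024 buckets with a threshold of 768, so the map would still resize once while being filled.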

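IV. Some Interview Questions:

1. Is HashMap thread-safe? Why not?

No. HashMap performs no synchronization, so concurrent writes can lose updates or leave the table in an inconsistent state. In JDK 1.7, a concurrent resize() could even link a bucket's list into a cycle, causing an infinite loop. The JDK 1.8 resize() shown above preserves node order, which removes that particular failure, but HashMap is still not thread-safe.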
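
When several threads must update one map, ConcurrentHashMap from the list in section I is the usual replacement. The sketch below is an illustration under simple assumptions (one shared counter key, a hypothetical class ConcurrentCountDemo); it relies on ConcurrentHashMap.merge being atomic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentCountDemo {
    // Starts `threads` workers that each increment the same key `perThread` times,
    // waits for them, and returns the final count.
    static int countWith(Map<String, Integer> map, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.merge("hits", 1, Integer::sum); // atomic on ConcurrentHashMap
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return map.get("hits");
    }

    public static void main(String[] args) {
        Map<String, Integer> safe = new ConcurrentHashMap<>();
        System.out.println(countWith(safe, 4, 10_000)); // 40000
    }
}
```

Passing a plain HashMap to countWith instead is not safe: the result is unpredictable and typically falls short of 40000 because increments are lost.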

2. How does HashMap resize?

See the resize section above.

References:

https://tech.meituan.com/java-hashmap.html

posted @ 2018-04-04 13:58 shawshawwan

