
How Bloom Filters Work

 

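Scenario: BloomFilter, an algorithm for deduplicating very large data sets.

Advantages: high space efficiency. It stores only bits rather than the data itself, so the original values are not exposed.

Disadvantages: the more elements are inserted, the higher the false-positive rate, and elements cannot be deleted.

Typical application: guarding against cache penetration, where requests for keys that do not exist bypass the cache and hit the database. With billions of keys, holding the full key set in a cache such as redis is impractical.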
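As a rough illustration of that application, the sketch below puts a small Bloom filter in front of a slow lookup. Everything in it is an assumption made for the example: the class name GuardedLookup, the 2^20-bit filter, the four seeds, and the HashMap that stands in for the real database. A key rejected by the filter is definitely absent, so the lookup is skipped; a key accepted by the filter may still be a false positive, so the store gives the final answer.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (all names hypothetical): a Bloom filter guarding a slow lookup. */
final class GuardedLookup {
    private static final int BITS = 1 << 20;          // filter size in bits, a power of two
    private static final int[] SEEDS = {3, 5, 7, 11}; // one seed per hash function

    private final BitSet filter = new BitSet(BITS);
    private final Map<String, String> database = new HashMap<>(); // stands in for the real store
    int databaseHits = 0;                             // counts lookups that reached the store

    /** Same multiply-and-add string hash as the demo in this post, masked into [0, BITS). */
    private static int hash(String key, int seed) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * seed + key.charAt(i);
        }
        return h & (BITS - 1);
    }

    /** Stores the value and records the key in the filter. */
    void put(String key, String value) {
        database.put(key, value);
        for (int seed : SEEDS) {
            filter.set(hash(key, seed));
        }
    }

    /** Returns the stored value, or null if the key is absent. */
    String get(String key) {
        for (int seed : SEEDS) {
            if (!filter.get(hash(key, seed))) {
                return null; // definitely absent: skip the expensive lookup
            }
        }
        databaseHits++;      // possibly present: the store gives the final answer
        return database.get(key);
    }

    public static void main(String[] args) {
        GuardedLookup store = new GuardedLookup();
        store.put("user:1", "alice");
        System.out.println(store.get("user:1")); // passes the filter and reaches the store
        System.out.println(store.get("user:2")); // rejected by the filter unless it is a false positive
        System.out.println("store lookups: " + store.databaseHits);
    }
}
```

Note that every key written to the store must also be added to the filter; otherwise existing keys would be rejected.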

 

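Overall idea, using url deduplication as an example:

1. Create an empty bitmap (every bit is 0).

2. Hash the url several times with different hash functions (8 in the demo below).

3. Set the bit at each resulting position in the bitmap to 1.

A second url is processed in the same way.

4. To test a url, compute r = HashA(url), HashB(url), HashC(url) and check BitMap[r] for each result. If every bit is 1, the url is treated as a duplicate; if any bit is 0, the url has definitely not been seen before.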
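The four steps can be traced on a deliberately tiny bitmap. The sketch below is only an illustration under stated assumptions: the 16-bit bitmap, the three seeds standing in for HashA, HashB and HashC, the example urls, and all class and method names are choices made for this example, not part of the original demo.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy walk-through of the four steps; sizes and names are illustrative only. */
final class TinyBloomWalkthrough {
    static final int SIZE = 16;           // deliberately tiny so the set bits are easy to read
    static final int[] SEEDS = {3, 5, 7}; // stand-ins for HashA, HashB, HashC

    /** Step 2: hash the url once per seed, mapping each result into [0, SIZE). */
    static int[] positions(String url) {
        int[] result = new int[SEEDS.length];
        for (int s = 0; s < SEEDS.length; s++) {
            int h = 0;
            for (int i = 0; i < url.length(); i++) {
                h = h * SEEDS[s] + url.charAt(i);
            }
            result[s] = h & (SIZE - 1); // SIZE is a power of two, so this keeps the low 4 bits
        }
        return result;
    }

    /** Step 3: set the bit at every hashed position. */
    static void add(BitSet bitmap, String url) {
        for (int p : positions(url)) {
            bitmap.set(p);
        }
    }

    /** Step 4: report a duplicate only if every hashed position is already 1. */
    static boolean mightContain(BitSet bitmap, String url) {
        for (int p : positions(url)) {
            if (!bitmap.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bitmap = new BitSet(SIZE); // Step 1: an empty bitmap
        String first = "http://example.com/a";
        String second = "http://example.com/b";

        add(bitmap, first);
        System.out.println(first + " -> bits " + Arrays.toString(positions(first)));
        add(bitmap, second);
        System.out.println(second + " -> bits " + Arrays.toString(positions(second)));

        System.out.println("bitmap: " + bitmap);
        System.out.println("first url seen before? " + mightContain(bitmap, first));
    }
}
```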

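A false positive occurs when a new url happens to hash to positions that earlier urls have already set, for example 5, 9 and 12, so the new url is wrongly treated as a duplicate. A whitelist of known false positives can be kept to handle such cases.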
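How likely such false positives are can be estimated with the standard approximation p ≈ (1 - e^(-k*n/m))^k for a filter with m bits, k hash functions and n inserted elements. The approximation assumes the hash functions are independent and uniformly distributed, which a simple string hash only roughly satisfies. The sketch below evaluates it; the class and method names are illustrative, and m = 2^25 with k = 8 are chosen to match the demo that follows.

```java
/** Illustrative helper (not part of the original demo): estimates Bloom filter false-positive rates. */
final class FalsePositiveRate {

    /**
     * Standard approximation p = (1 - e^(-k*n/m))^k,
     * assuming k independent, uniformly distributed hash functions.
     *
     * @param m number of bits in the filter
     * @param k number of hash functions
     * @param n number of inserted elements
     */
    static double estimate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 1L << 25; // number of bits, same as 2 << 24
        int k = 8;         // number of hash functions
        for (long n : new long[]{1_000_000L, 5_000_000L, 10_000_000L}) {
            System.out.printf("n = %,d -> p = %.6f%n", n, estimate(m, k, n));
        }
    }
}
```

Under these assumptions the estimate grows from a few in a million at one million elements to a little under one half at ten million, which is why the filter size has to be chosen with the expected element count in mind.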

 

demo:

package com.example.demo.bloomFilter;

import java.util.BitSet;

public class BloomFilter {
    /**
     * Default bitmap size in bits: 2 << 24, i.e. 2 * Math.pow(2, 24) = 2^25
     */
    private static final int DEFAULT_SIZE = 2 << 24;

    /**
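     * Seeds for the hash functions. All are primes, to reduce collisions
     * between the hash functions, for example:
     * 3: 0011
     * 5: 0101
     */
    private static final int[] seeds = new int[]{3, 5, 7, 11, 13, 17, 19, 23};
    private static Hash[] hashAr = new Hash[seeds.length];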

    static {
        for (int i = 0; i < seeds.length; i++) {
            hashAr[i] = new Hash(seeds[i]);
        }
    }

    /**
     * Bit set that records the results of the hash functions
     */
    private BitSet bitSet = new BitSet(DEFAULT_SIZE);

    /**
     * Hashes the string with every hash function and sets the resulting bits in bitSet
     *
     * @param content the string to add
     */
    public void add(String content) {
        for (Hash h : hashAr) {
            bitSet.set(h.getHash(content));
        }
    }

    /**
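     * Checks whether the string may have been added
     *
     * @param content the string to look up
     * @return false if the string was definitely never added; true if it was probably added (false positives are possible)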
     */
    public boolean contains(String content) {
        boolean have = true;
        for (Hash hash : hashAr) {
            have &= bitSet.get(hash.getHash(content));
        }
        return have;
    }

    public static void main(String[] args) {
        String email = "xiaozhuanfeng@126.com";
        BloomFilter bloomDemo = new BloomFilter();
        System.out.println(email + " in list: " + bloomDemo.contains(email));
        bloomDemo.add(email);
        System.out.println(email + " in list: " + bloomDemo.contains(email));
        email = "xiaozhuanfeng@163.com";
        System.out.println(email + " in list: " + bloomDemo.contains(email));
    }

    private static class Hash {
        private int seed = 0;

        public Hash(int seed) {
            this.seed = seed;
        }

        public int getHash(String string) {
            int val = 0;
            int len = string.length();
            for (int i = 0; i < len; i++) {
                // Multiply by the seed, then add the character code (ASCII value for ASCII input)
                val = val * seed + string.charAt(i);
            }

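            // DEFAULT_SIZE is a power of two, so masking with (DEFAULT_SIZE - 1) keeps the low bits
            // and maps val into [0, DEFAULT_SIZE), even if val has overflowed to a negative number.
            // Note on & vs &&: && short-circuits (the second expression is skipped when the first is false),
            // while & and &= always evaluate both sides, so take care with statements like val &= function().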
            return val & (DEFAULT_SIZE - 1);
        }
    }
}

 

Reference:

https://mp.weixin.qq.com/s?__biz=MzIxMjE5MTE1Nw==&mid=2653191316&idx=1&sn=6b407704c99bda58440e97a2d6dd6ee9&chksm=8c990e4ebbee8758bf207b7fed8267bc1bda957f5864c00b467e2de6f0ae93563740b5527f25&mpshare=1&scene=1&srcid=0927TOixl26f0xogheOaXM1x&key=c38ae561692275b4c85347d76b993d2eeb8bdeaea465676770fb28835462fbc7d92f66816cbf4adb29af15b479e88b00109901f88a846c4c5c921bd228fd1dfa37cdee015d81561d5052c7f31230447c&ascene=0&uin=MjE4MTczNDcwMA%3D%3D&devicetype=iMac+MacBookAir6%2C1+OSX+OSX+10.12.5+build(16F73)&version=12020

posted @ 2019-05-13 19:07  小传风