From Hash Storage to Bloom Filter

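First, a word on what a hash function is. Put simply, a hash function is a mapping whose domain (the set of values it accepts) is usually very large, while its range is comparatively small. Its job is to map values from the large domain into the small range.

Traditional hash storage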

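Suppose the set to be hashed is S, with n elements. The traditional approach organizes the hash area as a list of h (h > n) cells, each able to hold one element of S. To store S, each element is mapped into the range {0, 1, ..., h-1}, and the element is placed in the cell at that index. Because the hash function maps a large set onto a small one, several elements may be mapped to the same position; this is called a collision. Several strategies exist for handling collisions, such as chaining the elements that map to the same position in a linked list, or rehashing until an empty cell is found.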
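The chaining strategy described above can be sketched as follows. This is only a minimal illustration, not code from the original article: the class name ChainedHashSet, the use of String elements and the choice of String.hashCode() as the hash function are all assumptions.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical illustration: h cells, collisions resolved by chaining.
class ChainedHashSet {
    private final List<String>[] cells;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int h) {
        cells = new List[h];
        for (int i = 0; i < h; i++) {
            cells[i] = new LinkedList<>();
        }
    }

    // Maps an element into {0, 1, ..., h-1}.
    private int index(String element) {
        return Math.floorMod(element.hashCode(), cells.length);
    }

    // Stores the whole element in the chain of its cell.
    void add(String element) {
        List<String> chain = cells[index(element)];
        if (!chain.contains(element)) {
            chain.add(element);
        }
    }

    // Compares against the full stored elements, so the answer is always exact.
    boolean contains(String element) {
        return cells[index(element)].contains(element);
    }
}
```

Because every cell keeps complete elements, membership answers are exact, at the cost of storing the whole set.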

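Traditional hashing makes no errors, and the stored elements can be recovered. With a well-chosen hash function collisions are rare, so lookups are fast. However, if the only reason for hashing a set is to test whether some element belongs to it, storing the entire set looks wasteful. To test membership, the traditional method compares the element with whatever is stored at its mapped position: a full match means the element is in the set, a mismatch means it is not. When the vast majority of lookups fail to match (which is often the case in practice), the comparison rarely needs the whole element, because a part of it is enough to establish the mismatch. Building on this idea that partial information suffices to detect a mismatch, Burton Bloom (the inventor of the Bloom Filter) proposed an improved method.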

Improved hash storage

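In the improved method the hash area is still organized as a list of cells, as before. This time, however, the elements themselves are not stored in the cells; each element is encoded and only its code is stored. Suppose each element takes b bits and its code takes c (c < b) bits. Because the code is shorter than the element, different elements may share the same code, so a lookup can return a wrong answer. The code length depends on the error rate you are willing to accept: the more bits in the code, the fewer the errors, and the fewer bits, the more errors. Once the error rate is small enough (about 2^-b), the code is long enough to hold the whole element, and the scheme reduces to traditional hash storage.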
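A rough sketch of this idea follows. It is not Bloom's original construction; it assumes linear probing for collisions, String elements, and a c-bit code taken from a remixed String.hashCode(). The class name CodedHashSet and both helper hashes are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: each cell keeps only a c-bit code of an element.
class CodedHashSet {
    private final int[] cells;   // -1 marks an empty cell
    private final int codeMask;  // keeps the low c bits of the code

    CodedHashSet(int h, int c) {
        if (c < 1 || c > 16) {
            throw new IllegalArgumentException("c must be in 1..16");
        }
        cells = new int[h];
        Arrays.fill(cells, -1);
        codeMask = (1 << c) - 1;
    }

    // Home cell of an element.
    private int slot(String e) {
        return Math.floorMod(e.hashCode(), cells.length);
    }

    // c-bit code taken from the high bits of a remixed hash (illustrative choice).
    private int code(String e) {
        int x = e.hashCode() * 0x9E3779B9;
        return (x >>> 16) & codeMask;
    }

    void add(String e) {
        int code = code(e);
        int i = slot(e);
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[i] == code) return;                    // same code already stored
            if (cells[i] == -1) { cells[i] = code; return; } // first empty cell
            i = (i + 1) % cells.length;
        }
        throw new IllegalStateException("hash area is full");
    }

    // May return true for an element never added if its code matches a stored one.
    boolean mightContain(String e) {
        int code = code(e);
        int i = slot(e);
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[i] == -1) return false;
            if (cells[i] == code) return true;
            i = (i + 1) % cells.length;
        }
        return false;
    }
}
```

A larger c makes accidental code matches rarer but costs more space per cell, which is the trade-off the paragraph describes.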

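This method improves on traditional hash storage by letting the user trade error rate against storage space. The outline of the Bloom Filter is already visible here. If this method contains the germ of the idea of trading accuracy for space, the Bloom Filter puts that idea into practice more boldly: it abandons the traditional cell-based storage altogether and goes a step further in both space usage and error reduction.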

Bloom Filter

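In a Bloom Filter, every bit of the hash area is treated as an independently addressable unit. To encode an element, several independent hash functions are applied at once, and the bit at each address they produce is set to 1. This encoding takes a different path and abandons the earlier cell-by-cell storage. In the improved hash storage, the code length was the bargaining chip traded against accuracy; in a Bloom Filter, the chips are the number of hash functions and the size of the whole hash area (the bit array). For how to choose a suitable number of hash functions and bit array size, see the first article in this series, "Bloom Filter概念和原理" (Bloom Filter Concepts and Principles).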

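Compared with the two hash storage methods above, the Bloom Filter's biggest advantage is of course its space efficiency. In addition, because a Bloom Filter does not need to resolve collisions, adding or looking up an element takes a fixed amount of work (the time to compute the hash functions), no matter how many elements have already been added to the bit array. Because both the Bloom Filter and the improved hash storage encode the elements, recovering the elements from the hash area is not easy. On the other hand, if you would rather not expose the elements directly, this encoding can be seen as a rough form of obfuscation that offers some privacy, though it is not encryption in the cryptographic sense.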

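A major drawback of the Bloom Filter is that elements cannot be deleted. Because collisions are not resolved, the same bit may have been set by several hash functions for different elements, so deletion cannot simply reset the 1s to 0. As we will see later, the Counting Bloom Filter solves this by expanding each bit into a counter.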
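The counter idea can be sketched as below. This is a minimal illustration under assumptions: the class name CountingBloomFilter, the int-sized counters and the seeded polynomial string hash are hypothetical choices, not taken from any particular library.

```java
// Hypothetical sketch of a Counting Bloom Filter: each bit becomes a counter.
class CountingBloomFilter {
    private final int[] counters;
    private final int k;

    CountingBloomFilter(int m, int k) {
        this.counters = new int[m];
        this.k = k;
    }

    // i-th hash: polynomial string hash whose multiplier depends on i (illustrative).
    private int index(String key, int i) {
        int h = 0;
        for (int j = 0; j < key.length(); j++) {
            h = (31 + 2 * i) * h + key.charAt(j);
        }
        return Math.floorMod(h, counters.length);
    }

    // Increments the k counters of the key.
    void add(String key) {
        for (int i = 0; i < k; i++) {
            counters[index(key, i)]++;
        }
    }

    // True if all k counters are non-zero; false positives remain possible.
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (counters[index(key, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    // Decrements the k counters. Should only be called for keys that were really
    // added; removing a false positive would corrupt the counters of other keys.
    void remove(String key) {
        if (!mightContain(key)) {
            return;
        }
        for (int i = 0; i < k; i++) {
            counters[index(key, i)]--;
        }
    }
}
```

Removing a key only lowers counters that the same key raised earlier, so counters shared with other keys stay above zero and those keys are still reported as present.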

A Java implementation is available as a library: http://code.google.com/p/java-bloomfilter/

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k − 1) to a hash function that takes an initial value; or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
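As a rough illustration of the double hashing idea mentioned above, the sketch below derives k indices as (h1 + i * h2) mod m from two base hashes. The class name DoubleHashingBloomFilter and the choice of String.hashCode() and an FNV-1a style hash as the two base functions are assumptions made for this example; how independent they are in practice is not claimed here.

```java
import java.util.BitSet;

// Hypothetical sketch: k indices derived from two base hashes by double hashing.
class DoubleHashingBloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    DoubleHashingBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // FNV-1a style hash over the UTF-16 code units of the key.
    private static int fnv1a(String key) {
        int h = 0x811C9DC5;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // Index i is (h1 + i * h2) mod m, computed in long to avoid int overflow.
    private int[] indices(String key) {
        long h1 = key.hashCode();
        long h2 = fnv1a(key) | 1; // forced odd so the step itself is never zero
        int[] idx = new int[k];
        for (int i = 0; i < k; i++) {
            idx[i] = (int) Math.floorMod(h1 + i * h2, (long) m);
        }
        return idx;
    }

    // Sets the k bits of the key.
    void add(String key) {
        for (int i : indices(key)) {
            bits.set(i);
        }
    }

    // False means definitely absent; true means present or a false positive.
    boolean mightContain(String key) {
        for (int i : indices(key)) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, `new DoubleHashingBloomFilter(18, 3)` matches the m and k of the figure described below, although the bit positions it produces will differ from the figure's.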

Removing an element from this simple Bloom filter is impossible because false negatives are not permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices to remove the element, it also results in removing any other elements that happen to map onto that bit. Since there is no way of determining whether any other elements have been added that affect the bits for an element to be removed, clearing any of the bits would introduce the possibility for false negatives.

An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3.

Today we introduce a mathematical tool called the Bloom filter, which solves the same problem using only 1/8 to 1/4 of the space of a hash table.

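The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector together with a series of random mapping functions. We use the example above to explain how it works.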

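Suppose we need to store 100 million email addresses. We first create a vector of 1.6 billion bits, that is, 200 million bytes, and set all 1.6 billion bits to zero. For each email address X, we use eight different random number generators (F1, F2, ..., F8) to produce eight fingerprints (f1, f2, ..., f8). A further random number generator G then maps these eight fingerprints to eight natural numbers g1, g2, ..., g8 between 1 and 1.6 billion. We set the bits at these eight positions to one. Once all 100 million email addresses have been processed this way, a Bloom filter for these addresses has been built (see the figure below).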

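Now let us see how to use the Bloom filter to check whether a suspicious email address Y is on the blacklist. We apply the same eight random number generators (F1, F2, ..., F8) to the address to produce eight fingerprints s1, s2, ..., s8, and map them to eight bits of the Bloom filter, t1, t2, ..., t8. If Y is on the blacklist, the bits at t1, t2, ..., t8 must all be one. Any address that is on the blacklist is therefore always detected.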

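A Bloom filter never misses a suspicious address that is on the blacklist. It does have one shortcoming: with small probability it may judge an address that is not on the blacklist to be on it, because a legitimate address may happen to map to eight bits that have all been set to one. Fortunately this is unlikely. We call this the false positive probability. In the example above, the usual estimate (1 - e^(-kn/m))^k with n = 100 million, m = 1.6 billion and k = 8 puts it at roughly 0.06%, that is, a few in ten thousand.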
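The figures in this example can be checked with the standard estimate p = (1 - e^(-kn/m))^k, where n is the number of stored elements, m the number of bits and k the number of hash functions. The helper below is a small sketch for that arithmetic; the class and method names are hypothetical.

```java
// Hypothetical helper for the sizing arithmetic of the email example.
class FalsePositiveEstimate {

    // Standard estimate of the false positive probability: (1 - e^(-k*n/m))^k.
    static double rate(long n, long m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Size of an m-bit vector in bytes, rounded up.
    static long bytesFor(long m) {
        return (m + 7) / 8;
    }
}
```

With n = 100 million, m = 1.6 billion and k = 8, `bytesFor(m)` gives the 200 million bytes quoted above, and `rate(n, m, k)` evaluates to roughly 0.00057.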

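The strengths of the Bloom filter are speed and small space, at the cost of a certain false positive rate. A common remedy is to build a small whitelist that stores the addresses likely to be misjudged.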
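One way to picture the whitelist remedy is the sketch below. It is an assumption about how such a check could be wired up, not a description of any real system: the class name WhitelistedBlacklist is hypothetical, and the Bloom filter is represented only by a Predicate so that any "might contain" test can be plugged in.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: an exact whitelist consulted before the Bloom filter.
class WhitelistedBlacklist {
    private final Predicate<String> bloomFilter;          // "might be blacklisted" test
    private final Set<String> whitelist = new HashSet<>(); // known false positives

    WhitelistedBlacklist(Predicate<String> bloomFilter) {
        this.bloomFilter = bloomFilter;
    }

    // Records an address that the filter is known to misjudge.
    void allow(String address) {
        whitelist.add(address);
    }

    // An address is blocked only if the filter flags it and it is not whitelisted.
    boolean isBlocked(String address) {
        return !whitelist.contains(address) && bloomFilter.test(address);
    }
}
```

The whitelist stays small because it only needs to hold the addresses that were actually observed to be misjudged.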

http://zfsn.iteye.com/blog/747828

package com.huigao.util;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.BitSet;

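public class BloomFilter {

    // Number of bits in the filter. The original expression 5000 << 10000 shifts
    // by only 10000 % 32 = 16 in Java, giving 327,680,000. That is not a power of
    // two, so (defaultSize - 1) did not work as a bit mask and most positions were
    // never used. A power of two is used here instead.
    private int defaultSize = 1 << 28;

    // Mask that reduces a hash value to the range [0, defaultSize).
    private int basic = defaultSize - 1;

    private String key = null;

    private BitSet bits = new BitSet(defaultSize);

    public BitSet getBits() {
        return bits;
    }

    public void setBits(BitSet bits) {
        this.bits = bits;
    }

    public BloomFilter() {
    }

    public BloomFilter(String key) {
        this.key = key;
    }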

    // Computes the eight bit positions of a key, one per hash function (Q = 1..8).
    private int[] lrandom(String key) {
        int[] randomsum = new int[8];
        for (int i = 0; i < randomsum.length; i++) {
            randomsum[i] = hashCode(key, i + 1);
        }
        return randomsum;
    }

    /* Disabled variant from the original: all eight positions come from the same
       hash, so it behaves like a filter with a single hash function.
    private int[] sameLrandom() {
        int[] randomsum = new int[8];
        java.util.Arrays.fill(randomsum, hashCode(key, 1));
        return randomsum;
    } */

    // Sets the eight bits of the key, unless the filter already reports it as present.
    private void add(String key) {
        if (exist(key)) {
            System.out.println("Already contains (" + key + ")");
            return;
        }
        for (int index : lrandom(key)) {
            bits.set(index);
        }
    }

    // Returns true only if all eight bits of the key are set.
    // A true result may be a false positive; a false result is always correct.
    private boolean exist(String key) {
        for (int index : lrandom(key)) {
            if (!bits.get(index)) {
                return false;
            }
        }
        return true;
    }

    // Deletion by clearing bits, left disabled: clearing a bit that is shared with
    // another key would make the filter report that key as absent.
    // private boolean set0() {
    //     if (exist()) {
    //         for (int index : lrandom()) {
    //             bits.clear(index);
    //         }
    //         return true;
    //     }
    //     return false;
    // }

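    // Polynomial string hash. Q selects the multiplier (30 + Q), which gives a
    // different hash function for each value of Q.
    private int hashCode(String key, int Q) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (30 + Q) * h + key.charAt(i);
        }
        return changeInteger(h);
    }

    // Reduces a hash value to a non-negative bit index by masking with basic.
    private int changeInteger(int h) {
        return basic & h;
    }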

    // Serializes the bit set to the given file, overwriting any existing content.
    public void saveBit(String filename) {
        File file = new File(filename);
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file, false))) {
            oos.writeObject(bits);
            oos.flush();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Loads a previously saved bit set, or returns an empty one if the file does not exist.
    public BitSet readBit(String filename) {
        BitSet bits = new BitSet(defaultSize);
        File file = new File(filename);
        if (!file.exists()) {
            return bits;
        }
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file))) {
            bits = (BitSet) ois.readObject();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        return bits;
    }

    // Loads the saved filter (if any), adds one URL, checks it, and saves the filter again.
    public static void main(String[] args) {
        String fileName = "c:\\test\\BloomFilter.txt";
        String url = "http://www.agrssdddd.com/";
        BloomFilter bf = new BloomFilter();
        BitSet bitSet = bf.readBit(fileName);
        bf.setBits(bitSet);
        bf.add(url);
        System.out.println(bf.exist(url));
        bf.saveBit(fileName);
        /* BloomFilter f = new BloomFilter("http://www.agrilink.cn/");
        f.add();
        System.out.println(f.exist()); */
        // f.set0();
        // System.out.println(f.exist());
    }
}

posted on 2011-08-15 10:19 by zhizhesky
