Bloom Filter js version All In One
Bloom filter, JS version
Bloom filter
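Purpose: quickly filter massive data sets and decide whether a given item might be present. 🚀
Pros: it stores data as bits, so lookups are very fast. ✅
Cons: a "not present" answer is 100% accurate, but a "present" answer is unreliable and may be a false positive. ❌
Bloom filter (cache penetration / cache breakdown)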
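For lookup problems such as checking whether a given key exists in a massive data set, a Bloom filter is the usual choice once both space complexity and time complexity are taken into account.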
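To make the cache penetration case concrete, here is a minimal sketch of a read path that asks the filter first and only touches the cache or the database when the key might exist. The names `createGuardedLookup` and `loadFromDb` are hypothetical, and the demo uses a plain `Set` as a stand-in for a real Bloom filter (anything with a `has` method fits).

```javascript
// Hypothetical read path: filter first, then cache, then the backing store.
function createGuardedLookup(filter, cache, loadFromDb) {
  return function lookup(key) {
    // "Definitely not in set": skip the cache and the database entirely.
    if (!filter.has(key)) return { value: undefined, source: "filter" };
    if (cache.has(key)) return { value: cache.get(key), source: "cache" };
    // "Possibly in set": may still be a false positive, so the store decides.
    const value = loadFromDb(key);
    if (value !== undefined) cache.set(key, value);
    return { value, source: "db" };
  };
}

// Demo data (assumed for illustration only).
const knownKeys = new Set(["user:1", "user:2"]); // stand-in for a Bloom filter
const db = new Map([["user:1", "Alice"], ["user:2", "Bob"]]);
let dbReads = 0;
const lookup = createGuardedLookup(knownKeys, new Map(), (key) => {
  dbReads++;
  return db.get(key);
});

console.log(lookup("user:1").source);   // "db"
console.log(lookup("user:1").source);   // "cache"
console.log(lookup("user:999").source); // "filter"
console.log(dbReads);                   // 1
```

Requests for keys that were never stored stop at the filter, so repeated lookups of missing keys do not reach the database.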
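Bloom filters are handy and have many uses, including spam detection and URL deduplication in search-engine crawlers. They rely on k hash functions and one very large bit array to reduce the misjudgments that hash collisions would otherwise cause, which improves accuracy.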
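The idea of k hash functions over one bit array can be sketched in a few lines. This is a minimal illustration, not a production filter: the class name `TinyBloomFilter` is made up here, the hash is an FNV-1a variant over UTF-16 code units, and the k positions are derived from two hash values instead of k independent functions (an assumption that mirrors common practice).

```javascript
// Minimal Bloom filter sketch: m bits packed into 32-bit words, k probes per item.
class TinyBloomFilter {
  constructor(m, k) {
    this.m = Math.ceil(m / 32) * 32; // round up to a whole number of words
    this.k = k;
    this.words = new Uint32Array(this.m / 32);
  }

  // 32-bit FNV-1a variant; the seed is XORed into the offset basis.
  static hash(str, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h;
  }

  // Derive k bit positions as (a + i * b) mod m from two hash values.
  positions(value) {
    const s = String(value);
    const a = TinyBloomFilter.hash(s, 0);
    // Force an odd step so it is never a multiple of m (m is even).
    const b = (TinyBloomFilter.hash(s, 1576284489) | 1) >>> 0;
    const out = [];
    for (let i = 0; i < this.k; i++) out.push((a + i * b) % this.m);
    return out;
  }

  add(value) {
    for (const p of this.positions(value)) this.words[p >>> 5] |= 1 << (p & 31);
  }

  // false means "definitely not added"; true means "possibly added".
  has(value) {
    return this.positions(value).every(
      (p) => (this.words[p >>> 5] & (1 << (p & 31))) !== 0
    );
  }
}

const filter = new TinyBloomFilter(1024, 4);
filter.add("alice");
filter.add("bob");
console.log(filter.has("alice")); // true
console.log(filter.has("carol")); // false unless it happens to be a false positive
```

An added item always sets all of its k bits, so it can never be reported as absent; an item that was never added is reported as present only when other items happen to have set all of its bits.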
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set".
Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant);
the more items added, the larger the probability of false positives.
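How fast the false positive probability grows can be estimated with the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions and n the number of inserted items. The sketch below also includes the usual sizing formulas; the function names are chosen for this example only.

```javascript
// Approximate false positive probability after n insertions.
function falsePositiveRate(m, k, n) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

// Number of bits needed for n items at target false positive rate p.
function optimalBits(n, p) {
  return Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
}

// Number of hash functions that minimizes the rate for m bits and n items.
function optimalHashCount(m, n) {
  return Math.max(1, Math.round((m / n) * Math.LN2));
}

const m = optimalBits(1000, 0.01);
const k = optimalHashCount(m, 1000);
console.log(m, k); // 9586 7

// The same filter degrades as more items are added than it was sized for.
for (const n of [500, 1000, 2000, 4000]) {
  console.log(n, falsePositiveRate(m, k, n).toFixed(4));
}
```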
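The counting Bloom filter variant mentioned above replaces each bit with a small counter so that items can be removed again. The following is a rough sketch under simplifying assumptions: the class name is hypothetical, counters are 8-bit and saturate at 255, and the hash is the same FNV-1a style variant with two seeds.

```javascript
// Counting Bloom filter sketch: one 8-bit counter per cell instead of one bit.
class CountingBloomFilter {
  constructor(m, k) {
    this.m = m;
    this.k = k;
    this.counters = new Uint8Array(m);
  }

  static hash(str, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h;
  }

  positions(value) {
    const s = String(value);
    const a = CountingBloomFilter.hash(s, 0);
    const b = CountingBloomFilter.hash(s, 1576284489);
    const out = [];
    for (let i = 0; i < this.k; i++) out.push((a + i * b) % this.m);
    return out;
  }

  add(value) {
    for (const p of this.positions(value)) {
      // Saturate instead of wrapping around; a saturated cell is no longer exact.
      if (this.counters[p] < 255) this.counters[p]++;
    }
  }

  has(value) {
    return this.positions(value).every((p) => this.counters[p] > 0);
  }

  // Only remove items that appear to be present; removing an item that was
  // never added (a false positive) can corrupt the filter.
  remove(value) {
    if (!this.has(value)) return false;
    for (const p of this.positions(value)) {
      if (this.counters[p] > 0) this.counters[p]--;
    }
    return true;
  }
}

const counting = new CountingBloomFilter(256, 3);
counting.add("foo");
console.log(counting.has("foo"));    // true
console.log(counting.remove("foo")); // true
console.log(counting.has("foo"));    // false
```

The price is memory: with 8-bit counters the structure is eight times larger than a plain bit array of the same length.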
https://en.wikipedia.org/wiki/Bloom_filter
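Use cases
These use cases share one requirement: how do you check whether a given record exists in a massive data set?
- In word-processing software, checking whether an English word is spelled correctly;
- At the FBI, checking whether a suspect's name is already on the suspect list;
- In a web crawler, checking whether a URL has already been visited;
- Spam filtering in mail services such as Gmail;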
demos
https://www.npmjs.com/package/bloomfilter
https://github.com/jasondavies/bloomfilter.js/blob/master/bloomfilter.js
// Bloom Filter
(function(exports) {
  exports.BloomFilter = BloomFilter;
  exports.fnv_1a = fnv_1a;
  var typedArrays = typeof ArrayBuffer !== "undefined";
  // Creates a new bloom filter. If *m* is an array-like object, with a length
  // property, then the bloom filter is loaded with data from the array, where
  // each element is a 32-bit integer. Otherwise, *m* should specify the
  // number of bits. Note that *m* is rounded up to the nearest multiple of
  // 32. *k* specifies the number of hashing functions.
  function BloomFilter(m, k) {
    var a;
    if (typeof m !== "number") a = m, m = a.length * 32;
    var n = Math.ceil(m / 32),
        i = -1;
    this.m = m = n * 32;
    this.k = k;
    if (typedArrays) {
      var kbytes = 1 << Math.ceil(Math.log(Math.ceil(Math.log(m) / Math.LN2 / 8)) / Math.LN2),
          array = kbytes === 1 ? Uint8Array : kbytes === 2 ? Uint16Array : Uint32Array,
          kbuffer = new ArrayBuffer(kbytes * k),
          buckets = this.buckets = new Int32Array(n);
      if (a) while (++i < n) buckets[i] = a[i];
      this._locations = new array(kbuffer);
    } else {
      var buckets = this.buckets = [];
      if (a) while (++i < n) buckets[i] = a[i];
      else while (++i < n) buckets[i] = 0;
      this._locations = [];
    }
  }
  // See http://willwhim.wpengine.com/2011/09/03/producing-n-hash-functions-by-hashing-only-once/
  BloomFilter.prototype.locations = function(v) {
    var k = this.k,
        m = this.m,
        r = this._locations,
        a = fnv_1a(v),
        b = fnv_1a(v, 1576284489), // The seed value is chosen randomly
        x = a % m;
    for (var i = 0; i < k; ++i) {
      r[i] = x < 0 ? (x + m) : x;
      x = (x + b) % m;
    }
    return r;
  };
  BloomFilter.prototype.add = function(v) {
    var l = this.locations(v + ""),
        k = this.k,
        buckets = this.buckets;
    for (var i = 0; i < k; ++i) buckets[Math.floor(l[i] / 32)] |= 1 << (l[i] % 32);
  };
  BloomFilter.prototype.test = function(v) {
    var l = this.locations(v + ""),
        k = this.k,
        buckets = this.buckets;
    for (var i = 0; i < k; ++i) {
      var b = l[i];
      if ((buckets[Math.floor(b / 32)] & (1 << (b % 32))) === 0) {
        return false;
      }
    }
    return true;
  };
  // Estimated cardinality.
  BloomFilter.prototype.size = function() {
    var buckets = this.buckets,
        bits = 0;
    for (var i = 0, n = buckets.length; i < n; ++i) bits += popcnt(buckets[i]);
    return -this.m * Math.log(1 - bits / this.m) / this.k;
  };
  // http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
  function popcnt(v) {
    v -= (v >> 1) & 0x55555555;
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return ((v + (v >> 4) & 0xf0f0f0f) * 0x1010101) >> 24;
  }
  // Fowler/Noll/Vo hashing.
  // Nonstandard variation: this function optionally takes a seed value that is incorporated
  // into the offset basis. According to http://www.isthe.com/chongo/tech/comp/fnv/index.html
  // "almost any offset_basis will serve so long as it is non-zero".
  function fnv_1a(v, seed) {
    var a = 2166136261 ^ (seed || 0);
    for (var i = 0, n = v.length; i < n; ++i) {
      var c = v.charCodeAt(i),
          d = c & 0xff00;
      if (d) a = fnv_multiply(a ^ d >> 8);
      a = fnv_multiply(a ^ c & 0xff);
    }
    return fnv_mix(a);
  }
  // a * 16777619 mod 2**32
  function fnv_multiply(a) {
    return a + (a << 1) + (a << 4) + (a << 7) + (a << 8) + (a << 24);
  }
  // See https://web.archive.org/web/20131019013225/http://home.comcast.net/~bretm/hash/6.html
  function fnv_mix(a) {
    a += a << 13;
    a ^= a >>> 7;
    a += a << 3;
    a ^= a >>> 17;
    a += a << 5;
    return a & 0xffffffff;
  }
})(typeof exports !== "undefined" ? exports : this);
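The `size` method in the listing above estimates how many distinct items were added from the number of set bits X, using n ≈ -(m / k) · ln(1 - X / m). Below is a standalone sketch of that calculation; the function names are made up for this example, and a simple loop-based bit count is swapped in for the parallel bit trick used by `popcnt`.

```javascript
// Count set bits in a 32-bit word by clearing the lowest set bit each round.
function popcount32(v) {
  let count = 0;
  v >>>= 0;
  while (v !== 0) {
    v &= v - 1;
    count++;
  }
  return count;
}

// Estimate the number of inserted items from the fill level of the bit array.
// words: array of 32-bit integers, k: number of hash functions.
function estimateCardinality(words, k) {
  const m = words.length * 32;
  let bits = 0;
  for (let i = 0; i < words.length; i++) bits += popcount32(words[i]);
  // Returns Infinity when every bit is set, because the filter is saturated.
  return (-m * Math.log(1 - bits / m)) / k;
}

// 32 words with 8 bits set each: X = 256 of m = 1024 bits, k = 4.
const sample = new Int32Array(32).fill(0xff);
console.log(estimateCardinality(sample, 4).toFixed(1)); // "73.6"
```

The estimate is higher than X / k (64 here) because some insertions land on bits that were already set, and the logarithm corrects for those collisions.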
https://www.npmjs.com/package/bloom-filters
https://github.com/Callidon/bloom-filters
https://github.com/Callidon/bloom-filters/blob/master/src/bloom/bloom-filter.ts
// Bloom Filter
/* file : bloom-filter.ts
MIT License
Copyright (c) 2017 Thomas Minier & Arnaud Grall
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/
import ClassicFilter from '../interfaces/classic-filter'
import BaseFilter from '../base-filter'
import BitSet from './bit-set'
import {AutoExportable, Field, Parameter} from '../exportable'
import {optimalFilterSize, optimalHashes} from '../formulas'
import {HashableInput} from '../utils'
/**
 * A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970,
 * that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
 *
 * Reference: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.
 * @see {@link http://crystal.uta.edu/~mcguigan/cse6350/papers/Bloom.pdf} for more details about classic Bloom Filters.
 * @author Thomas Minier
 * @author Arnaud Grall
 */
@AutoExportable<BloomFilter>('BloomFilter', ['_seed'])
export default class BloomFilter
  extends BaseFilter
  implements ClassicFilter<HashableInput>
{
  @Field()
  public _size: number
  @Field()
  public _nbHashes: number
  @Field<BitSet>(
    f => f.export(),
    data => {
      // create the bitset from new and old array-based exported structure
      if (Array.isArray(data)) {
        const bs = new BitSet(data.length)
        data.forEach((val: number, index: number) => {
          if (val !== 0) {
            bs.add(index)
          }
        })
        return bs
      } else {
        return BitSet.import(data as {size: number; content: string})
      }
    }
  )
  public _filter: BitSet
  /**
   * Constructor
   * @param size - The number of cells
   * @param nbHashes - The number of hash functions used
   */
  constructor(
    @Parameter('_size') size: number,
    @Parameter('_nbHashes') nbHashes: number
  ) {
    super()
    if (nbHashes < 1) {
      throw new Error(
        `A BloomFilter cannot uses less than one hash function, while you tried to use ${nbHashes}.`
      )
    }
    this._size = size
    this._nbHashes = nbHashes
    this._filter = new BitSet(size)
  }
  /**
   * Create an optimal bloom filter providing the maximum of elements stored and the error rate desired
   * @param nbItems - The maximum number of item to store
   * @param errorRate - The error rate desired for a maximum of items inserted
   * @return A new {@link BloomFilter}
   */
  public static create(nbItems: number, errorRate: number): BloomFilter {
    const size = optimalFilterSize(nbItems, errorRate)
    const hashes = optimalHashes(size, nbItems)
    return new this(size, hashes)
  }
  /**
   * Build a new Bloom Filter from an existing iterable with a fixed error rate
   * @param items - The iterable used to populate the filter
   * @param errorRate - The error rate, i.e. 'false positive' rate, targeted by the filter
   * @param seed - The random number seed (optional)
   * @return A new Bloom Filter filled with the iterable's elements
   * @example
   * ```js
   * // create a filter with a false positive rate of 0.1
   * const filter = BloomFilter.from(['alice', 'bob', 'carl'], 0.1);
   * ```
   */
  public static from(
    items: Iterable<HashableInput>,
    errorRate: number,
    seed?: number
  ): BloomFilter {
    const array = Array.from(items)
    const filter = BloomFilter.create(array.length, errorRate)
    if (typeof seed === 'number') {
      filter.seed = seed
    }
    array.forEach(element => filter.add(element))
    return filter
  }
  /**
   * Get the optimal size of the filter
   * @return The size of the filter
   */
  get size(): number {
    return this._size
  }
  /**
   * Get the number of bits currently set in the filter
   * @return The filter length
   */
  public get length(): number {
    return this._filter.bitCount()
  }
  /**
   * Add an element to the filter
   * @param element - The element to add
   * @example
   * ```js
   * const filter = new BloomFilter(15, 0.1);
   * filter.add('foo');
   * ```
   */
  public add(element: HashableInput): void {
    const indexes = this._hashing.getIndexes(
      element,
      this._size,
      this._nbHashes,
      this.seed
    )
    for (let i = 0; i < indexes.length; i++) {
      this._filter.add(indexes[i])
    }
  }
  /**
   * Test an element for membership
   * @param element - The element to look for in the filter
   * @return False if the element is definitively not in the filter, True is the element might be in the filter
   * @example
   * ```js
   * const filter = new BloomFilter(15, 0.1);
   * filter.add('foo');
   * console.log(filter.has('foo')); // output: true
   * console.log(filter.has('bar')); // output: false
   * ```
   */
  public has(element: HashableInput): boolean {
    const indexes = this._hashing.getIndexes(
      element,
      this._size,
      this._nbHashes,
      this.seed
    )
    for (let i = 0; i < indexes.length; i++) {
      if (!this._filter.has(indexes[i])) {
        return false
      }
    }
    return true
  }
  /**
   * Get the current false positive rate (or error rate) of the filter
   * @return The current false positive rate of the filter
   * @example
   * ```js
   * const filter = new BloomFilter(15, 0.1);
   * console.log(filter.rate()); // output: something around 0.1
   * ```
   */
  public rate(): number {
    return Math.pow(1 - Math.exp(-this.length / this._size), this._nbHashes)
  }
  /**
   * Check if another Bloom Filter is equal to this one
   * @param other - The filter to compare to this one
   * @return True if they are equal, false otherwise
   */
  public equals(other: BloomFilter): boolean {
    if (this._size !== other._size || this._nbHashes !== other._nbHashes) {
      return false
    }
    return this._filter.equals(other._filter)
  }
}
refs
https://www.cnblogs.com/xgqfrms/p/13490357.html
©xgqfrms 2012-2020
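Original article, all rights reserved ©️xgqfrms. Reproduction is prohibited 🈲️ and infringement will be pursued ⚠️!
This article was first published on cnblogs (博客园). Author: xgqfrms. Original link: https://www.cnblogs.com/xgqfrms/p/16355146.html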