About Locality Sensitive Hashing

Locality Sensitive Hashing

nearest neighbor

Given a set P of n points in \(\mathbb{R}^d\)

Want to build a data structure to answer nearest neighbor queries

Voronoi Diagram

Build a Voronoi diagram & a point location data structure

A Voronoi diagram, also called a Thiessen polygon diagram or a Dirichlet tessellation, is a set of contiguous polygons formed by the perpendicular bisectors of the segments connecting neighboring points.

(figure: a Voronoi diagram of a point set)

Curse of dimensionality

In \(\mathbb{R}^2\) the Voronoi diagram has size O(n)

A query takes \(O(\log n)\) time

In \(\mathbb{R}^d\) the complexity is \(O(n^{\lceil d/2 \rceil})\)

Other techniques also scale badly with the dimension

Locality Sensitive Hashing

We will use a family of hash functions such that close points tend to hash to the same bucket.

Put all points of P into their buckets; ideally, the query q finds its nearest neighbor in its own bucket

Def (Charikar):

A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if

\[Pr[h(p) = h(q)] = sim(p,q) \]

Example – Hamming Similarity

Think of the points as strings of m bits and consider the similarity \(sim(p,q) = 1-ham(p,q)/m\)

\(H=\{h_i(p) = \text{the } i\text{-th bit of } p\}\) is locality sensitive w.r.t.

\(sim(p,q) = 1-ham(p,q)/m\)

\(Pr[h(p) = h(q)] = 1 - ham(p,q)/m = sim(p,q)\), since a uniformly random index i lands on one of the \(m - ham(p,q)\) positions where p and q agree.
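A minimal sketch of this bit-sampling family (function and variable names are my own, not from the lecture): choosing the index i uniformly at random gives a collision probability of exactly \(1 - ham(p,q)/m\).

```python
import random

def ham(p, q):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(p, q))

def make_bit_sampler(m, rng=random):
    """Draw h_i from H: hash a length-m bit string to its i-th bit."""
    i = rng.randrange(m)
    return lambda p: p[i]

# Empirical check that Pr[h(p) = h(q)] ≈ 1 - ham(p,q)/m.
m = 16
p = "0010101011110000"
q = "0010101011010001"
hs = [make_bit_sampler(m) for _ in range(50_000)]
collisions = sum(h(p) == h(q) for h in hs) / len(hs)
print(collisions, 1 - ham(p, q) / m)  # the two numbers should be close
```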

Example - Jaccard

Think of p and q as sets

\(sim(p,q) = jaccard(p,q) = |p\cap q|/|p\cup q|\)

\(H=\{h_{\pi}(p) = \text{the minimal item of } p \text{ under the permutation } \pi\}\)

\(Pr[h_{\pi}(p) = h_{\pi}(q)] = jaccard(p,q)\)

Need to pick \(\pi\) from a min-wise independent family of permutations
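A minimal MinHash sketch along these lines (my own illustration; it uses a truly random permutation of a small universe, whereas practical implementations approximate a min-wise independent permutation with a random hash function):

```python
import random

def jaccard(p, q):
    return len(p & q) / len(p | q)

def make_minhash(universe, rng=random):
    """h_pi(p) = the element of p that comes first under a random permutation pi."""
    order = {x: rank for rank, x in enumerate(rng.sample(sorted(universe), len(universe)))}
    return lambda p: min(p, key=order.__getitem__)

universe = set(range(50))
p = {1, 3, 5, 7, 9, 11}
q = {3, 5, 7, 9, 13}

hs = [make_minhash(universe) for _ in range(50_000)]
collisions = sum(h(p) == h(q) for h in hs) / len(hs)
print(collisions, jaccard(p, q))  # ≈ |p ∩ q| / |p ∪ q|
```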

Map to {0,1}

Draw a function b mapping to {0,1} from a pairwise independent family B

So: \(h(p)\neq h(q) \Rightarrow Pr[b(h(p)) = b(h(q))] = 1/2\)

\(H'=\{b(h(\cdot)) \mid h\in H, b\in B\}\)
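A sketch of one standard way to draw such a b, assuming the classic \(((ax + c) \bmod prime) \bmod 2\) pairwise independent construction and assuming h returns an integer (both are my choices for illustration). If \(h(p) = h(q)\) the bits always agree; otherwise they agree with probability 1/2, so \(Pr[b(h(p)) = b(h(q))] = (1 + sim(p,q))/2\).

```python
import random

PRIME = 2_147_483_647  # a prime larger than any value h() can return

def make_pairwise_bit(rng=random):
    """Draw b: integers -> {0,1} from the pairwise independent family ((a*x + c) mod PRIME) mod 2."""
    a = rng.randrange(1, PRIME)
    c = rng.randrange(PRIME)
    return lambda x: ((a * x + c) % PRIME) % 2

# Compose with any h in H to get a one-bit locality sensitive hash b(h(p)).
b = make_pairwise_bit()
one_bit_hash = lambda p, h: b(h(p))
```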

Another example (“simhash”)

\[H = \{h_r(p) = 1 \text{ if } r\cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}\]
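For this family \(Pr[h_r(p) = h_r(q)] = 1 - \theta(p,q)/\pi\), where \(\theta(p,q)\) is the angle between p and q. A minimal simhash sketch with numpy (names are mine):

```python
import numpy as np

def make_simhash(d, rng):
    """h_r(p) = 1 if r·p > 0 else 0, for a random unit vector r in R^d."""
    r = rng.standard_normal(d)
    r /= np.linalg.norm(r)
    return lambda p: int(r @ p > 0)

d = 8
rng = np.random.default_rng(0)
p, q = rng.standard_normal(d), rng.standard_normal(d)

hs = [make_simhash(d, rng) for _ in range(50_000)]
collisions = sum(h(p) == h(q) for h in hs) / len(hs)
theta = np.arccos(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
print(collisions, 1 - theta / np.pi)  # the two numbers should be close
```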

How do we really use it?

Reduce the number of false positives by concatenating hash functions to get new hash functions ("signatures")

\[sig(p) = h_1(p)\,h_2(p)\,h_3(p)\,h_4(p)\cdots = 00101010\ldots\]

Very close documents are hashed to the same bucket or to "close" buckets (i.e., \(ham(sig(p),sig(q))\) is small)

See papers on removing almost duplicates…
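A sketch of building such a signature by concatenating one-bit hash functions (the helper names are mine); with, e.g., the simhash bits from above, similar documents differ in few signature positions and therefore land in the same or nearby buckets.

```python
def make_signature(bit_hashes):
    """sig(p) = h1(p) h2(p) ... hk(p): concatenate k one-bit hash functions."""
    return lambda p: "".join(str(h(p)) for h in bit_hashes)

def sig_ham(s, t):
    """Hamming distance between two equal-length signatures."""
    return sum(a != b for a, b in zip(s, t))

# Example use, with the simhash family sketched above:
#   sig = make_signature([make_simhash(d, rng) for _ in range(8)])
#   then index each document p under the bucket key sig(p).
```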

A theoretical result on NN

Locality Sensitive Hashing

Thm: If there exists a family H of hash functions such that

\(Pr[h(p) = h(q)] = sim(p,q)\)

then \(d(p,q) = 1-sim(p,q)\) satisfies the triangle inequality
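The reason is a union bound: \(d(p,q) = 1 - sim(p,q) = Pr[h(p) \neq h(q)]\), and the event \(\{h(p) \neq h(q)\}\) is contained in \(\{h(p) \neq h(r)\} \cup \{h(r) \neq h(q)\}\), so for any point r

\[d(p,q) = Pr[h(p) \neq h(q)] \leq Pr[h(p) \neq h(r)] + Pr[h(r) \neq h(q)] = d(p,r) + d(r,q)\]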

Alternative Def (Indyk-Motwani):

A family H of functions is \((r_1 < r_2,\ p_1 > p_2)\)-sensitive if

  • \(d(p,q) \leq r_1 \Rightarrow Pr[h(p) = h(q)] \geq p_1\)

  • \(d(p,q) \geq r_2 \Rightarrow Pr[h(p) = h(q)] \leq p_2\)

If \(d(p,q) = 1-sim(p,q)\), then this holds with \(p_1 = 1-r_1\) and \(p_2 = 1-r_2\) for any \(r_1, r_2\)

(r,ε)-neighbor problem

  • If there is a neighbor p such that \(d(p,q) \leq r\), return some p' with \(d(p',q) \leq (1+\epsilon)r\).

  • If there is no p such that \(d(p,q) \leq (1+\epsilon)r\), return nothing.

(The first condition is the real requirement: if we satisfy it alone, we can satisfy the second by filtering out answers that are too far.)

  • Let's construct a data structure that succeeds with constant probability
  • Focus on the Hamming distance first

NN using locality sensitive hashing

Take a \((r_1 < r_2,\ p_1 > p_2) = (r < (1+\epsilon)r,\ 1-r/m > 1-(1+\epsilon)r/m)\)-sensitive family

If there is a neighbor at distance r we catch it with probability \(p_1\), so to catch it with constant probability we need about \(1/p_1\) functions.

But we also get false positives in our \(1/p_1\) buckets. How many? About \(np_2/p_1\).
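A rough sketch of this construction in the Hamming setting, following the slide literally with one sampled bit per table (the helper names and the distance filter in `query` are my own; practical LSH indexes usually concatenate several bits per table, but the structure is the same):

```python
import random
from collections import defaultdict

def ham(p, q):
    return sum(a != b for a, b in zip(p, q))

def build_lsh_index(points, m, num_tables, rng=random):
    """Build ~1/p1 hash tables, each keyed by one sampled bit h_i(p) = p[i]."""
    tables = []
    for _ in range(num_tables):
        i = rng.randrange(m)
        buckets = defaultdict(list)
        for p in points:
            buckets[p[i]].append(p)
        tables.append((i, buckets))
    return tables

def query(tables, q, r, eps):
    """(r, eps)-neighbor query: scan q's bucket in each table and filter false positives."""
    for i, buckets in tables:
        for p in buckets.get(q[i], []):
            if ham(p, q) <= (1 + eps) * r:
                return p  # an approximate neighbor within (1+eps)*r
    return None
```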

Note 1: From now on, one way to describe a point set is to first build its Voronoi diagram.

Note 2: I didn't fully understand the min-wise (independent permutations) part.

Note 3: "concatenating" means joining hash values one after another.

Note 4: "Alternative Def" means an alternative definition.
