About Locality Sensitive Hashing

Locality Sensitive Hashing

Nearest Neighbor

Given a set P of n points in \(\cal R^d\)

Want to build a data structure to answer nearest neighbor queries

Voronoi Diagram

Build a Voronoi diagram & a point location data structure

A Voronoi diagram, also called Thiessen polygons or a Dirichlet tessellation, is a partition of the plane into contiguous cells formed by the perpendicular bisectors of the segments connecting neighboring points.

[Figure: Voronoi diagram of a point set]

Curse of dimensionality

In \(\cal R^2\) the Voronoi diagram is of size O(n)

Query takes \(O(\log n)\) time

In \(\cal R^d\) the complexity is \(O(n^{d/2})\)

Other techniques also scale badly with the dimension

Locality Sensitive Hashing

We will use a family of hash functions such that close points tend to hash to the same bucket.

Put all points of P into their buckets; ideally, the query q finds its nearest neighbor in its own bucket

Def (Charikar):

A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if

\[Pr[h(p) = h(q)] = sim(p,q) \]

Example – Hamming Similarity

Think of the points as strings of m bits and consider the similarity \(sim(p,q) = 1-ham(p,q)/m\)

\(H=\{h_i(p) = \text{the } i\text{-th bit of } p\}\) is locality sensitive w.r.t.

\(sim(p,q) = 1-ham(p,q)/m\)

\(Pr[h(p) = h(q)] = 1 - ham(p,q)/m = sim(p,q)\)

since \(1-sim(p,q) = ham(p,q)/m\)
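A minimal sketch of this bit-sampling family in Python (the helper names `hamming_similarity` and `bit_hash` are illustrative, not from the notes): drawing a random index i and comparing single bits collides with probability exactly \(1-ham(p,q)/m\).

```python
import random

def hamming_similarity(p, q):
    """sim(p, q) = 1 - ham(p, q)/m for two m-bit strings."""
    m = len(p)
    return 1 - sum(a != b for a, b in zip(p, q)) / m

def bit_hash(i):
    """h_i(p) = the i-th bit of p: one member of the family H."""
    return lambda p: p[i]

# Empirical check that Pr[h(p) = h(q)] = sim(p, q) over a random choice of i.
p, q = "10110100", "10010110"
trials, hits = 100_000, 0
for _ in range(trials):
    h = bit_hash(random.randrange(len(p)))   # draw one h from H, apply to both points
    hits += h(p) == h(q)
print(hits / trials, hamming_similarity(p, q))   # both close to 0.75
```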

Example - Jaccard

Think of p and q as sets

\(sim(p,q) = jaccard(p,q) = |p\cap q|/|p\cup q|\)

\(H=\{h_{\pi}(p) = \text{the minimum item of } p \text{ under the permutation } \pi\}\)

\(Pr[h_{\pi}(p) = h_{\pi}(q)] = jaccard(p,q)\)

Need to pick \(\pi\) from a min-wise independent family of permutations
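A small MinHash sketch in Python; the random permutation \(\pi\) is approximated here by shuffling an explicit universe (a genuinely min-wise independent family is more involved), and the function names are illustrative.

```python
import random

def jaccard(p, q):
    return len(p & q) / len(p | q)

def minhash(universe, seed):
    """h_pi(p) = the smallest item of p under a random permutation pi of the universe."""
    pi = list(universe)
    random.Random(seed).shuffle(pi)
    rank = {x: r for r, x in enumerate(pi)}
    return lambda p: min(p, key=rank.__getitem__)

p, q = {"a", "b", "c", "d"}, {"b", "c", "d", "e"}
universe = sorted(p | q | {"f", "g"})

trials = 50_000
hits = sum(minhash(universe, s)(p) == minhash(universe, s)(q) for s in range(trials))
print(hits / trials, jaccard(p, q))   # both close to 0.6
```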

Map to {0,1}

Draw a function b mapping to {0,1} from a pairwise independent family B

So: \(h(p)\neq h(q) \Rightarrow Pr[b(h(p)) = b(h(q))] = 1/2\)

\(H'=\{b\circ h \mid h\in H, b\in B\}\)
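A hedged sketch of the 0/1 mapping, using the textbook pairwise-independent family \(b(x) = ((ax + c) \bmod P) \bmod 2\) over a large prime P; this particular construction, and the use of Python's built-in hash() to coerce arbitrary bucket ids to integers, are my assumptions rather than details from the notes.

```python
import random

P = 2_147_483_647   # a large prime (2^31 - 1)

def random_bit_function(rng=random):
    """Draw b from the (approximately) pairwise independent family b(x) = ((a*x + c) mod P) mod 2."""
    a, c = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * hash(x) + c) % P) % 2   # hash() coerces any bucket id to an int

# Compose with any h in H to get a member of H' = {b o h}:
# if h(p) != h(q), the two output bits agree with probability 1/2 over the choice of b.
b = random_bit_function()
compose = lambda h: (lambda p: b(h(p)))
```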

Another example (“simhash”)

\[H =\{h_r(p) = 1 \text{ if } r\cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\} \]
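A minimal random-hyperplane ("simhash") sketch in Python; sampling r as a normalized Gaussian vector is a standard way to get a uniform random direction. For this family \(Pr[h_r(p) = h_r(q)] = 1 - \theta(p,q)/\pi\), so it is sensitive to the angle between p and q.

```python
import math, random

def random_hyperplane_hash(d, rng=random):
    """h_r(p) = 1 if r·p > 0 else 0, where r is a random unit vector in R^d."""
    r = [rng.gauss(0, 1) for _ in range(d)]          # Gaussian direction is uniform on the sphere
    norm = math.sqrt(sum(x * x for x in r))
    r = [x / norm for x in r]
    return lambda p: 1 if sum(ri * pi for ri, pi in zip(r, p)) > 0 else 0

h = random_hyperplane_hash(3)
print(h([1.0, 2.0, -0.5]), h([1.1, 1.9, -0.4]))      # nearby vectors usually hash alike
```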

How do we really use it?

Reduce the number of false positives by concatenating hash functions to get new hash functions ("signatures")

\[sig(p) = h_1(p)h_2(p)h_3(p)h_4(p)\cdots = 00101010 \]

Very close documents are hashed to the same bucket or to "close" buckets (\(ham(sig(p),sig(q))\) is small)

See papers on removing almost duplicates…
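A sketch of building such signatures by concatenating k hash bits, here using bit-sampling functions for concreteness (all names and parameters are illustrative):

```python
import random

def make_signature(hash_functions):
    """sig(p) = h1(p) h2(p) ... hk(p), written as a bit string."""
    return lambda p: "".join(str(h(p)) for h in hash_functions)

def signature_hamming(s, t):
    """ham(sig(p), sig(q)): small when p and q are very similar."""
    return sum(a != b for a, b in zip(s, t))

# Use k bit-sampling functions h_i(p) = i-th bit of p as the underlying family.
m, k = 64, 8
indices = random.sample(range(m), k)
sig = make_signature([lambda p, i=i: p[i] for i in indices])

p = "".join(random.choice("01") for _ in range(m))
q = p[:60] + p[60:].translate(str.maketrans("01", "10"))   # near-duplicate: flip the last 4 bits
print(sig(p), sig(q), signature_hamming(sig(p), sig(q)))   # signatures differ in few (often zero) positions
```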

A theoretical result on NN

Thm: If there exists a family H of hash functions such that

\(Pr[h(p) = h(q)] = sim(p,q)\)

then \(d(p,q) = 1-sim(p,q)\) satisfies the triangle inequality
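A short proof sketch (a reconstruction of the standard union-bound argument; the notes state only the theorem). Since \(d(p,q) = 1-sim(p,q) = Pr[h(p)\neq h(q)]\), and the event \(h(p)\neq h(r)\) implies \(h(p)\neq h(q)\) or \(h(q)\neq h(r)\):

\[ d(p,r) = Pr[h(p)\neq h(r)] \leq Pr[h(p)\neq h(q)] + Pr[h(q)\neq h(r)] = d(p,q) + d(q,r). \]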

Alternative Def (Indyk-Motwani):

A family H of functions is \((r_1 < r_2, p_1 > p_2)\)-sensitive if

  • \(d(p,q) \leq r_1 \Rightarrow Pr[h(p) = h(q)] \geq p_1\)

  • \(d(p,q) \geq r_2 \Rightarrow Pr[h(p) = h(q)] \leq p_2\)

If \(d(p,q) = 1-sim(p,q)\) then this holds with \(p_1 = 1-r_1\) and \(p_2 = 1-r_2\) for any \(r_1, r_2\)

(r,ε)-neighbor problem

  • (1) If there is a neighbor p such that d(p,q) \(\leq\) r, return a point p', s.t. d(p',q) \(\leq\) (1+ε)r.

  • (2) If there is no p s.t. d(p,q) \(\leq\) (1+ε)r, return nothing.

((1) is the real requirement: if we satisfy (1) only, we can satisfy (2) by filtering out answers that are too far.)

  • Let's construct a data structure that succeeds with constant probability
  • Focus on the Hamming distance first

NN using locality sensitive hashing

Take a \((r_1 < r_2,\ p_1 > p_2) = (r < (1+\epsilon)r,\ 1-r/m > 1-(1+\epsilon)r/m)\)-sensitive family

If there is a neighbor at distance r we catch it with probability \(p_1\), so to catch it (with constant probability) we need \(1/p_1\) hash functions.

But we also get false positives in our \(1/p_1\) buckets. How many? In expectation \(np_2/p_1\).
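A hedged end-to-end sketch of the resulting data structure for Hamming distance, following the analysis above: \(L = \lceil 1/p_1 \rceil\) single-bit tables, with the query filtering out candidates farther than \((1+\epsilon)r\). The class name and parameters are illustrative; a practical implementation would concatenate several bits per table, as in the signature section.

```python
import math, random

def hamming(p, q):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(p, q))

class ApproxNeighbor:
    """(r, eps)-neighbor structure for m-bit strings, following the analysis above:
    L = ceil(1/p1) bit-sampling hash functions, with one bucket table per function."""

    def __init__(self, points, m, r, eps, rng=random):
        p1 = 1 - r / m                        # collision probability at distance r
        self.r, self.eps = r, eps
        self.L = math.ceil(1 / p1)
        self.indices = [rng.randrange(m) for _ in range(self.L)]
        self.tables = [{} for _ in range(self.L)]
        for p in points:
            for table, i in zip(self.tables, self.indices):
                table.setdefault(p[i], []).append(p)

    def query(self, q):
        for table, i in zip(self.tables, self.indices):
            for p in table.get(q[i], []):     # candidates; ~ n*p2/p1 false positives expected
                if hamming(p, q) <= (1 + self.eps) * self.r:
                    return p                  # filter: only return sufficiently close points
        return None                           # nothing within (1+eps)*r found in the buckets
```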

Note 1: From now on, one way I can describe a point set is to first build its Voronoi diagram.

Note 2: I did not fully understand the min-wise part.

Note 3: "concatenating" means joining together.

Note 4: "Alternative Def" means an alternative definition.
