About Locality Sensitive Hashing
Locality Sensitive Hashing
Nearest neighbor
Given a set P of n points in \(\mathbb{R}^d\)
Want to build a data structure to answer nearest neighbor queries
Voronoi Diagram
Build a Voronoi diagram & a point location data structure
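A Voronoi diagram, also called a Thiessen polygon diagram or Dirichlet tessellation, is a set of contiguous polygons whose edges are the perpendicular bisectors of the segments joining neighboring points.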
Curse of dimensionality
In \(\mathbb{R}^2\) the Voronoi diagram is of size \(O(n)\)
Query takes \(O(\log n)\) time
In \(\mathbb{R}^d\) the complexity is \(O(n^{\lceil d/2 \rceil})\)
Other techniques also scale badly with the dimension
Locality Sensitive Hashing
We will use a family of hash functions such that close points tend to hash to the same bucket.
Put all points of P in their buckets; ideally, the query q finds its nearest neighbor in its own bucket
Def (Charikar):
A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if \(Pr[h(p) = h(q)] = sim(p,q)\) for h drawn at random from H
Example – Hamming Similarity
Think of the points as strings of m bits and consider the similarity \(sim(p,q) = 1-ham(p,q)/m\)
\(H=\{h_i(p) = \text{the } i\text{-th bit of } p\}\) is locality sensitive with respect to
\(sim(p,q) = 1-ham(p,q)/m\)
\(Pr[h(p) = h(q)] = 1 - ham(p,q)/m\)
\(1-sim(p,q) = ham(p,q)/m\)
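A minimal Python sketch of the bit-sampling family, assuming points are equal-length strings of '0'/'1' characters. The helper names (`hamming`, `collision_probability`, `sample_hash`) are illustrative and not from the notes. It checks that the fraction of indices i with \(h_i(p) = h_i(q)\) equals \(1-ham(p,q)/m\).

```python
import random

def hamming(p, q):
    """Number of positions where the equal-length bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def collision_probability(p, q):
    """Exact Pr over a uniform index i that the i-th bits of p and q agree."""
    m = len(p)
    return sum(p[i] == q[i] for i in range(m)) / m

def sample_hash(m, rng):
    """Draw h_i from H by picking the index i uniformly at random."""
    i = rng.randrange(m)
    return lambda p: p[i]

p = "10110100"
q = "10010110"
m = len(p)
sim = 1 - hamming(p, q) / m          # Hamming similarity from the notes

# Empirical check: draw many hash functions and count collisions.
rng = random.Random(0)
trials = 20000
hits = 0
for _ in range(trials):
    h = sample_hash(m, rng)
    hits += h(p) == h(q)
estimate = hits / trials
```

The exact collision probability and `sim` coincide by construction; the empirical `estimate` only approaches them as the number of trials grows.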
Example - Jaccard
Think of p and q as sets
\(sim(p,q) = jaccard(p,q) = |p\cap q|/|p\cup q|\)
\(H=\{h_{\pi}(p) = \text{the minimum under } \pi \text{ of the items in } p\}\)
\(Pr[h_{\pi}(p) = h_{\pi}(q)] = jaccard(p,q)\)
Need to pick \(\pi\) from a min-wise independent family of permutations
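A small Python sketch of min-hashing, assuming a tiny universe so that \(\pi\) can be a fully random permutation (which is min-wise independent; practical systems use cheaper approximations). The function names and the example sets are made up for illustration.

```python
import random

def jaccard(p, q):
    """Jaccard similarity |p ∩ q| / |p ∪ q| of two non-empty sets."""
    return len(p & q) / len(p | q)

def sample_minhash(universe, rng):
    """Draw h_pi by choosing a uniformly random permutation pi of the universe."""
    items = list(universe)
    rng.shuffle(items)
    rank = {x: r for r, x in enumerate(items)}   # rank[x] = position of x under pi
    # h_pi(s) is the item of s that comes first under pi.
    return lambda s: min(s, key=rank.__getitem__)

def estimate_collision(p, q, universe, trials, rng):
    """Fraction of random permutations for which h_pi(p) == h_pi(q)."""
    hits = 0
    for _ in range(trials):
        h = sample_minhash(universe, rng)
        hits += h(p) == h(q)
    return hits / trials

universe = range(10)
p = {0, 1, 2, 3, 4}
q = {3, 4, 5, 6}
rng = random.Random(1)
estimate = estimate_collision(p, q, universe, 20000, rng)
```

The intuition: the minimum of \(p \cup q\) under \(\pi\) is equally likely to be any of its items, and the two hashes agree exactly when that item lies in \(p \cap q\).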
Map to {0,1}
Draw a function b mapping hash values to {0,1} from a pairwise independent family B
So: \(h(p)\neq h(q) \Rightarrow Pr[b(h(p)) = b(h(q))] = 1/2\)
\(H'=\{b(h(\cdot)) \mid h\in H, b\in B\}\)
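A short derivation of the collision probability for the one-bit family, assuming b is uniform on {0,1} and pairwise independent, so that two distinct hash values receive equal bits with probability 1/2:

\[Pr[b(h(p)) = b(h(q))] = Pr[h(p)=h(q)]\cdot 1 + Pr[h(p)\neq h(q)]\cdot \tfrac{1}{2} = sim(p,q) + \frac{1-sim(p,q)}{2} = \frac{1+sim(p,q)}{2}\]

So a one-bit hash still separates similar from dissimilar pairs, but the collision probability is compressed into the range [1/2, 1].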
Another example (“simhash”)
How do we really use it?
Reduce the number of false positives by concatenating hash functions to get new hash functions (“signatures”)
Very close documents are hashed to the same bucket or to “close” buckets (\(ham(sig(p),sig(q))\) is small)
See papers on removing near-duplicates…
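A rough Python sketch of signatures, assuming the bit-sampling family from the Hamming example and k independently drawn functions per signature. The names `sample_signature` and `bucket` and the three example strings are hypothetical. If one function collides with probability s, a signature of k independent functions collides with probability \(s^k\), which is what suppresses false positives.

```python
import random
from collections import defaultdict

def sample_signature(m, k, rng):
    """Concatenate k independently drawn bit-sampling functions into one signature."""
    idx = [rng.randrange(m) for _ in range(k)]
    return lambda p: "".join(p[i] for i in idx)

def bucket(points, sig):
    """Group point names by the signature of their bit string."""
    table = defaultdict(list)
    for name, p in points.items():
        table[sig(p)].append(name)
    return dict(table)

points = {
    "a": "1111000011110000",
    "b": "1111000011110001",  # differs from a in one bit
    "c": "0000111100001111",  # bitwise complement of a
}
rng = random.Random(7)
sig = sample_signature(m=16, k=4, rng=rng)
buckets = bucket(points, sig)
```

Here `a` and `b` usually land in the same bucket, while `a` and `c` never can, because every sampled bit differs.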
A theoretical result on NN
Locality Sensitive Hashing
Thm: If there exists a family H of hash functions such that
\(Pr[h(p) = h(q)] = sim(p,q)\)
then \(d(p,q) = 1-sim(p,q)\) satisfies the triangle inequality
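One way to see this is a union-bound argument (the third point u is introduced here only for the derivation). By the assumption on H,

\[d(p,q) = 1 - sim(p,q) = Pr[h(p)\neq h(q)]\]

For any fixed h, \(h(p)\neq h(u)\) implies \(h(p)\neq h(q)\) or \(h(q)\neq h(u)\), so

\[d(p,u) = Pr[h(p)\neq h(u)] \leq Pr[h(p)\neq h(q)] + Pr[h(q)\neq h(u)] = d(p,q) + d(q,u)\]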
Alternative Def (Indyk-Motwani):
A family H of functions is \((r_1 < r_2, p_1 > p_2)\)-sensitive if
- \(d(p,q) \leq r_1 \Rightarrow Pr[h(p) = h(q)] \geq p_1\)
- \(d(p,q) \geq r_2 \Rightarrow Pr[h(p) = h(q)] \leq p_2\)
If \(d(p,q) = 1-sim(p,q)\) then this holds with \(p_1 = 1-r_1\) and \(p_2 = 1-r_2\) for any \(r_1, r_2\)
(r,ε)-neighbor problem
- (1) If there is a neighbor p such that \(d(p,q) \leq r\), return p' such that \(d(p',q) \leq (1+\epsilon)r\).
- (2) If there is no p such that \(d(p,q) \leq (1+\epsilon)r\), return nothing.
((1) is the real requirement: if we satisfy (1) only, we can satisfy (2) by filtering out answers that are too far.)
- Let's construct a data structure that succeeds with constant probability
- Focus on the Hamming distance first
NN using locality sensitive hashing
Take an \((r_1 < r_2, p_1 > p_2) = (r < (1+\epsilon)r,\ 1-r/m > 1-(1+\epsilon)r/m)\)-sensitive family
If there is a neighbor at distance r, each function catches it with probability at least \(p_1\), so to catch it with constant probability we need about \(1/p_1\) functions.
But we also get false positives in our \(1/p_1\) buckets. How many? At most \(n p_2/p_1\) in expectation.
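A minimal Python sketch of this construction for Hamming distance, under stated assumptions: points are equal-length '0'/'1' strings, r < m, and each table is keyed by k sampled bits. The class name `HammingLSH` and the parameter k are hypothetical; k = 1 matches the single-function argument above, and the number of tables is \(\lceil 1/p_1^k \rceil\). With that many independent tables, a neighbor at distance r is missed with probability at most \((1-p_1^k)^{1/p_1^k} \leq 1/e\).

```python
import math
import random
from collections import defaultdict

def hamming(p, q):
    """Number of positions where the equal-length bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """Several hash tables, each keyed by k sampled bit positions (illustrative sketch)."""

    def __init__(self, points, m, r, k=1, seed=0):
        rng = random.Random(seed)
        p1 = (1 - r / m) ** k                 # per-table collision probability at distance r (needs r < m)
        self.num_tables = math.ceil(1 / p1)   # about 1/p1 tables for constant success probability
        self.tables = []
        for _ in range(self.num_tables):
            idx = [rng.randrange(m) for _ in range(k)]
            table = defaultdict(list)
            for p in points:
                table[self._key(p, idx)].append(p)
            self.tables.append((idx, table))

    @staticmethod
    def _key(p, idx):
        return "".join(p[i] for i in idx)

    def query(self, q, r, eps):
        """Return some stored point within (1+eps)*r of q found in q's buckets, else None."""
        limit = (1 + eps) * r
        for idx, table in self.tables:
            for p in table.get(self._key(q, idx), []):
                if hamming(p, q) <= limit:    # filter out false positives
                    return p
        return None

points = ["00000000", "00000001", "11111111"]
index = HammingLSH(points, m=8, r=1, k=1)
near = index.query("00000000", r=1, eps=1)   # a stored point, so some table must hit
far = index.query("00111100", r=1, eps=1)    # every stored point is at distance >= 4
```

The distance filter inside `query` is what turns requirement (1) into requirement (2): anything returned is within \((1+\epsilon)r\), at the cost of scanning the false positives in each bucket.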
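Note 1: In the future, when describing a point set, I could start by drawing its Voronoi diagram.
Note 2: I did not understand the min-wise part.
Note 3: "concatenating" means joining end to end.
Note 4: "Alternative Def" means an alternative definition.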