Big Data Tech and Analytics --- Locality Sensitive Hashing

Motivation: finding "similar" sets in high-dimensional space

Definitions:

  • Distance Measures: Aim to find "near neighbors" in high-dimensional space
    •   We formally define “near neighbors” as points that are a “small distance” apart
  • Jaccard distance/similarity
    The Jaccard similarity of two sets is the size of their intersection divided by the size of their union; the Jaccard distance is one minus that:
    sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
    d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

1. Finding Similar Documents

Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are “near duplicates”

Applications:

  • Mirror websites, or approximate mirrors
    • Don’t want to show both in a search
  • Similar news articles at many news sites
    • Cluster articles by “same story"

Problems:

  • Many pieces of one document can appear out of order in another
  • Too many documents to compare all pairs
  • Documents are so large or so many that they cannot fit in main memory

Three essential steps to solve these problems:

  • Shingling: Convert documents to sets
  • Minhashing: Convert large sets to short signatures, while preserving similarity
  • Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents (Candidate pairs!)

 

 1.1 Shingling

  1.1.1 Definition

    A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc.

    Tokens can be characters, words or something else, depending on the application

    Assume tokens = characters for examples

    Example: k = 2; document D1 = abcab

    Set of 2-shingles: S(D1)={ab, bc, ca}

  1.1.2 Compressing Shingles

    To compress long shingles, we can hash them

    Represent a doc by the set of hash values of its k-shingles

    Note: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
    Example: k = 2; document D1 = abcab
    Set of 2-shingles: S(D1)={ab, bc, ca}
    Hash the shingles: h(D1)={1, 5, 7}
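    As a concrete illustration, here is a minimal Python sketch of character-level shingling and of hashing shingles into buckets (the helper names and the use of md5 with a 32-bit bucket range are illustrative assumptions, not part of the original notes):

      import hashlib

      def shingles(doc, k=2):
          # Set of k-shingles (character k-grams) of a document
          return {doc[i:i + k] for i in range(len(doc) - k + 1)}

      def hashed_shingles(doc, k=2, buckets=2**32):
          # Represent a document by the hash values (bucket ids) of its k-shingles
          return {int(hashlib.md5(s.encode()).hexdigest(), 16) % buckets
                  for s in shingles(doc, k)}

      # Example from the notes: k = 2, D1 = abcab
      print(shingles("abcab"))         # {'ab', 'bc', 'ca'}
      print(hashed_shingles("abcab"))  # three bucket ids standing in for {1, 5, 7}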

 

  1.1.3 Similarity Metric for Shingles

    Document D1 = set of k-shingles C1 = S(D1)

    Equivalently, each document is a 0/1 vector in the space of k-shingles. E.g. C1 = [0, 1, 0, 0, 0,1, 0, 0, 1, 0...]

    Each unique shingle is a dimension. Vectors are very sparse.

    A natural similarity measure is the Jaccard similarity: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
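
    Continuing the sketch above, the Jaccard similarity of two shingle sets can be computed directly (the jaccard helper and the second document are illustrative assumptions):

      def jaccard(s1, s2):
          # |intersection| / |union| of two sets
          if not s1 and not s2:
              return 1.0
          return len(s1 & s2) / len(s1 | s2)

      d1, d2 = "abcab", "abcd"
      # {ab, bc, ca} vs {ab, bc, cd}: intersection 2, union 4 -> 0.5
      print(jaccard(shingles(d1), shingles(d2)))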

1.2 MinHashing

  1.2.1 Motivation

    Suppose we need to find near-duplicate documents among N=1 million documents

    Naively, we’d have to compute pairwise Jaccard similarities for every pair of docs

    i.e. 𝑁(𝑁−1)/2≈5∗10^11 comparisons

    At 10^5 secs/day and 10^6 comparison/sec, it would take 5 days

    For N=10 million, it takes more than a year…

  1.2.2 Encoding Sets as Bit Vectors / From Sets to Boolean Matrices

    Many similarity problems can be formalized as finding subsets that have significant intersection

    Encode sets using 0/1 vectors

    One dimension per element in the universal set

    Interpret set intersection as bitwise AND, and set union as bitwise OR

    Example: C1 = 10111; C2 = 10011

      Size of intersection = 3; size of union = 4,

      Jaccard similarity = 3/4

      𝑑(𝐶1, 𝐶2 )= 1 – (Jaccard similarity) = 1/4
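
      The same computation in the bit-vector view, using Python integers as 0/1 vectors so that intersection becomes bitwise AND and union becomes bitwise OR (purely illustrative):

        c1 = 0b10111
        c2 = 0b10011

        inter = bin(c1 & c2).count("1")   # bitwise AND -> size of intersection = 3
        union = bin(c1 | c2).count("1")   # bitwise OR  -> size of union = 4
        print(inter / union)              # Jaccard similarity = 0.75
        print(1 - inter / union)          # Jaccard distance  = 0.25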

    Boolean Matrices:

      Rows = elements (shingles)

      Columns = sets (documents)

      1 in row e and column s if and only if e is a member of s

      Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)

      Each document is a column

      Example:
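
      For instance (an illustrative stand-in for the original example figure), with four shingles and three documents:

                    D1  D2  D3
        shingle 1    1   0   1
        shingle 2    1   1   0
        shingle 3    0   1   1
        shingle 4    1   1   0

      Columns D1 and D2 share shingle 2 and shingle 4 and together cover all four shingles, so sim(D1, D2) = 2/4 = 1/2.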

      

 

  1.2.3 Finding Similar Columns

    Challenges: too many long columns to compare

    1.2.3.1 Approach: 

    1) Signatures of columns: small summaries of columns
    2) Examine pairs of signatures to find similar columns
       (Comparing all pairs of signatures may still take too much time; that is the job for LSH)
    3) Optional: Check whether columns with similar signatures are really similar

    1.2.3.2 Key Ideas for hashing columns (Signatures)

    Key idea 1: “hash” each column C to a small signature h(C), such that:

      1) h(C) is small enough that the signature fits in RAM

      2) 𝑠𝑖𝑚(𝐶1, 𝐶2 ) is the same as the “similarity” of signatures ℎ(𝐶1) and ℎ(𝐶2)

    Key idea 2: Hash documents into buckets, and expect that “most” pairs of near duplicate docs hash into the same bucket!

    1.2.3.3 Goal for Min-Hashing:

      Find a hash function h(·) such that:

      If 𝑠𝑖𝑚(𝐶1, 𝐶2) is high, then with high prob. ℎ(𝐶1 )=ℎ(𝐶2)

      If 𝑠𝑖𝑚(𝐶1, 𝐶2) is low, then with high prob. ℎ(𝐶1 )≠ℎ(𝐶2)

      Clearly, the hash function depends on the similarity metric. Not all similarity metrics have a suitable hash function. For Jaccard similarity: Min-hashing

    1.2.3.4 Solution:

      Imagine the rows of the boolean matrix permuted under random permutation 𝜋

      Define a “hash” function hπ(C) = the number of the first row (in the permuted order π) in which column C has value 1, i.e. hπ(C) = min{ π(i) : C(i) = 1 }

      Use several (e.g., 100) independent hash functions to create a signature of a column

    Example:

    Example: consider the third column of the input matrix and the second permutation. Start at the row labeled "1" (the first row in the permuted order); for this column that row holds a 0, so move on to the row labeled "2", which is also 0, and then to the row labeled "3", again 0. The row labeled "4" is the first row in the permuted order in which the column has a 1, so the min-hash value of this column under this permutation is 4.

    In conclusion, the min-hash value of a column is the smallest label in the permutation π whose row of the input matrix contains a 1 in that column
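
    The walk described above can be written as a short Python sketch (function and variable names are illustrative; the column and permutation are made up, not the ones from the original example):

      import random

      def minhash_one_permutation(column, perm):
          # column[i] is 0/1; perm[i] is the label of row i in the permuted order.
          # The min-hash value is the smallest label among the rows where the column has a 1.
          return min(perm[i] for i, bit in enumerate(column) if bit == 1)

      column = [0, 0, 1, 0, 1]   # one column of the input matrix
      perm = list(range(1, 6))   # row labels 1..5
      random.shuffle(perm)       # a random permutation of the row order
      print(perm, "->", minhash_one_permutation(column, perm))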

    

 

    1.2.3.5 Property:

    Choose a random permutation 𝜋

    Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)

    1.2.3.6 Proof:

    Restrict attention to the rows of columns C1 and C2; each row is one of three types:
      • Type X: 1 in both C1 and C2
      • Type Y: 1 in exactly one of C1 and C2
      • Type Z: 0 in both

    Let x be the number of type-X rows and y the number of type-Y rows; then sim(C1, C2) = x / (x + y).

    Scan the rows in the permuted order π. Type-Z rows affect neither hπ(C1) nor hπ(C2), so only the first non-Z row matters: if it is of type X, then hπ(C1) = hπ(C2); if it is of type Y, the two values differ. Since π is a uniformly random permutation, the first non-Z row is equally likely to be any of the x + y non-Z rows, hence

    Pr[hπ(C1) = hπ(C2)] = x / (x + y) = sim(C1, C2)
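
    The claim can also be checked empirically with a small Monte-Carlo simulation over many random permutations (an illustrative sketch; the two columns are made up):

      import random

      def minhash(column, perm):
          return min(perm[i] for i, bit in enumerate(column) if bit)

      def collision_rate(c1, c2, trials=20000):
          # Estimate Pr[h_pi(C1) = h_pi(C2)] over random permutations pi
          n, hits = len(c1), 0
          for _ in range(trials):
              perm = list(range(n))
              random.shuffle(perm)
              hits += minhash(c1, perm) == minhash(c2, perm)
          return hits / trials

      c1 = [1, 0, 1, 1, 0, 1]
      c2 = [1, 0, 0, 1, 1, 1]
      inter = sum(a & b for a, b in zip(c1, c2))
      union = sum(a | b for a, b in zip(c1, c2))
      print("exact Jaccard:", inter / union)        # 3/5 = 0.6
      print("estimated:", collision_rate(c1, c2))   # close to 0.6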

   

 

 

    

 

     1.2.3.7 Similarity for Signatures

    The similarity of two signatures is the fraction of the hash functions in which they agree

    Example:
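
    For instance (illustrative numbers, standing in for the original figure): suppose three permutations give the signatures sig(C1) = [2, 1, 5] and sig(C2) = [2, 3, 5]. The signatures agree in 2 of the 3 positions, so the signature similarity is 2/3, which serves as the estimate of sim(C1, C2).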

    

 

 

 

    Pick K=100 random permutations of the rows
    Think of sig(C) as a column vector
    sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C:
    sig(C)[i] = min(πi(C))
    Note: The sketch (signature) of document C is small -- ~100 bytes!
    We achieved our goal! We “compressed” long bit vectors into short signatures
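
    Putting the pieces together, a minimal sketch of building a signature matrix from K random permutations and comparing signatures (the names are illustrative; practical implementations usually replace explicit permutations with K random hash functions applied to row indices):

      import random

      def signature_matrix(columns, num_perms=100, seed=0):
          # columns: list of 0/1 column vectors of equal length (rows = shingles)
          # Returns one signature (list of num_perms min-hash values) per column
          rng = random.Random(seed)
          n_rows = len(columns[0])
          sigs = [[] for _ in columns]
          for _ in range(num_perms):
              perm = list(range(n_rows))
              rng.shuffle(perm)
              for c, col in enumerate(columns):
                  sigs[c].append(min(perm[i] for i, bit in enumerate(col) if bit))
          return sigs

      def signature_similarity(s1, s2):
          # Fraction of hash functions (positions) in which the two signatures agree
          return sum(a == b for a, b in zip(s1, s2)) / len(s1)

      cols = [[1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]]
      sigs = signature_matrix(cols, num_perms=100)
      print(signature_similarity(sigs[0], sigs[1]))   # close to the true Jaccard 0.6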

1.3 Locality Sensitive Hashing

 1.3.1 First Cut

  Goal: Find documents with Jaccard similarity at least s 

  General idea: Use a (hash) function f(x, y) that tells whether x and y form a candidate pair: a pair of elements whose similarity must be evaluated

  For minhash matrices:

      Hash columns of signature matrix to many buckets

      Each pair of documents that hashes into the same bucket is a candidate pair (for further examination)

  How to find Candidate pairs:  

      Pick a similarity threshold s (0 < s < 1)

      Columns x and y of signature matrix M are a candidate pair if their signatures agree on at least fraction s of their rows:

      M(i, x) = M(i, y) for at least a fraction s of the values of i

      We expect the similarity of documents x and y to be approximately the same as the similarity of their signatures

  1.3.2 LSH solution

    1.3.2.1 Partition M into b bands

 

  Divide matrix M into b bands of r rows

  For each band, hash its portion of each column to a hash table with k buckets

    Make k as large as possible

    Assuming two vectors hash to the same bucket if and only if they are identical

  Candidate column pairs are those that hash to the same bucket for ≥1 band

  Tune b and r to catch most similar pairs, but few non-similar pairs
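
  A minimal sketch of the banding step: hash each band of every column's signature into buckets and collect candidate pairs (a Python dict keyed by the band's values is used as an illustrative stand-in for a hash table with k buckets):

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidate_pairs(signatures, b, r):
        # signatures: one signature of length b*r per column
        # Returns pairs of column indices that share a bucket in at least one band
        assert all(len(sig) == b * r for sig in signatures)
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for col, sig in enumerate(signatures):
                band_slice = tuple(sig[band * r:(band + 1) * r])  # this band's portion of the column
                buckets[band_slice].append(col)
            for cols_in_bucket in buckets.values():
                candidates.update(combinations(cols_in_bucket, 2))
        return candidates

    # e.g. with 100-value signatures (as in the earlier sketch) and b = 20 bands of r = 5 rows:
    # pairs = lsh_candidate_pairs(sigs, b=20, r=5)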

 

  1.3.2.2 Assumption:

    There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
    Hereafter, we assume that “same bucket” means “identical in that band”

    Assumption needed only to simplify analysis, not for correctness of algorithm

  1.3.2.3 Tradeoff to balance false positives and false negatives

  What influences the tradeoff:

    the number of minhashes (rows of M)

    the number of bands b

    the number of rows r per band

  Probability of true positive:

  Suppose columns C1 and C2 have Jaccard similarity s. Within one band of r rows, their signatures agree on all r rows with probability s^r, so the probability that they do NOT land in the same bucket in any of the b bands is (1 − s^r)^b. Hence

  Pr[C1 and C2 become a candidate pair] = 1 − (1 − s^r)^b

  For example, with b = 20 bands of r = 5 rows (100 min-hashes), a pair with s = 0.8 becomes a candidate with probability 1 − (1 − 0.8^5)^20 ≈ 0.9996 (few false negatives), while a pair with s = 0.3 becomes a candidate only with probability 1 − (1 − 0.3^5)^20 ≈ 0.047 (few false positives). Plotted as a function of s, this probability forms an S-shaped curve whose steep rise can be tuned (via b and r) to sit near the chosen similarity threshold.
