Word Mover's Distance Family: Text Similarity Evaluation Methods
Word Mover's Distance Family (as of 2020):
- WCD, word centroid distance
- WMD, word mover's distance
- S-WCD, supervised word centroid distance
- S-WMD, supervised word mover's distance
The WMD is a special case of the Earth Mover's Distance (EMD, also known as the Wasserstein distance).
Word Mover's Distance, WMD
From Word Embeddings To Document Distances, Kusner et al. 2015
Let \(\bm W\in\R^{d\times n}\) be a word embedding matrix, where \(n\) is the vocabulary size and \(d\) is the embedding dimension. Let \(\bm w_i\in\R^d\) denote the embedding of the \(i\)-th word (the \(i\)-th column of \(\bm W\)), and let \(c(\bm w_i, \bm w_j)\in\R\) be the distance between word \(i\) and word \(j\) (for example, the Euclidean distance). Let \(\bm d^a, \bm d^b\in\R^n\) be the \(n\)-dimensional normalized bag-of-words (BOW) vectors of two documents, where \(d_i^a\) is the occurrence count of word \(i\) in document \(\bm d^a\), normalized over all words in \(\bm d^a\). Because the entries of each vector are non-negative and sum to \(1\), the vectors lie on the \((n-1)\)-dimensional simplex rather than filling \(\R^n\). The WMD introduces a 'transport' matrix \(\bm T\in\R^{n\times n}\) with non-negative entries, where \(T_{i,j}\) denotes how much of word \(i\) in document \(\bm d^a\) is transported to word \(j\) in document \(\bm d^b\). The WMD searches for the \(\bm T\) that minimizes the total transport cost:
\[
\min_{\bm T\ge 0}\ \sum_{i,j=1}^{n} T_{i,j}\, c(\bm w_i, \bm w_j)
\quad\text{s.t.}\quad \sum_{j=1}^{n} T_{i,j} = d_i^a\ \ \forall i, \qquad \sum_{i=1}^{n} T_{i,j} = d_j^b\ \ \forall j.
\]
(The distance between the two documents is the minimum value of this optimization problem.)
When the Euclidean distance is used as the word distance:
\[
c(\bm w_i, \bm w_j) = \lVert \bm w_i - \bm w_j \rVert_2
\]
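time complexity: \(O(p^3\log p)\), where \(p\) is the number of unique words in the two documents (the best average-case complexity reported by Kusner et al.)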
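The optimization problem above is a linear program, so a small instance can be solved with a general-purpose LP solver. The following is a minimal sketch, not the authors' implementation: the function name `wmd` and the toy embeddings are made up for illustration, the Euclidean distance is assumed as the word distance, and \(\bm T\) is restricted to the words that actually occur in each document (rows and columns for words with zero weight carry no mass, so this does not change the minimum).

```python
import numpy as np
from scipy.optimize import linprog


def wmd(emb_a, d_a, emb_b, d_b):
    """Solve the WMD transport problem as a linear program.

    emb_a: (p, d) embeddings of the unique words in document a
    d_a:   (p,) normalized BOW weights of those words (sums to 1)
    emb_b: (q, d) embeddings of the unique words in document b
    d_b:   (q,) normalized BOW weights of those words (sums to 1)
    Returns the minimum transport cost and the (p, q) transport matrix.
    """
    p, q = len(d_a), len(d_b)
    # Pairwise Euclidean word distances c(w_i, w_j), shape (p, q).
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)

    # T is flattened row-major: variable i * q + j holds T[i, j].
    a_eq = np.zeros((p + q, p * q))
    for i in range(p):
        a_eq[i, i * q:(i + 1) * q] = 1.0  # sum_j T[i, j] = d_a[i]
    for j in range(q):
        a_eq[p + j, j::q] = 1.0           # sum_i T[i, j] = d_b[j]
    b_eq = np.concatenate([d_a, d_b])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(p, q)


# Toy example with 2-dimensional embeddings (values are illustrative only).
emb_a = np.array([[0.0, 0.0], [1.0, 0.0]])  # document a: two words
d_a = np.array([0.5, 0.5])
emb_b = np.array([[0.0, 1.0]])              # document b: one word
d_b = np.array([1.0])

dist, plan = wmd(emb_a, d_a, emb_b, d_b)
print(dist)  # about 1.2071, i.e. 0.5 * 1 + 0.5 * sqrt(2)
```

Here all mass of both words in document a has to move to the single word in document b, so the transport matrix is forced and the cost can be checked by hand. A dense LP like this is only practical for short documents; dedicated optimal transport solvers are the usual choice for larger ones.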
Supervised Word Mover's Distance, S-WMD
Source code on GitHub (in MATLAB) by the authors
Paper at NeurIPS (NIPS) 2016
\(\bm A\) : a linear transformation matrix applied to the word embedding space, \(\bm w_i \mapsto \bm A\bm w_i\) (learned parameter)
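\(\bm u\) : a histogram importance vector (learned parameter) that re-weights the word histogram values to reflect how important each word is for distinguishing the classes. The re-weighted histogram is renormalized so that it still sums to \(1\):
\[
\tilde{\bm d}^a = \frac{\bm u \odot \bm d^a}{\bm u^\top \bm d^a}
\]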
where \(\odot\) denotes the element-wise Hadamard product.
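As a small numeric illustration of the re-weighting step, the sketch below assumes the form \(\tilde{\bm d}^a = (\bm u \odot \bm d^a)/(\bm u^\top \bm d^a)\), that is, an element-wise product followed by renormalization. The function name `reweight` and the example values are hypothetical.

```python
import numpy as np


def reweight(d, u):
    """Re-weight a normalized BOW histogram d with importance weights u.

    Computes (u * d) / (u . d), so the result sums to 1 again.
    """
    weighted = u * d          # element-wise (Hadamard) product
    total = weighted.sum()    # equals the inner product u^T d
    if total <= 0:
        raise ValueError("u^T d must be positive")
    return weighted / total


d_a = np.array([0.5, 0.25, 0.25])  # normalized BOW of a 3-word vocabulary
u = np.array([1.0, 2.0, 1.0])      # word 2 is treated as twice as important

d_tilde = reweight(d_a, u)
print(d_tilde)  # [0.4 0.4 0.2]
```

Doubling the importance of the second word raises its share from 0.25 to 0.4, and the other entries shrink so that the histogram remains on the simplex.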
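loss function (kNN-LOO error, a stochastic leave-one-out nearest-neighbor objective in the style of neighborhood components analysis):
\[
\ell(\bm A, \bm u) = -\sum_{a} \log\Big(\sum_{b\ne a} \delta_{ab}\, p_{ab}\Big),
\qquad
p_{ab} = \frac{\exp\big(-D_{\bm A,\bm u}(\bm d^a, \bm d^b)\big)}{\sum_{c\ne a}\exp\big(-D_{\bm A,\bm u}(\bm d^a, \bm d^c)\big)}
\]
where \(D_{\bm A,\bm u}\) is the WMD computed on the transformed embeddings \(\bm A\bm w_i\) and the re-weighted histograms, \(y_a\) is the class label of document \(a\), and \(\delta_{ab}=1\) if \(y_a=y_b\) and \(\delta_{ab}=0\) otherwise (an indicator of whether \(y_a= y_b\)).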
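To make the role of \(\delta_{ab}\) concrete, here is a minimal sketch of a leave-one-out loss evaluated on a precomputed document distance matrix. It assumes the NCA-style form \(-\sum_a \log \sum_{b\ne a}\delta_{ab}\,p_{ab}\), with \(p_{ab}\) a softmax over negative distances from document \(a\) to every other document. The function name `loo_loss` and the toy distances are hypothetical, and the distances are taken as given rather than computed with S-WMD.

```python
import numpy as np


def loo_loss(dist, labels):
    """Stochastic leave-one-out nearest-neighbor loss (NCA-style).

    dist:   (m, m) pairwise document distances
    labels: (m,) class label y_a of each document
    """
    logits = -np.asarray(dist, dtype=float)
    np.fill_diagonal(logits, -np.inf)             # leave-one-out: skip b == a
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # p[a, b], rows sum to 1

    delta = labels[:, None] == labels[None, :]    # delta[a, b] = 1 iff y_a == y_b
    p_correct = (p * delta).sum(axis=1)           # mass on same-class neighbors
    # Note: a document with no same-class neighbor gives log(0) = -inf.
    return -np.log(p_correct).sum()


# Four documents in two classes: distance 1 within a class, 3 across classes.
labels = np.array([0, 0, 1, 1])
dist = np.array([
    [0.0, 1.0, 3.0, 3.0],
    [1.0, 0.0, 3.0, 3.0],
    [3.0, 3.0, 0.0, 1.0],
    [3.0, 3.0, 1.0, 0.0],
])

loss = loo_loss(dist, labels)
print(loss)  # about 0.958, i.e. 4 * log(1 + 2 * exp(-2))
```

Each document has one same-class neighbor at distance 1 and two other-class neighbors at distance 3, so every row contributes \(\log(1 + 2e^{-2})\). Shrinking within-class distances or growing across-class distances lowers the loss, which is what learning \(\bm A\) and \(\bm u\) aims for.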