Word Mover's Distance Family: Text Similarity Evaluation Methods

The Word Mover's Distance family (methods up to 2020):

  • WCD, word centroid distance
  • WMD, word mover's distance
  • S-WCD, supervised word centroid distance
  • S-WMD, supervised word mover's distance

The WMD is a special case of the Earth Mover's Distance (EMD, also known as the Wasserstein distance).
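WCD, the cheapest member of the family, represents each document by its embedding-weighted centroid \(\bm W\bm d\) and measures the Euclidean distance between centroids; Kusner et al. show that WCD lower-bounds WMD, which makes it useful for cheaply pruning candidates before running the full WMD. A minimal NumPy sketch (the function name and array shapes are my own choices):

```python
import numpy as np

def wcd(W, da, db):
    """Word Centroid Distance: Euclidean distance between the
    embedding-weighted centroids W @ d of two documents.
    W: (d, n) embedding matrix; da, db: (n,) normalized BOW vectors."""
    return np.linalg.norm(W @ da - W @ db)
```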

Word Mover's Distance, WMD

From Word Embeddings To Document Distances, Kusner et al. 2015

Let \(\bm W\in\R^{d\times n}\) be a word embedding matrix, where \(n\) is the vocabulary size and \(d\) is the embedding dimension. Let \(\bm w_i\in\R^d\) denote the \(i\)-th word, \(c(\bm w_i, \bm w_j)\in\R\) the distance between word \(i\) and word \(j\) (for example the Euclidean distance), and \(\bm d^a, \bm d^b\in\R^n\) the \(n\)-dimensional normalized bag-of-words (nBOW) vectors of two documents, where \(d_i^a\) is the occurrence count of word \(i\) in document \(\bm d^a\), normalized over all words in \(\bm d^a\). (Since the entries of each document vector sum to \(1\), the representation actually lives on the \((n-1)\)-dimensional simplex rather than in full \(n\)-dimensional space.) The WMD introduces a transport matrix \(\bm T\in\R^{n\times n}\), where \(T_{i,j}\) denotes how much of word \(i\)'s mass in document \(\bm d^a\) is transported to word \(j\) in document \(\bm d^b\). The WMD searches over \(\bm T\):

\[\mathrm{distance}(\bm d^a, \bm d^b) := \min_{\bm T\ge 0} \sum_{i,j=1}^n T_{i,j}\, c(\bm w_i, \bm w_j) \\ \text{ s.t. } \sum_{j=1}^n T_{i,j}=d_i^a \ \forall i, \quad \sum_{i=1}^n T_{i,j}=d_j^b \ \forall j \]

(The distance between the two documents is the minimum value of this optimization problem: the cheapest way to move all of \(\bm d^a\)'s word mass onto \(\bm d^b\).)
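The problem above is an ordinary linear program and can be solved directly. A from-scratch sketch with SciPy's `linprog`, assuming a precomputed word-distance matrix \(\bm C\) with \(C_{i,j}=c(\bm w_i,\bm w_j)\) (the helper name is my own):

```python
import numpy as np
from scipy.optimize import linprog

def wmd_lp(C, da, db):
    """Solve the transport LP: min <T, C> s.t. T >= 0,
    row sums of T equal da and column sums equal db.
    C: (n, n) word distances; da, db: (n,) nBOW histograms."""
    n = C.shape[0]
    # with t = T.ravel() (row-major), sum_j T[i, j] = da[i] ...
    A_rows = np.kron(np.eye(n), np.ones((1, n)))
    # ... and sum_i T[i, j] = db[j]
    A_cols = np.kron(np.ones((1, n)), np.eye(n))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([da, db])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun  # minimal transport cost
```

This is only for illustration; the LP has \(n^2\) variables, so dedicated EMD solvers (see the sketch below) are far more practical.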

With the Euclidean distance as the word distance:

\[D(\bm d^a, \bm d^b) := \min_{\bm T\ge 0} \sum_{i,j=1}^n T_{i,j} \|\bm w_i - \bm w_j\|_2 \\ \text{ s.t. } \sum_{j=1}^n T_{i,j}=d_i^a \ \forall i, \quad \sum_{i=1}^n T_{i,j}=d_j^b \ \forall j \]

time complexity: \(O(p^3\log p)\), where \(p\) is the number of unique words in the documents (the scaling of the best known exact EMD solvers)
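In practice one calls an exact EMD solver rather than a generic LP solver, and restricts the vocabulary to the words that actually occur in the two documents. A sketch using the POT (Python Optimal Transport) library; `gensim`'s `KeyedVectors.wmdistance` wraps essentially the same computation:

```python
import ot  # POT: Python Optimal Transport (pip install pot)

def wmd(W, da, db):
    """WMD as the exact optimal-transport cost between nBOW histograms.
    W: (d, n) embedding matrix; da, db: (n,) histograms summing to 1."""
    # ground cost c(w_i, w_j): pairwise Euclidean distances between embeddings
    M = ot.dist(W.T, W.T, metric='euclidean')
    # ot.emd2 solves the transport LP and returns min_T sum_ij T_ij * M_ij
    return ot.emd2(da, db, M)
```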

Supervised Word Mover's Distance, S-WMD

source code on GitHub (in MATLAB) by the authors

paper: Supervised Word Mover's Distance, Huang et al., NeurIPS (NIPS) 2016

\(\bm A\): a learned linear transformation of the word embedding space, \(\bm w_i \mapsto \bm A\bm w_i\) (parameter)

\(\bm u\): a histogram importance vector that re-weights the word histogram values to reflect how important each word is for distinguishing the classes:

\[\tilde {\bm d}^a = (\bm u\odot \bm d^a)/(\bm u^\top \bm d^a) \]

where \(\odot\) denotes the element-wise (Hadamard) product.

\[D_{\bm A,\bm u}(\bm d^a, \bm d^b) := \min_{\bm T\ge 0} \sum_{i,j=1}^n T_{i,j} \|\bm A(\bm w_i - \bm w_j)\|^2 \\ \text{ s.t. } \sum_{j=1}^n T_{i,j}=\tilde{d}_i^a \ \forall i, \quad \sum_{i=1}^n T_{i,j}=\tilde{d}_j^b \ \forall j \]
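Putting the two ingredients together, evaluating \(D_{\bm A,\bm u}\) reduces to an ordinary WMD call on transformed embeddings and re-weighted histograms. A sketch, again with POT (the function and variable names are my own):

```python
import numpy as np
import ot

def swmd_distance(A, u, W, da, db):
    """D_{A,u}: WMD with transformed embeddings A @ w_i and
    re-weighted histograms d~ = (u * d) / (u^T d).
    A: (d', d); u: (n,) positive weights; W: (d, n); da, db: (n,)."""
    da_t = (u * da) / (u @ da)   # re-weighted histogram for d^a
    db_t = (u * db) / (u @ db)   # re-weighted histogram for d^b
    AW = (A @ W).T               # (n, d') transformed embeddings
    # ground cost ||A (w_i - w_j)||^2: squared Euclidean after transforming
    M = ot.dist(AW, AW, metric='sqeuclidean')
    return ot.emd2(da_t, db_t, M)
```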

loss function (a softmax relaxation of the kNN leave-one-out (LOO) classification error):

\[l(\bm A,\bm u)= - \sum_{a=1}^m \log \left( \sum_{b\ne a:\, y_b=y_a} \frac{\exp(-D_{\bm A,\bm u}(\bm d^a, \bm d^b))}{\sum_{c\ne a}\exp(-D_{\bm A,\bm u}(\bm d^a, \bm d^c))} \right) \]

\[\frac{\partial\, l(\bm A,\bm u)}{\partial (\bm A,\bm u)} =\sum_{a=1}^m\sum_{b\ne a} \frac{p_{ab}}{p_a}(\delta_{ab} - p_a) \frac{\partial}{\partial (\bm A,\bm u)}D_{\bm A,\bm u}(\bm d^a, \bm d^b) \]

where \(p_{ab}\) is the softmax term inside the loss above, \(p_a=\sum_{b\ne a:\, y_b=y_a} p_{ab}\), and \(\delta_{ab}\) is the indicator of \(y_a=y_b\) (\(\delta_{ab}=1\) if \(y_a=y_b\), and \(\delta_{ab}=0\) otherwise).
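Given a precomputed matrix of pairwise document distances, the loss and the \(p_{ab}\), \(p_a\) terms feeding the gradient can be evaluated directly. A NumPy sketch (names are my own; it assumes \(D\) has been computed with the current \(\bm A, \bm u\)):

```python
import numpy as np

def knn_loo_loss(D, y):
    """Softmax-relaxed kNN leave-one-out loss.
    D: (m, m) with D[a, b] = D_{A,u}(d^a, d^b); y: (m,) class labels."""
    m = D.shape[0]
    P = np.exp(-D)
    np.fill_diagonal(P, 0.0)                  # exclude c = a from the softmax
    P = P / P.sum(axis=1, keepdims=True)      # P[a, b] = p_ab
    same = (y[:, None] == y[None, :])
    np.fill_diagonal(same, False)             # b != a with y_b = y_a
    p_a = (P * same).sum(axis=1)              # mass on same-class neighbours
    return -np.log(p_a).sum()
```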
