cs224n word2vec

Summary:

  • Where the original idea of word2vec comes from: the distributional hypothesis;
  • The original word2vec objective, based on a likelihood function;
  • How the probability is computed with softmax, and a brief explanation of softmax;
  • The two variants of word2vec, and an introduction to the speed-up methods;
  • How the sampling in HW2 is done, and why this operation is needed;
  • Methods based on the co-occurrence matrix, mainly LSI (LSA);
  • Some improvements on LSI;

Description

We want to encode the meaning of a word into a vector representation.

Distributional semantics

A word’s meaning is given by the words that frequently appear close-by

The idea behind word2vec is simple: assume that a word's meaning is related to its context, so the context around a word can be used to represent the word's semantic information.

Note: this is a reasonable way to think about it, but is there a better assumption? Does the surrounding context necessarily represent the current word well?

Introduction to word2vec

Framework:

  • A large corpus of text;
  • Every word in the vocabulary is represented by a vector;
  • Go through each position \(t\) in the text, which gives a center word \(w_t\) and its context \(c_t\);
  • Use the similarity of the word vectors to compute the conditional probability \(p(c|w_t)\);
  • Keep adjusting the word vectors to maximize this probability;

Objective function:

For each position \(t = 1, \dots, T\), predict context words within a
window of fixed size \(m\), given center word \(w_t\):

\[\text{Likelihood} = L(\theta)=\prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P\left(w_{t+j} | w_{t} ; \theta\right) \]

The objective function \(J(\theta)\) is the (average) negative log likelihood:

\[J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} | w_{t} ; \theta\right) \]

Note: this is the standard recipe: first write down the likelihood, then turn it into a loss function.

Question: how do we compute \(P\left(w_{t+j} | w_{t} ; \theta\right)\)?

For a center word \(c\) and a context word \(o\):

\[P(o | c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \]

This puzzled me for a long time; I never knew how this probability should be computed. Seeing this formula, it suddenly made sense: it looks a lot like \(p(y|x)=\frac{p(x,y)}{p(x)}\).
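As a quick sanity check of the formula above, here is a minimal numerical sketch (random toy vectors; the sizes and variable names are my own, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                    # toy vocabulary size and vector dimension
U = rng.normal(size=(V, d))     # outside vectors u_w, one row per word
Vc = rng.normal(size=(V, d))    # center vectors v_w, one row per word

c = 3                           # index of the center word
scores = U @ Vc[c]              # u_w^T v_c for every word w in the vocabulary
P = np.exp(scores) / np.exp(scores).sum()   # P(o | c) for every candidate o

print(P.sum())                  # 1.0 -- the scores have become a distribution
print(P[7])                     # P(o = 7 | c = 3)
```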

The softmax function:

The probability formula above is in fact a softmax function:

\[\operatorname{softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j=1}^{n} \exp \left(x_{j}\right)}=p_{i} \]

The softmax function maps arbitrary values \(x_i\) to a probability distribution \(p_i\):

  • “max” because amplifies probability of largest \(x_i\)
  • “soft” because still assigns some probability to smaller \(x_i\)
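Putting the objective and the softmax together, here is a purely illustrative sketch (toy corpus, randomly initialized vectors, no training loop) that evaluates the average negative log-likelihood \(J(\theta)\) defined above:

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
d, V = 8, len(vocab)
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))  # center vectors v_w

def log_prob(o, c):
    """log P(o | c), with the softmax taken over the whole vocabulary."""
    scores = U @ Vc[c]
    return scores[o] - np.log(np.exp(scores).sum())

m, T = 2, len(corpus)                    # window size and corpus length
J = 0.0
for t in range(T):                       # every position t ...
    for j in range(-m, m + 1):           # ... and every context offset j != 0
        if j == 0 or not (0 <= t + j < T):
            continue
        J -= log_prob(word2id[corpus[t + j]], word2id[corpus[t]])
J /= T
print(f"J(theta) on the toy corpus: {J:.3f}")   # training would minimize this
```

Training then amounts to taking gradients of this quantity with respect to the word vectors, which is what the next sections set up.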

Parameter optimization

\(\theta\) represents all model parameters, in one long vector. In our case with d-dimensional vectors and V-many words:

\[\theta=\left[ \begin{array}{l}{v_{\text{aardvark}}} \\ {v_{a}} \\ {\vdots} \\ {v_{\text{zebra}}} \\ {u_{\text{aardvark}}} \\ {u_{a}} \\ {\vdots} \\ {u_{\text{zebra}}}\end{array}\right] \in \mathbb{R}^{2 d V} \]

Remember: every word has two vectors.
Why two vectors? Easier optimization. Average both at the end.

Gradients of the word2vec parameters

  • Useful basics:

\[\frac{\partial \mathbf{x}^{T} \mathbf{a}}{\partial \mathbf{x}}=\frac{\partial \mathbf{a}^{T} \mathbf{x}}{\partial \mathbf{x}}=\mathbf{a} \]

  • Chain rule: if \(y=f(u)\) and \(u=g(x)\), i.e. \(y=f(g(x))\), then:

\[\frac{d y}{d x}=\frac{d y}{d u} \frac{d u}{d x} \]
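Combining these two facts gives the gradient of \(\log P(o | c)\) with respect to the center vector \(v_c\) (this is the derivation worked through in the lecture):

\[\frac{\partial}{\partial v_{c}} \log P(o | c)=\frac{\partial}{\partial v_{c}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right)=u_{o}-\sum_{w \in V} P(w | c)\, u_{w} \]

i.e. the observed context vector minus the context vector expected under the model's current distribution.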

Two model variants

  • Skip-grams (SG)
    Predict context ("outside") words (position independent) given the center word
  • Continuous Bag of Words (CBOW)
    Predict center word from (bag of) context words

Speed-up methods

  • Negative sampling
  • Hierarchical softmax

The skip-gram model with negative sampling (HW2)

Problem: the normalization factor (the sum over the whole vocabulary) is too computationally expensive.

Main idea: train binary logistic regressions for a true pair (center
word and a word in its context window) versus several noise pairs
(the center word paired with a random word).

Objective function:

\[J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J_{t}(\theta) \]

\[J_{t}(\theta)=\log \sigma\left(u_{o}^{T} v_{c}\right)+\sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_{j}^{T} v_{c}\right)\right] \]

Notes:

  • take k negative samples (using word probabilities)
  • \(\mathrm{P}(w)=U(w)^{3 / 4} / Z\)
  • The power makes less frequent words be sampled more often
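A minimal, purely illustrative sketch of evaluating \(J_t(\theta)\) with negative sampling (random toy parameters and my own variable names; the HW2 starter code is organized differently):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                    # toy vocab size, dimension, negatives per pair
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))  # center vectors v_w

counts = rng.integers(1, 1000, size=V).astype(float)  # stand-in unigram counts U(w)
p_neg = counts ** 0.75
p_neg /= p_neg.sum()                     # P(w) = U(w)^{3/4} / Z

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def J_t(c, o):
    """log sigma(u_o^T v_c) + sum over k sampled negatives of log sigma(-u_j^T v_c)."""
    neg = rng.choice(V, size=k, p=p_neg, replace=False)   # k noise words drawn from P(w)
    return np.log(sigmoid(U[o] @ Vc[c])) + np.log(sigmoid(-U[neg] @ Vc[c])).sum()

print(J_t(c=3, o=7))   # maximized during training (or negated and minimized)
```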

But why not capture co-occurrence counts directly?

  • 2 options: windows vs. full document
  • Window: Similar to word2vec, use window around each word -> captures both syntactic (POS) and semantic information
  • Word-document co-occurrence matrix will give general topics (all sports terms will have similar entries) leading to "Latent Semantic Analysis"

Problems with simple co-occurrence vectors

  • Increase in size with vocabulary
  • Very high dimensional: requires a lot of storage
  • Subsequent classification models have sparsity issues -> Models are less robust

Solution: Low dimensional vectors

This question boils down to "how do we reduce the dimensionality?".

Method 1: Dimensionality Reduction on X (HW1)

  • Singular Value Decomposition of co-occurrence matrix X
    Factorizes X into \(U\Sigma V^T\), where U and V are orthonormal (see the sketch below)
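A sketch on a toy co-occurrence matrix (invented numbers, not HW1's actual data or interface):

```python
import numpy as np

# Toy symmetric window-based co-occurrence matrix X (rows/columns = words).
X = np.array([[0, 2, 1, 0],
              [2, 0, 0, 1],
              [1, 0, 0, 3],
              [0, 1, 3, 0]], dtype=float)

U, S, Vt = np.linalg.svd(X)      # X = U Sigma V^T, with U and V orthonormal
K = 2                            # keep only the K largest singular values
word_vectors = U[:, :K] * S[:K]  # a K-dimensional vector for each word
print(word_vectors)
```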

Hacks to X

  • Scaling the counts in the cells can help a lot (a small sketch follows this list)
  • Problem: function words (the, he, has) are too frequent -> syntax has too much impact.
    • \(\min(X, t)\), with \(t = 100\)
    • Ignore them all
    • Use Pearson correlations instead of counts, then set negative values to 0
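For example, the count-capping hack above can be a one-liner (toy counts, purely illustrative):

```python
import numpy as np

# Toy counts: a "the"-like word inflates some cells of the matrix.
X = np.array([[  0.,   5., 240.],
              [  5.,   0.,  12.],
              [240.,  12.,   0.]])

t = 100
X_capped = np.minimum(X, t)   # cap the counts: min(X, t) with t = 100
X_logged = np.log1p(X)        # another common scaling: log(1 + count)
print(X_capped)
```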

Count based vs. direct prediction:

  • LSA, HAL, COALS, Hellinger-PCA
    • Fast training
    • Efficient usage of statistics
    • Primarily used to capture word similarity
    • Disproportionate importance given to large counts
  • Skip-gram/CBOW, NNLM, HLBL, RNN
    • Generate improved performance on other tasks
    • Can capture complex patterns beyond word similarity
    • Scales with corpus size
    • Inefficient usage of statistics

Global Vectors for Word Representation (GloVe)

  • The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure.
  • The other set of methods are shallow window-based (e.g. the skip-gram and the CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

Main idea:

GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics;

Algorithm:

Let \(X\) denote the word-word co-occurrence matrix, where \(X_{ij}\) indicates the number of times word \(j\) occurs in the context of word \(i\), and let \(P_{i j}=P\left(w_{j} | w_{i}\right)=\frac{X_{i j}}{X_{i}}\) be the probability of word \(j\) appearing in the context of word \(i\).

The objective of skip-gram is:

\[J=-\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{i j} \]

\[Q_{i j}=\frac{\exp \left(\vec{u}_{j}^{T} \vec{v}_{i}\right)}{\sum_{w=1}^{W} \exp \left(\vec{u}_{w}^{T} \vec{v}_{i}\right)} \]

One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary. Instead, we use a least square objective in which the normalization factors in P and Q are discarded:

\[\hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\hat{P}_{i j}-\hat{Q}_{i j}\right)^{2} \]

where \(\hat{P}_{i j}=X_{i j}\) and \(\hat{Q}_{i j}=\exp \left(\vec{u}_{j}^{T} \vec{v}_{i}\right)\) are the unnormalized distributions.

This formulation introduces a new problem: \(X_{ij}\) often takes on very large values, which makes the optimization difficult. An effective way to handle this is to minimize the squared error of the logarithms of \(\hat{P}\) and \(\hat{Q}\):

\[\begin{aligned} \hat{J} &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\log \hat{P}_{i j}-\log \hat{Q}_{i j}\right)^{2} \\ &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \end{aligned} \]

Another observation is that the weighting factor \(X_{i}\) is not guaranteed to be optimal. Instead, we introduce a more general weighting function, which we are free to take to depend on the context word as well:

\[\hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} f\left(X_{i j}\right)\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \]
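As a sketch of this final objective (toy random data; \(x_{\max}=100\) and \(\alpha=3/4\) are the values used in the GloVe paper, and the bias terms of the full GloVe model are omitted, matching the simplified objective above):

```python
import numpy as np

rng = np.random.default_rng(0)
W, d = 200, 50                                   # toy vocab size and dimension
X = rng.poisson(2.0, size=(W, W)).astype(float)  # stand-in co-occurrence counts X_ij
U = rng.normal(scale=0.1, size=(W, d))           # context vectors u_j
V = rng.normal(scale=0.1, size=(W, d))           # center vectors v_i

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: grows with the count but saturates at x_max."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(U, V, X):
    i, j = np.nonzero(X)                         # only pairs with X_ij > 0 contribute
    diff = np.sum(V[i] * U[j], axis=1) - np.log(X[i, j])   # u_j^T v_i - log X_ij
    return np.sum(f(X[i, j]) * diff ** 2)

print(glove_loss(U, V, X))
```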

word2vec evaluation methods

