【刘知远 NLP Course Notes】Word Representation

Word representation is the process of transforming word symbols into machine-understandable meanings. The goals of word representation are:

  1. Compute word similarity

WR(Star) ≃ WR(Sun)
WR(Motel) ≃ WR(Hotel)

  2. Infer word relation

WR(China) − WR(Beijing) ≃ WR(Japan) - WR(Tokyo)
WR(Man) ≃ WR(King) − WR(Queen) + WR(Woman)
WR(Swimming) ≃ WR(Walking) − WR(Walk) + WR(Swim)

Now we start to discuss some ways of obtaining word representations.

1. Use Synonyms and Hypernyms

One approach is to represent a word by its synonyms and hypernyms, e.g. using WordNet, a resource containing synonym and hypernym sets.

However, this approach has several problems:

  • Missing nuance

    ("proficient", "good") are synonyms only in some contexts

  • Missing new meanings of words

    Apple (fruit → IT company)

  • Subjective

  • Data sparsity

  • Requires human labor to create and adapt

2. One-Hot Representation

Regard words as discrete symbols.

  • Vector dimension = # words in vocabulary
  • A token with a larger ID value is not more or less important than a token with a smaller ID value.

The problem is that all the vectors are orthogonal to each other, so there is no natural notion of similarity for one-hot vectors.

\[\text{similarity}(\text{star}, \text{sun}) = v_{\text{star}} \cdot v_{\text{sun}} = 0 \]
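
As a minimal sketch of this (plain NumPy, with a toy vocabulary assumed here for illustration), the dot product of any two distinct one-hot vectors is always zero:

```python
import numpy as np

# Toy vocabulary: each word gets its own dimension (word-to-ID mapping is assumed).
vocab = {"star": 0, "sun": 1, "motel": 2, "hotel": 3}

def one_hot(word, vocab):
    """Return the one-hot vector of a word over the given vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

# All one-hot vectors are orthogonal, so "star" looks no more
# similar to "sun" than it does to "motel".
print(one_hot("star", vocab) @ one_hot("sun", vocab))     # 0.0
print(one_hot("motel", vocab) @ one_hot("hotel", vocab))  # 0.0
```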

3. Represent Word by Context

The core idea is that a word's meaning is given by the words that frequently appear close by, which is one of the most successful ideas of modern statistical NLP.

"You shall know a word by the company it keeps." (J.R. Firth 1957: 11)

There are two ways to achieve this:

  • Co-occurrence Counts
  • Word Embeddings

Count-based distributional representation

1) Term-Term matrix

Measures how often a word occurs together with another word.

\[C(i, j)=\sum_{\Delta x=-m}^{m} \sum_{x=-\Delta x}^{n-\Delta x-1} \begin{cases} 1, & \text{if } \Delta x \neq 0,\ I(x)=i \text{ and } I(x+\Delta x)=j \\ 0, & \text{otherwise} \end{cases}\]

where \(n\) is the length of the sentence \(I\) and \(m\) is the size of the window.

This captures both syntactic and semantic information.

The shorter the window, the more syntactic the representation (\(m \in [1,3]\)); the longer the window, the more semantic the representation (\(m \in [4,10]\)).
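
As a rough sketch of this count (the corpus and window size below are toy values, not from the lecture), a term-term matrix can be built directly from tokenized sentences:

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Build a symmetric term-term co-occurrence matrix C with a fixed window size."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    C = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in sentences:
        for x, word in enumerate(sent):
            lo, hi = max(0, x - window), min(len(sent), x + window + 1)
            for y in range(lo, hi):
                if y != x:                      # skip the word itself (Δx ≠ 0)
                    C[vocab[word], vocab[sent[y]]] += 1
    return C, vocab

# Toy corpus, for illustration only.
sentences = [["i", "like", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "enjoy", "flying"]]
C, vocab = cooccurrence_matrix(sentences, window=1)
print(C[vocab["i"], vocab["like"]])  # "i" and "like" co-occur twice
```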

2) Term-Document matrix

Measures how often a word occurs in each document. Each document is a count vector in \(\mathbb{N}^{V}\) (a column) and each word is a count vector in \(\mathbb{N}^{D}\) (a row).

In large corpora, we expect to see that two words are similar if their vectors are similar and two documents are similar if their vectors are similar.

Given two word vectors \(v\) and \(w\), we can use the dot product to measure their similarity. We then normalize by the vectors' lengths, since a vector's length reflects how frequent the word is in the documents.

Therefore, we often use cosine similarity in practice,

\[\operatorname{cosine\_similarity}(v, w)=\frac{v \cdot w}{|v|\,|w|} \]
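
A minimal NumPy sketch of this formula (the count vectors below are made-up toy values, not real corpus statistics):

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product normalized by vector lengths, so raw frequency does not dominate."""
    return (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Toy term-document count vectors (assumed values).
v_star = np.array([3.0, 0.0, 1.0, 7.0])
v_sun = np.array([2.0, 0.0, 0.0, 9.0])
print(cosine_similarity(v_star, v_sun))  # ~0.97: similar usage across documents
```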

Problems of Count-Based Representation

  • Increase in size with vocabulary
  • Require a lot of storage
  • Sparsity issues for less frequent words, which make classification models less robust

We prefer dense vectors to sparse ones, because dense vectors generalize better and short vectors are easier to use as features in machine learning.

One way to obtain dense vectors is SVD (Singular Value Decomposition), described below.

An \(m \times n\) matrix \(M\) can be decomposed as \(M = U \Sigma V^{*}\), where \(U\) is an \(m \times k\) matrix whose rows correspond to the original rows, but whose \(k\) columns each represent a dimension of a new latent space; these \(k\) column vectors are orthogonal to each other.
\(\Sigma\) is a \(k \times k\) diagonal matrix of singular values expressing the importance of each dimension.
\(V^{*}\) is a \(k \times n\) matrix whose columns correspond to the original columns and whose \(k\) rows correspond to the singular values.

SVD can be applied to the term-term matrix (co-occurrence matrix) to create dense vectors.

If \(k\) is still not small enough, we can keep only the top-\(k\) singular values (e.g. 300) to obtain a least-squares approximation of \(M\). In this way, we reduce the dimension of the word representations.

In the lecture diagram, \(X\) is a term-term matrix, and each row of \(W\) is a \(k\)-dimensional representation of a word \(w\).
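
A sketch of this reduction with NumPy's SVD (the co-occurrence matrix and \(k\) below are toy values; real applications typically keep \(k\) around a few hundred):

```python
import numpy as np

def svd_word_vectors(C, k=2):
    """Keep the top-k singular values to obtain dense k-dimensional word vectors."""
    U, S, Vt = np.linalg.svd(C.astype(float), full_matrices=False)
    W = U[:, :k] * S[:k]   # each row of W is a k-dimensional representation of one word
    return W

# Toy term-term co-occurrence matrix (e.g. built as in the earlier sketch).
C = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]])
W = svd_word_vectors(C, k=2)
print(W.shape)  # (3, 2): three words, two latent dimensions each
```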

However, SVD does poorly on word analogy tasks, and the matrix is high-dimensional and very sparse. Besides, the computational cost is \(O(V^3)\), which is hard to accept.

Distributed Word Representation

In a non-distributed (local) representation, each possible value has its own unique representation slot, which requires much more memory to process a large database than the distributed approach.

Whereas with the distributed approach, you could store all of that data with just a few memory units:
Vehicle class (1 = Large SUV, 0.1 = Compact, etc.), Brand, Price, Location, and so on.

There are two main ways to get a distributed word representation:

  • Neural Language Model
  • Word Embedding

Neural Language Model

A neural language model is a language model based on neural networks that learns distributed representations of words.

Steps:

  1. Associate words with distributed vectors
  2. Compute the joint probability of word sequences in terms of the feature vectors
  3. Optimize the word feature vectors (the embedding matrix C) and the parameters of the probability function (the last layer, i.e. the map matrix W)
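
A minimal PyTorch sketch of these three steps, in the style of a feed-forward (Bengio-style) neural language model; the vocabulary size, context length, and layer sizes below are assumed toy values:

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Feed-forward neural language model: embed the context, tanh hidden layer, softmax output."""
    def __init__(self, vocab_size, emb_dim=32, context=3, hidden=64):
        super().__init__()
        self.C = nn.Embedding(vocab_size, emb_dim)          # lookup table / embedding matrix C
        self.hidden = nn.Linear(context * emb_dim, hidden)  # non-linear hidden layer
        self.W = nn.Linear(hidden, vocab_size)              # map matrix W (output layer)

    def forward(self, context_ids):              # context_ids: (batch, context)
        x = self.C(context_ids).flatten(1)       # concatenate the context word vectors
        h = torch.tanh(self.hidden(x))
        return self.W(h)                         # logits; softmax is applied inside the loss

# Toy batch with made-up word IDs: predict the next word from 3 context words.
model = NNLM(vocab_size=100)
logits = model(torch.randint(0, 100, (4, 3)))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 100, (4,)))
loss.backward()  # optimizing this loss updates both C and W
```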

But the problem is that the vocabulary size can be very large, perhaps hundreds of thousands of distinct words. This causes:

  • Too many parameters
    • Lookup Table (Embedding Matrix)
    • Map matrix
  • Too much computation
    • Non-linear operation (Tanh)
    • Softmax over all words at every step: suppose there are 100,000 words in the vocabulary; then we need to compute 100,000 conditional probabilities at every step.

Word Embedding

Word2vec uses shallow neural networks that associate words with distributed representations. It can capture many linguistic regularities, such as "king" − "queen" ≈ "man" − "woman".

Word2vec can utilize two architectures to produce distributed representations of words:

  • Continuous bag-of-words (CBOW)

    In the CBOW architecture, the model predicts the target word given a window of surrounding context words, according to the bag-of-words assumption: the order of the context words does not influence the prediction.

    \[\max P\left(w_{c} | w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}\right) \]

  • Continuous skip-gram

    In skip-gram architecture, the model predicts the context words from the target word (predict one context word each step).

    \[\max P\left(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} | w_{c}\right)=\prod_{-m \leq j \leq m,j\neq0} P\left(w_{c+j} | w_{c}\right) \]

Word2vec uses a sliding window of a fixed size moving along a sentence. In each window, the middle word is the target word, other words are the context words.

Given the context words, CBOW predicts the probabilities of the target word, while given a target word, skip-gram predicts the probabilities of the context words.

We use the softmax function to compute the prediction and the cross-entropy loss to measure the difference between the predicted distribution and the ground truth.
Then we optimize both the embedding matrix and the map matrix with an optimization algorithm.
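
In practice, a library such as gensim can train both architectures; the sketch below assumes the gensim 4.x API (where the dimension argument is named `vector_size` and `sg=1` selects skip-gram) and a toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus (assumed); each sentence is a list of tokens.
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["a", "man", "walks"],
             ["a", "woman", "walks"]]

# sg=1 selects skip-gram, sg=0 selects CBOW; window is the sliding-window size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)               # a 50-dimensional dense vector
print(model.wv.most_similar("king", topn=2))
```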

However, computing the softmax over all words at every step involves a huge number of model parameters when the vocabulary is very large, which is computationally impractical.

There are two main improvement methods for word2vec:

  • Negative sampling
  • Hierarchical softmax

Negative sampling

The idea is to update only a small percentage of the weights at every step.

Taking skip-gram as an example, negative sampling aims to differentiate the true context words from sampled noise words, given the target word.

NCE (noise-contrastive estimation) is essentially an approximation of the (log) softmax that converts the \(V\)-way classification problem into a binary classification problem. \(U(w)\) is the word-frequency (unigram) distribution, and we sample \(K\) negative words from \(P_n(w) \propto U(w)^{3/4}\); raising the frequencies to the power 3/4 reduces the impact of overly frequent stop words on sampling. \(K\) is usually chosen between 10 and 200.
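
A NumPy sketch of the negative-sampling loss for a single (target, context) pair; the vocabulary size, counts, embeddings, and \(K\) below are assumed toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_pos, U_neg):
    """-log σ(u_pos·v_c) - Σ_k log σ(-u_k·v_c): push the true context word up, K noise words down."""
    loss = -np.log(sigmoid(u_pos @ v_c))
    loss -= np.log(sigmoid(-U_neg @ v_c)).sum()
    return loss

rng = np.random.default_rng(0)
vocab_size, dim, K = 1000, 50, 10

# Noise distribution P_n(w) proportional to U(w)^(3/4), which damps overly frequent stop words.
unigram_counts = rng.integers(1, 1000, size=vocab_size)
P_n = unigram_counts ** 0.75
P_n = P_n / P_n.sum()

# Sample K negative word IDs and compute the loss for one (target, context) pair.
neg_ids = rng.choice(vocab_size, size=K, p=P_n)
v_c = rng.normal(size=dim)               # target-word ("input") embedding
U = rng.normal(size=(vocab_size, dim))   # context ("output") embedding matrix
print(negative_sampling_loss(v_c, U[42], U[neg_ids]))
```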

Hierarchical softmax

Hierarchical softmax organizes the words of the vocabulary in a Huffman tree built from word frequencies, which reduces the computational complexity of each step from \(O(V)\) to \(O(\log V)\).

The leaves of the tree represent all the words in the vocabulary (\(w_I\) denotes the input word).

The conditional probability \(P(w \mid w_I)\) is computed as a product over the inner nodes on the path from the root to the leaf of the target word: at each inner node we take the sigmoid (a two-way softmax) of the dot product between that node's embedding and the input word's embedding. The dot product is multiplied by \(+1\) if the path goes left at that node (the blue part of the formula on the slides) and by \(-1\) otherwise.
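
A sketch of this path product (the path length, signs, and inner-node vectors are assumed toy inputs; building the actual Huffman tree is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(v_in, node_vecs, signs):
    """P(w | w_I) as a product of sigmoids along the root-to-leaf path.

    signs[i] is +1 if the path goes left at inner node i, and -1 if it goes right.
    """
    p = 1.0
    for u, s in zip(node_vecs, signs):
        p *= sigmoid(s * (u @ v_in))
    return p

rng = np.random.default_rng(0)
dim = 50
v_in = rng.normal(size=dim)          # embedding of the input word w_I
path = rng.normal(size=(3, dim))     # inner-node vectors on a depth-3 path (~log V nodes)
print(hierarchical_softmax_prob(v_in, path, signs=[+1, -1, +1]))
```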

Beyond Word2vec

Subword Information: FastText

Word2vec assigns a distinct vector to each word, which ignores the internal structure of words and performs badly when the vocabulary is large and some words are rare.

Therefore, FastText uses subwords to build word embeddings. It first represents each word as a bag of character n-grams (e.g. the 3-gram subwords of "where" are { <wh, whe, her, ere, re> }), then computes the word embedding as the sum of its n-gram embeddings.

\[v_{w}=\sum_{g \in \mathcal{G}_{w}} z_{g} \]

This shares representations across words, but it can also lead to increased complexity due to the larger n-gram vocabulary, which is why hierarchical softmax is needed.
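
A sketch of the subword idea with character 3-grams only; the n-gram embedding table below is random and purely illustrative (real FastText hashes n-grams into a fixed-size table):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers, plus the whole word."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]

rng = np.random.default_rng(0)
dim = 50
# Toy n-gram embedding table z_g (assumed random values).
z = {g: rng.normal(size=dim) for g in char_ngrams("where")}

def word_vector(word):
    """v_w is the sum of the embeddings of the word's n-grams."""
    return np.sum([z[g] for g in char_ngrams(word)], axis=0)

print(char_ngrams("where"))        # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
print(word_vector("where").shape)  # (50,)
```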

Global Word Embedding: GloVe

All the models above only consider the words within a sliding window and ignore global statistical information, i.e. the co-occurrence matrix \(X\).

Thus the purpose of GloVe is to produce a vector space with meaningful substructure while making use of global statistical information.

The original (normalized) similarity between two words \(i\) and \(j\) that appear in each other's context is

\[\widehat{Q}_{i j}=\frac{\exp \left(u_{j}^{T} v_{i}\right)}{\sum_{w=1}^{W} \exp \left(u_{w}^{T} v_{i}\right)} \]

Because the denominator is computationally expensive, we discard the normalization and use the approximation

\[\widehat{Q}_{i j}={\exp \left(u_{j}^{T} v_{i}\right)} \]

The original least-squares objective is

\[\hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} \left(\hat{Q}_{i j}-X_{i j}\right)^{2} \]

GloVe transforms it into a weighted version and takes the logarithm:

\[\hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} f\left(X_{i j}\right)\left(\log \widehat{Q}_{i j}-\log X_{i j}\right)^{2} \]

where the weighting factor \(f(X_{ij})\) is the co-occurrence count \(X_{ij}\) itself. The reason for taking the logarithm is that \(X_{ij}\) is usually large, which makes the squared error difficult to optimize.
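
A NumPy sketch of this weighted objective, following the simplified formula above with \(f(X_{ij}) = X_{ij}\) (the released GloVe model additionally uses bias terms and a capped weighting function); the co-occurrence counts and vectors are toy values:

```python
import numpy as np

def glove_loss(V, U, X):
    """Weighted least squares: sum_ij f(X_ij) * (u_j·v_i - log X_ij)^2 over nonzero counts."""
    loss = 0.0
    rows, cols = np.nonzero(X)          # only co-occurring pairs contribute
    for i, j in zip(rows, cols):
        f_ij = X[i, j]                  # weighting by the raw count, as in the text
        loss += f_ij * (U[j] @ V[i] - np.log(X[i, j])) ** 2
    return loss

rng = np.random.default_rng(0)
W, dim = 5, 10                              # toy vocabulary size and embedding dimension
X = rng.integers(0, 4, size=(W, W))         # toy co-occurrence counts (assumed)
V = rng.normal(scale=0.1, size=(W, dim))    # "center" word vectors v_i
U = rng.normal(scale=0.1, size=(W, dim))    # "context" word vectors u_j
print(glove_loss(V, U, X))                  # gradient descent on this loss learns the embeddings
```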
