CS224n - Language Models

Language Model

Language Modeling is the task of predicting what word comes next. More formally: given a sequence of words \(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(t)}\), compute the probability distribution of the next word \(\boldsymbol{x}^{(t+1)}\) :

\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right) \]

You can also think of a Language Model as a system that assigns probability to a piece of text. For example, if we have some text \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\), then the
probability of this text (according to the Language Model) is:

\[\begin{aligned} P\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\right) &=P\left(\boldsymbol{x}^{(1)}\right) \times P\left(\boldsymbol{x}^{(2)} | \boldsymbol{x}^{(1)}\right) \times \cdots \times P\left(\boldsymbol{x}^{(T)} | \boldsymbol{x}^{(T-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \\ &=\prod_{t=1}^{T} P\left(\boldsymbol{x}^{(t)} | \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \end{aligned} \]
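As a minimal illustration of this decomposition (a sketch, assuming a hypothetical `cond_prob(word, history)` function supplied by some Language Model), the probability of a text can be scored by accumulating log-probabilities to avoid numerical underflow:

```python
import math

def score_text(words, cond_prob):
    """Log-probability of a word sequence under the chain-rule decomposition.

    cond_prob(word, history) -> P(word | history) is a hypothetical interface
    to whatever Language Model is being used.
    """
    log_p = 0.0
    for t, word in enumerate(words):
        log_p += math.log(cond_prob(word, words[:t]))
    return log_p  # P(text) = exp(log_p)
```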

n-gram Language Models

Question: How to learn a Language Model?

Answer: learn an n-gram Language Model!

Definition: An n-gram is a chunk of n consecutive words.

Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.

First we make a simplifying assumption: \(\boldsymbol{x}^{(t+1)}\) depends only on the preceding n-1 words. The conditional probability is then:

\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)=P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)=\frac{{P\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}}{{P\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}} \]

Question: How do we get these n-gram and (n-1)-gram probabilities?

Answer: By counting them in some large corpus of text!

\[\approx \frac{\operatorname{count}\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}{\operatorname{count}\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)} \]
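A minimal sketch of this count-based estimate (a toy example, not the lecture's code; the corpus and the `train_ngram_lm` / `prob` names are made up for illustration):

```python
from collections import Counter

def train_ngram_lm(corpus, n):
    """Count all n-grams and (n-1)-gram contexts in a tokenized corpus (a list of words)."""
    ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    contexts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))
    return ngrams, contexts

def prob(word, context, ngrams, contexts):
    """MLE estimate: count(context + word) / count(context)."""
    return ngrams[tuple(context) + (word,)] / contexts[tuple(context)]

# Toy usage: a 3-gram model over a tiny corpus
corpus = "the students opened their books as the students opened their minds".split()
ngrams, contexts = train_ngram_lm(corpus, n=3)
print(prob("their", ("students", "opened"), ngrams, contexts))  # 1.0
```

If the context was never seen, the denominator is zero, and if the full n-gram was never seen, the numerator is zero: exactly the sparsity problems discussed next.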

Sparsity Problems with n-gram Language Models:

  • What if “students opened their w” never occurred in the data? Then w has probability 0! Solution: add a small δ to the count for every w in the vocabulary. This is called smoothing (see the sketch after this list).
  • What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w! Solution: just condition on “opened their” instead. This is called backoff.
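A sketch of add-δ smoothing plugged into the counts from the sketch above (the function name and δ value are made up; backoff would instead fall back to counts over shorter contexts, not shown here):

```python
def smoothed_prob(word, context, ngrams, contexts, vocab, delta=0.1):
    """Add-delta smoothing: every word in the vocabulary gets a small pseudo-count,
    so an unseen (context, word) pair no longer gets probability 0."""
    context = tuple(context)
    return (ngrams[context + (word,)] + delta) / (contexts[context] + delta * len(vocab))

vocab = set(corpus)  # reuses corpus, ngrams, contexts from the sketch above
print(smoothed_prob("books", ("students", "opened"), ngrams, contexts, vocab))  # small but non-zero
```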

A fixed-window neural Language Model

[Figure: a fixed-window neural Language Model]

Advantages:

  • No sparsity problem
  • Don’t need to store all observed n-grams

Problems:

  • Fixed window is too small;
  • Enlarging the window enlarges W;
  • The window can never be large enough!
  • \(x^{(1)}\) and \(x^{(2)}\) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed (made concrete in the sketch below).
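A minimal numpy sketch of a fixed-window neural LM (all dimensions are made up for illustration). It also makes the last two problems concrete: W grows with the window size, and each window position multiplies its own slice of W:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, window = 1000, 64, 128, 4          # vocab size, embed dim, hidden dim, n-1

E  = rng.normal(0, 0.1, (V, d))             # embedding matrix
W  = rng.normal(0, 0.1, (window * d, h))    # grows linearly with the window size
b1 = np.zeros(h)
U  = rng.normal(0, 0.1, (h, V))
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next(word_ids):
    """word_ids: exactly `window` previous word indices (fixed length)."""
    e = np.concatenate([E[i] for i in word_ids])   # concatenated embeddings
    hid = np.tanh(e @ W + b1)                      # each position hits a different slice of W
    return softmax(hid @ U + b2)                   # distribution over the vocabulary

y_hat = predict_next([5, 42, 7, 99])               # P(next word | 4-word window)
```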

Note: We need a neural architecture that can process input of any length.

Recurrent Neural Networks (RNNs)

[Figure: a Recurrent Neural Network unrolled over time]

**Core idea**: apply the same weights W repeatedly.

[Figure: an RNN Language Model]
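For reference, the RNN-LM equations in their standard formulation (word embedding \(\boldsymbol{e}^{(t)}=\boldsymbol{E} \boldsymbol{x}^{(t)}\), some initial hidden state \(\boldsymbol{h}^{(0)}\), and the same \(\boldsymbol{W}_{h}\), \(\boldsymbol{W}_{e}\) applied at every timestep):

\[\begin{aligned} \boldsymbol{h}^{(t)} &=\sigma\left(\boldsymbol{W}_{h} \boldsymbol{h}^{(t-1)}+\boldsymbol{W}_{e} \boldsymbol{e}^{(t)}+\boldsymbol{b}_{1}\right) \\ \hat{\boldsymbol{y}}^{(t)} &=\operatorname{softmax}\left(\boldsymbol{U} \boldsymbol{h}^{(t)}+\boldsymbol{b}_{2}\right) \end{aligned} \]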

Advantages:

  • Can process any length input;
  • Computation for step t can (in theory) use information from many steps back;
  • Model size doesn’t increase for longer input;
  • Same weights applied on every timestep, so there is symmetry in how inputs are processed.

Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back
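A minimal numpy sketch of the forward pass (dimensions made up for illustration), applying the same \(\boldsymbol{W}_{h}\) and \(\boldsymbol{W}_{e}\) at every step so it handles input of any length:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 64, 128                        # vocab size, embed dim, hidden dim

E   = rng.normal(0, 0.1, (V, d))
W_h = rng.normal(0, 0.1, (h, h))
W_e = rng.normal(0, 0.1, (d, h))
b1  = np.zeros(h)
U   = rng.normal(0, 0.1, (h, V))
b2  = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_forward(word_ids):
    """Return the predicted next-word distribution y_hat^(t) at every timestep."""
    h_t = np.zeros(h)                          # initial hidden state h^(0)
    y_hats = []
    for i in word_ids:                         # same weights applied at every step
        e_t = E[i]                             # e^(t) = E x^(t)
        h_t = np.tanh(h_t @ W_h + e_t @ W_e + b1)
        y_hats.append(softmax(h_t @ U + b2))
    return y_hats

y_hats = rnn_lm_forward([5, 42, 7, 99, 3])     # works for a sequence of any length
```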

Training an RNN Language Model

  • Get a big corpus of text which is a sequence of words
  • Feed into RNN-LM; compute output distribution \(\hat{\boldsymbol{y}}^{(t)}\) for every step t;
  • Loss function on step t is the cross-entropy between the predicted probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) and the true next word \(\boldsymbol{y}^{(t)}\) (one-hot for \(\boldsymbol{x}^{(t+1)}\)):

\[J^{(t)}(\theta)=C E\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)=-\sum_{w \in V} \boldsymbol{y}_{w}^{(t)} \log \hat{\boldsymbol{y}}_{w}^{(t)}=-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]

Average this to get the overall loss for the entire training set:

\[J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)=\frac{1}{T} \sum_{t=1}^{T}-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
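A sketch of this loss given the per-step distributions from the RNN sketch above (targets are the inputs shifted by one position; the names are illustrative):

```python
import numpy as np

def sequence_loss(y_hats, targets):
    """Average cross-entropy: J = -(1/T) * sum_t log y_hat^(t)[x^(t+1)]."""
    return -np.mean([np.log(y_hat[x_next]) for y_hat, x_next in zip(y_hats, targets)])

# Usage with the RNN sketch above:
# words = [5, 42, 7, 99, 3]
# loss = sequence_loss(rnn_lm_forward(words[:-1]), words[1:])
```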

Backpropagation for RNNs
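Key point (backpropagation through time): because the same \(\boldsymbol{W}_{h}\) is applied at every timestep, the gradient of the step-\(t\) loss with respect to \(\boldsymbol{W}_{h}\) is the sum of its gradients at each timestep where it appears; in practice these gradients are accumulated while stepping backwards through time:

\[\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}=\sum_{i=1}^{t}\left.\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}\right|_{(i)} \]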

Evaluating Language Models

[Figure: evaluating Language Models]
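The standard evaluation metric for Language Models is perplexity: the inverse probability of the corpus, normalized by the number of words, which equals the exponential of the cross-entropy loss \(J(\theta)\) (lower is better):

\[\text{perplexity}=\prod_{t=1}^{T}\left(\frac{1}{P_{\mathrm{LM}}\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)}\right)^{1 / T}=\exp (J(\theta)) \]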

Why should we care about Language Modeling?

  • Language Modeling is a benchmark task that helps us measure our progress on understanding language
  • Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text