CS224n - Language Models

Language Model

Language Modeling is the task of predicting what word comes next. More formally: given a sequence of words \(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(t)}\), compute the probability distribution of the next word \(\boldsymbol{x}^{(t+1)}\) :

\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right) \]

You can also think of a Language Model as a system that assigns probability to a piece of text. For example, if we have some text \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\), then the
probability of this text (according to the Language Model) is:

\[\begin{aligned} P\left(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\right) &=P\left(\boldsymbol{x}^{(1)}\right) \times P\left(\boldsymbol{x}^{(2)} | \boldsymbol{x}^{(1)}\right) \times \cdots \times P\left(\boldsymbol{x}^{(T)} | \boldsymbol{x}^{(T-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \\ &=\prod_{t=1}^{T} P\left(\boldsymbol{x}^{(t)} | \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)}\right) \end{aligned} \]
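As a minimal illustration of this decomposition (a sketch, assuming a hypothetical `cond_prob(word, history)` function supplied by some Language Model), the probability of a text can be scored by accumulating log-probabilities to avoid numerical underflow:

```python
import math

def score_text(words, cond_prob):
    """Log-probability of a word sequence under the chain-rule decomposition.

    cond_prob(word, history) -> P(word | history) is a hypothetical interface
    to whatever Language Model is being used.
    """
    log_p = 0.0
    for t, word in enumerate(words):
        log_p += math.log(cond_prob(word, words[:t]))
    return log_p  # P(text) = exp(log_p)
```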

n-gram Language Models

Question: How to learn a Language Model?

Answer: learn an n-gram Language Model!

Definition: An n-gram is a chunk of n consecutive words.

Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.

First we make a simplifying assumption: \(\boldsymbol{x}^{(t+1)}\) depends only on the preceding n-1 words. The conditional probability is then:

\[P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)=P\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)=\frac{{P\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}}{{P\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}} \]

Question: How do we get these n-gram and (n-1)-gram probabilities?

Answer: By counting them in some large corpus of text!

\[\approx \frac{\operatorname{count}\left(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)}{\operatorname{count}\left(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\right)} \]
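A minimal sketch of this count-based estimate (a toy example, not the lecture's code; the corpus and the `train_ngram_lm` / `prob` names are made up for illustration):

```python
from collections import Counter

def train_ngram_lm(corpus, n):
    """Count all n-grams and (n-1)-gram contexts in a tokenized corpus (a list of words)."""
    ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    contexts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))
    return ngrams, contexts

def prob(word, context, ngrams, contexts):
    """MLE estimate: count(context + word) / count(context)."""
    return ngrams[tuple(context) + (word,)] / contexts[tuple(context)]

# Toy usage: a 3-gram model over a tiny corpus
corpus = "the students opened their books as the students opened their minds".split()
ngrams, contexts = train_ngram_lm(corpus, n=3)
print(prob("their", ("students", "opened"), ngrams, contexts))  # 1.0
```

If the context was never seen, the denominator is zero, and if the full n-gram was never seen, the numerator is zero: exactly the sparsity problems discussed next.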

Sparsity Problems with n-gram Language Models:

  • What if “students opened their w” never occurred in the data? Then w has probability 0! Solution: add a small δ to the count for every w in the vocabulary. This is called smoothing (see the sketch after this list).
  • What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w! Solution: just condition on “opened their” instead. This is called backoff.
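A sketch of add-δ smoothing plugged into the counts from the sketch above (the function name and δ value are made up; backoff would instead fall back to counts over shorter contexts, not shown here):

```python
def smoothed_prob(word, context, ngrams, contexts, vocab, delta=0.1):
    """Add-delta smoothing: every word in the vocabulary gets a small pseudo-count,
    so an unseen (context, word) pair no longer gets probability 0."""
    context = tuple(context)
    return (ngrams[context + (word,)] + delta) / (contexts[context] + delta * len(vocab))

vocab = set(corpus)  # reuses corpus, ngrams, contexts from the sketch above
print(smoothed_prob("books", ("students", "opened"), ngrams, contexts, vocab))  # small but non-zero
```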

A fixed-window neural Language Model

[Figure: a fixed-window neural Language Model]

Advantages:

  • No sparsity problem
  • Don’t need to store all observed n-grams

Problems:

  • Fixed window is too small;
  • Enlarging the window enlarges W;
  • The window can never be large enough!
  • \(x^{(1)}\) and \(x^{(2)}\) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed (made concrete in the sketch below).
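A minimal numpy sketch of a fixed-window neural LM (all dimensions are made up for illustration). It also makes the last two problems concrete: W grows with the window size, and each window position multiplies its own slice of W:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h, window = 1000, 64, 128, 4          # vocab size, embed dim, hidden dim, n-1

E  = rng.normal(0, 0.1, (V, d))             # embedding matrix
W  = rng.normal(0, 0.1, (window * d, h))    # grows linearly with the window size
b1 = np.zeros(h)
U  = rng.normal(0, 0.1, (h, V))
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_next(word_ids):
    """word_ids: exactly `window` previous word indices (fixed length)."""
    e = np.concatenate([E[i] for i in word_ids])   # concatenated embeddings
    hid = np.tanh(e @ W + b1)                      # each position hits a different slice of W
    return softmax(hid @ U + b2)                   # distribution over the vocabulary

y_hat = predict_next([5, 42, 7, 99])               # P(next word | 4-word window)
```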

Note: We need a neural architecture that can process input of any length.

Recurrent Neural Networks (RNNs)

[Figure: a Recurrent Neural Network unrolled over time]

**Core idea**: apply the same weights W repeatedly.

[Figure: an RNN Language Model]
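For reference, the RNN-LM equations in their standard formulation (word embedding \(\boldsymbol{e}^{(t)}=\boldsymbol{E} \boldsymbol{x}^{(t)}\), some initial hidden state \(\boldsymbol{h}^{(0)}\), and the same \(\boldsymbol{W}_{h}\), \(\boldsymbol{W}_{e}\) applied at every timestep):

\[\begin{aligned} \boldsymbol{h}^{(t)} &=\sigma\left(\boldsymbol{W}_{h} \boldsymbol{h}^{(t-1)}+\boldsymbol{W}_{e} \boldsymbol{e}^{(t)}+\boldsymbol{b}_{1}\right) \\ \hat{\boldsymbol{y}}^{(t)} &=\operatorname{softmax}\left(\boldsymbol{U} \boldsymbol{h}^{(t)}+\boldsymbol{b}_{2}\right) \end{aligned} \]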

Advantages:

  • Can process any length input;
  • Computation for step t can (in theory) use information from many steps back;
  • Model size doesn’t increase for longer input;
  • Same weights applied on every timestep, so there is symmetry in how inputs are processed.

Disadvantages:

  • Recurrent computation is slow
  • In practice, difficult to access information from many steps back
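A minimal numpy sketch of the forward pass (dimensions made up for illustration), applying the same \(\boldsymbol{W}_{h}\) and \(\boldsymbol{W}_{e}\) at every step so it handles input of any length:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 1000, 64, 128                        # vocab size, embed dim, hidden dim

E   = rng.normal(0, 0.1, (V, d))
W_h = rng.normal(0, 0.1, (h, h))
W_e = rng.normal(0, 0.1, (d, h))
b1  = np.zeros(h)
U   = rng.normal(0, 0.1, (h, V))
b2  = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_forward(word_ids):
    """Return the predicted next-word distribution y_hat^(t) at every timestep."""
    h_t = np.zeros(h)                          # initial hidden state h^(0)
    y_hats = []
    for i in word_ids:                         # same weights applied at every step
        e_t = E[i]                             # e^(t) = E x^(t)
        h_t = np.tanh(h_t @ W_h + e_t @ W_e + b1)
        y_hats.append(softmax(h_t @ U + b2))
    return y_hats

y_hats = rnn_lm_forward([5, 42, 7, 99, 3])     # works for a sequence of any length
```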

Training an RNN Language Model

  • Get a big corpus of text which is a sequence of words
  • Feed into RNN-LM; compute output distribution \(\hat{\boldsymbol{y}}^{(t)}\) for every step t;
  • Loss function on step t is the cross-entropy between the predicted probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) and the true next word \(\boldsymbol{y}^{(t)}\) (one-hot for \(\boldsymbol{x}^{(t+1)}\)):

\[J^{(t)}(\theta)=C E\left(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\right)=-\sum_{w \in V} \boldsymbol{y}_{w}^{(t)} \log \hat{\boldsymbol{y}}_{w}^{(t)}=-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]

Average this to get the overall loss for the entire training set:

\[J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)=\frac{1}{T} \sum_{t=1}^{T}-\log \hat{\boldsymbol{y}}_{x_{t+1}}^{(t)} \]
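A sketch of this loss given the per-step distributions from the RNN sketch above (targets are the inputs shifted by one position; the names are illustrative):

```python
import numpy as np

def sequence_loss(y_hats, targets):
    """Average cross-entropy: J = -(1/T) * sum_t log y_hat^(t)[x^(t+1)]."""
    return -np.mean([np.log(y_hat[x_next]) for y_hat, x_next in zip(y_hats, targets)])

# Usage with the RNN sketch above:
# words = [5, 42, 7, 99, 3]
# loss = sequence_loss(rnn_lm_forward(words[:-1]), words[1:])
```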

Backpropagation for RNNs
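Key point (backpropagation through time): because the same \(\boldsymbol{W}_{h}\) is applied at every timestep, the gradient of the step-\(t\) loss with respect to \(\boldsymbol{W}_{h}\) is the sum of its gradients at each timestep where it appears; in practice these gradients are accumulated while stepping backwards through time:

\[\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}=\sum_{i=1}^{t}\left.\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_{h}}\right|_{(i)} \]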

Evaluating Language Models

[Figure: evaluating Language Models]
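The standard evaluation metric for Language Models is perplexity: the inverse probability of the corpus, normalized by the number of words, which equals the exponential of the cross-entropy loss \(J(\theta)\) (lower is better):

\[\text{perplexity}=\prod_{t=1}^{T}\left(\frac{1}{P_{\mathrm{LM}}\left(\boldsymbol{x}^{(t+1)} | \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\right)}\right)^{1 / T}=\exp (J(\theta)) \]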

Why should we care about Language Modeling?

  • Language Modeling is a benchmark task that helps us measure our progress on understanding language
  • Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text