CS224n: Language Models
Language Model
Language Modeling is the task of predicting what word comes next. More formally: given a sequence of words \(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(t)}\), compute the probability distribution of the next word \(\boldsymbol{x}^{(t+1)}\):

\[P\big(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\big)\]

where \(\boldsymbol{x}^{(t+1)}\) can be any word in the vocabulary \(V\).
You can also think of a Language Model as a system that assigns a probability to a piece of text. For example, if we have some text \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\), then the probability of this text (according to the Language Model) is:

\[P\big(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\big) = \prod_{t=1}^{T} P\big(\boldsymbol{x}^{(t)} \mid \boldsymbol{x}^{(t-1)}, \ldots, \boldsymbol{x}^{(1)}\big)\]
n-gram Language Models
Question: How to learn a Language Model?
Answer: learn an n-gram Language Model!
Definition: an n-gram is a chunk of n consecutive words.
Idea: Collect statistics about how frequent different n-grams are, and use these to predict the next word.
First we make a simplifying (Markov) assumption: \(\boldsymbol{x}^{(t+1)}\) depends only on the preceding n−1 words. The conditional probability is then:

\[P\big(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\big) \approx P\big(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\big) = \frac{P\big(\boldsymbol{x}^{(t+1)}, \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\big)}{P\big(\boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(t-n+2)}\big)}\]
Question: How do we get these n-gram and (n-1)-gram probabilities?
Answer: By counting them in some large corpus of text!
Sparsity Problems with n-gram Language Models:
- What if “students opened their w” never occurred in the data? Then w has probability 0! Partial solution: add a small 𝛿 to the count for every w in the vocabulary. This is called smoothing.
- What if “students opened their” never occurred in the data? Then we can’t calculate the probability for any w! Partial solution: condition on “opened their” instead. This is called backoff.
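The counting and smoothing steps above can be sketched in a few lines. This is a minimal illustration on a toy corpus; the function names (`train_ngram_lm`, `prob`) and the add-𝛿 formulation are my own assumptions, not from the lecture.

```python
from collections import Counter

def train_ngram_lm(tokens, n=3):
    """Count n-grams and their (n-1)-gram prefixes in a corpus."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, prefixes

def prob(word, prefix, ngrams, prefixes, vocab_size, delta=1.0):
    """Add-delta smoothed estimate of P(word | prefix).

    Smoothing: every count gets +delta, so unseen n-grams never
    receive probability 0 (the first sparsity problem above).
    """
    num = ngrams[prefix + (word,)] + delta
    den = prefixes[prefix] + delta * vocab_size
    return num / den

corpus = "the students opened their books the students opened their minds".split()
ngrams, prefixes = train_ngram_lm(corpus, n=3)
vocab = set(corpus)  # 6 distinct words

# "opened their" occurred twice, "opened their books" once:
# (1 + 1) / (2 + 1 * 6) = 0.25
p = prob("books", ("opened", "their"), ngrams, prefixes, len(vocab))
```

Backoff would be a second branch: if `prefixes[prefix]` is 0, recurse with the shorter prefix `prefix[1:]`.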
A fixed-window neural Language Model
Advantages:
- No sparsity problem
- Don’t need to store all observed n-grams
Problems:
- Fixed window is too small;
- Enlarging the window enlarges W;
- The window can never be large enough!
- \(x^{(1)}\) and \(x^{(2)}\) are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed.
Note: we need a neural architecture that can process input of any length.
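To make the asymmetry concrete: in a fixed-window model the window's embeddings are concatenated, so each position hits a different slice of W. A minimal sketch with toy sizes (all dimensions and names here are illustrative assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, window, hidden = 10, 4, 3, 8

E = rng.normal(size=(vocab_size, embed_dim))       # embedding matrix
W = rng.normal(size=(hidden, window * embed_dim))  # grows with the window size!
U = rng.normal(size=(vocab_size, hidden))          # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fixed_window_lm(word_ids):
    """P(next word | previous `window` words).

    Concatenation means E[word_ids[0]] meets only the first
    embed_dim columns of W, E[word_ids[1]] the next embed_dim,
    etc. -- completely different weights per position.
    """
    e = np.concatenate([E[i] for i in word_ids])   # (window * embed_dim,)
    h = np.tanh(W @ e)                             # hidden layer
    return softmax(U @ h)                          # distribution over vocab

y_hat = fixed_window_lm([1, 4, 7])
```

The shape of W makes both problems visible: a larger window directly widens W, and a word that moves one slot over is processed by entirely different columns.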
Recurrent Neural Networks (RNNs)
Advantages:
- Can process input of any length;
- Computation for step t can (in theory) use information from many steps back;
- Model size doesn’t increase for longer input;
- The same weights are applied at every timestep, so there is symmetry in how inputs are processed.
Disadvantages:
- Recurrent computation is slow
- In practice, difficult to access information from many steps back
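The recurrence behind these properties can be sketched as follows; the weight names follow the usual \(W_h, W_e\) convention, but all sizes and function names are my own toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, embed_dim, hidden = 10, 4, 8

E  = rng.normal(size=(vocab_size, embed_dim))
Wh = rng.normal(size=(hidden, hidden))     # reused at EVERY timestep
We = rng.normal(size=(hidden, embed_dim))  # reused at EVERY timestep
U  = rng.normal(size=(vocab_size, hidden))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    """Return the output distribution ŷ^(t) for every step t.

    The loop works for any sequence length, and the parameter
    count (Wh, We, U, E) does not change with it.
    """
    h = np.zeros(hidden)                   # initial hidden state h^(0)
    outputs = []
    for i in word_ids:
        h = np.tanh(Wh @ h + We @ E[i])    # h^(t) = tanh(Wh h^(t-1) + We e^(t))
        outputs.append(softmax(U @ h))     # ŷ^(t) = softmax(U h^(t))
    return outputs

ys = rnn_lm([1, 4, 7, 2, 9])
```

The sequential loop is also why recurrent computation is slow: step t cannot start before step t−1 finishes.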
Training an RNN Language Model
- Get a big corpus of text, which is a sequence of words \(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(T)}\);
- Feed it into the RNN-LM and compute the output distribution \(\hat{\boldsymbol{y}}^{(t)}\) for every step t;
- The loss function on step t is the cross-entropy between the predicted probability distribution \(\hat{\boldsymbol{y}}^{(t)}\) and the true next word \(\boldsymbol{y}^{(t)}\) (the one-hot vector for \(\boldsymbol{x}^{(t+1)}\)):

\[J^{(t)}(\theta) = \mathrm{CE}\big(\boldsymbol{y}^{(t)}, \hat{\boldsymbol{y}}^{(t)}\big) = -\sum_{w \in V} \boldsymbol{y}_w^{(t)} \log \hat{\boldsymbol{y}}_w^{(t)} = -\log \hat{\boldsymbol{y}}_{\boldsymbol{x}_{t+1}}^{(t)}\]

- Average this to get the overall loss for the entire training set:

\[J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)\]
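Since each \(\boldsymbol{y}^{(t)}\) is one-hot, the per-step cross-entropy collapses to the negative log-probability the model assigned to the true next word. A minimal sketch (the function name and toy numbers are illustrative assumptions):

```python
import math

def sequence_loss(pred_dists, targets):
    """Average cross-entropy: J(theta) = (1/T) sum_t -log yhat^(t)[x^(t+1)].

    pred_dists[t] is the model's distribution over the vocabulary at
    step t; targets[t] is the index of the true next word.
    """
    losses = [-math.log(dist[t]) for dist, t in zip(pred_dists, targets)]
    return sum(losses) / len(losses)

# Toy check: two steps, distributions over a 3-word vocabulary.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]  # the true next word got 0.7, then 0.8
loss = sequence_loss(dists, targets)  # (-ln 0.7 - ln 0.8) / 2 ≈ 0.2899
```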
Backpropagation for RNNs
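The key point, as covered in the lecture, is the multivariable chain rule applied to a repeated weight: since the same \(\boldsymbol{W}_h\) is used at every timestep, the gradient of the step-t loss with respect to it is the sum of its gradients at each timestep where it appears:

```latex
\frac{\partial J^{(t)}}{\partial \boldsymbol{W}_h}
  = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial \boldsymbol{W}_h} \right|_{(i)}
```

In practice this sum is accumulated while propagating backwards through the timesteps, which is why the algorithm is called "backpropagation through time".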
Evaluating Language Models
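The standard evaluation metric for Language Models is perplexity: the inverse probability of the corpus, normalized by the number of words. It equals the exponential of the cross-entropy loss \(J(\theta)\), so lower is better:

```latex
\text{perplexity}
  = \prod_{t=1}^{T} \left( \frac{1}{P_{\text{LM}}\big(\boldsymbol{x}^{(t+1)} \mid \boldsymbol{x}^{(t)}, \ldots, \boldsymbol{x}^{(1)}\big)} \right)^{1/T}
  = \exp\!\big(J(\theta)\big)
```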
Why should we care about Language Modeling?
- Language Modeling is a benchmark task that helps us measure our progress on understanding language
- Language Modeling is a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text