CS224n, lec 10, NMT & Seq2Seq Attn
SMT
- Core idea: learn a probabilistic model from data.
- Find the best English sentence y given the French sentence x by Bayes' rule: argmax_y P(y|x) = argmax_y P(x|y) P(y).
- P(x|y) is the translation model; P(y) is the language model (LM).
- The translation model is learned by doing statistics on an aligned parallel corpus.
- Lots of hand-crafted rules and feature engineering!
NMT
The basic NMT model is end-to-end: an encoder-decoder architecture that works as a conditional language model, with beam search used at decoding time to find a high-probability translation. Compared with SMT it is more fluent and needs far less hand-crafting, but it is less interpretable and harder to control (e.g., you can't just add rules). A toy beam-search sketch follows below.
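To make the beam-search part concrete, here is a minimal, self-contained Python sketch. The tiny vocabulary and the step_logprobs function are made-up stand-ins for a real decoder; only the keep-the-top-k pruning logic is the point.

```python
import math

# Toy stand-in for a real NMT decoder: a tiny vocabulary and a fake
# next-token log-probability function (both are made-up assumptions).
VOCAB = ["<eos>", "the", "cat", "sat"]

def step_logprobs(prefix):
    """Return log P(token | prefix) for every token in VOCAB.
    A real decoder would condition on the encoder states and the prefix;
    here the distribution just shifts toward <eos> as the prefix grows."""
    probs = [0.1, 0.4, 0.3, 0.2] if len(prefix) < 3 else [0.7, 0.1, 0.1, 0.1]
    return [math.log(p) for p in probs]

def beam_search(beam_size=2, max_len=6):
    beams = [([], 0.0)]          # each hypothesis: (tokens, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in zip(VOCAB, step_logprobs(tokens)):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))   # hypothesis is complete
            elif len(beams) < beam_size:
                beams.append((tokens, score))      # keep top partial hypotheses
            if len(beams) == beam_size:
                break
        if not beams:
            break
    # Real systems length-normalize the scores; skipped here for brevity.
    return max(finished + beams, key=lambda c: c[1])

print(beam_search())
```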
Evaluation
BLEU compares the machine-written translation to one or several human-written reference translation(s) and computes a similarity score based on n-gram precision (plus a brevity penalty for translations that are too short). Useful but imperfect: a good translation can still get a low BLEU score if it happens to share few n-grams with the references.
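A toy sentence-level sketch of the n-gram precision idea (real BLEU is computed at the corpus level and with smoothing; the example sentences are made up):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the closest reference.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * geo_mean

cand = "the quick brown fox jumps over the dog".split()
refs = ["the quick brown fox jumped over the lazy dog".split()]
print(bleu(cand, refs))
```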
ATTENTION!
Seq2Seq models work well, but they compress the entire source sentence into a single encoder vector, which becomes an information bottleneck. Attention is the structural change that tackles this problem.
Concretely, at each decoder step we score every encoder hidden state by taking its dot product with the current decoder hidden state, turn the scores into weights with a softmax, form the weighted sum of the encoder hidden states (the attention output), and concatenate that sum with the decoder hidden state before computing the output word, just as in the plain seq2seq model.
Formally, with encoder hidden states h_1, …, h_N and decoder hidden state s_t at step t:
e^t = [s_t^T h_1, …, s_t^T h_N]   (attention scores)
α^t = softmax(e^t)                (attention distribution)
a_t = Σ_i α_i^t h_i               (attention output)
then [a_t; s_t] is used to predict the output word.
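A minimal NumPy sketch of one attention step, matching the formulas above; the random toy states and dimensions are made-up assumptions for illustration:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: shape (d,); encoder_states: shape (N, d).
    Returns the attention distribution alpha^t and the vector [a_t; s_t]."""
    scores = encoder_states @ decoder_state    # e^t: dot-product scores, shape (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # alpha^t: softmax over scores
    attn_output = weights @ encoder_states     # a_t: weighted sum of encoder states, shape (d,)
    return weights, np.concatenate([attn_output, decoder_state])

# Toy example: N = 5 encoder positions, hidden size d = 8 (made-up values).
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))   # encoder hidden states
s = rng.normal(size=8)        # current decoder hidden state
alpha, combined = dot_product_attention(s, h)
print(alpha.round(3), combined.shape)  # weights sum to 1; combined has shape (16,)
```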
The attention mechanism has a variety of pros:
- solves the information bottleneck problem
- helps with the vanishing gradient problem: it adds a highway-like gradient path straight back to every encoder state
- provides some interpretability: remember image captioning? The attention weights give a rough picture of what the model is attending to. Another example: we get (soft) word alignment over the parallel corpus for free (toy example after this list)! Very cool!
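To make the free-alignment point concrete: stacking the α^t vectors from every decoder step gives a target-by-source matrix whose peaks behave like word alignments. All numbers and tokens below are made up for illustration:

```python
import numpy as np

# Hypothetical attention weights from decoding a 4-word English sentence
# given a 5-word French source: row t is alpha^t over source positions.
alpha = np.array([
    [0.80, 0.10, 0.05, 0.03, 0.02],
    [0.05, 0.70, 0.15, 0.05, 0.05],
    [0.02, 0.08, 0.60, 0.20, 0.10],
    [0.03, 0.02, 0.05, 0.70, 0.20],
])

src = ["le", "chat", "est", "noir", "."]   # source tokens (made up)
tgt = ["the", "cat", "is", "black"]        # target tokens (made up)

# The argmax of each row is a hard alignment guess for that target word.
for t, row in enumerate(alpha):
    print(f"{tgt[t]:>5} -> {src[row.argmax()]}")
```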
Next time: more attention! I am super excited!