Sequence Model - Sequence Models & Attention Mechanism
Various Sequence To Sequence Architectures
Basic Models
Sequence to sequence model
![seq2seq](https://cloud.tsinghua.edu.cn/thumbnail/72511aff189144a4a713/1024/seq2seq.jpg)
Image captioning
Use a CNN (e.g. AlexNet) first to get a 4096-dimensional feature vector of the image, then feed it into an RNN that generates the caption.
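A minimal PyTorch sketch of this pipeline, assuming torchvision's AlexNet for the 4096-dimensional feature and a GRU decoder (the layer sizes and the `CaptionModel` name are illustrative, not from the course):

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        cnn = alexnet(weights=None)  # pretrained weights omitted in this sketch
        self.features = cnn.features
        self.avgpool = cnn.avgpool
        # keep AlexNet's classifier up to the penultimate layer -> 4096-d vector
        self.fc = nn.Sequential(*list(cnn.classifier.children())[:-1])
        self.init_h = nn.Linear(4096, hidden)   # image vector -> initial RNN state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image, caption_tokens):
        x = self.avgpool(self.features(image)).flatten(1)
        v = self.fc(x)                                   # (batch, 4096)
        h0 = torch.tanh(self.init_h(v)).unsqueeze(0)     # (1, batch, hidden)
        e = self.embed(caption_tokens)
        o, _ = self.rnn(e, h0)
        return self.out(o)                               # logits over the vocabulary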
Picking the Most Likely Sentence
Translate a French sentence \(x\) into the most likely English sentence \(y\), i.e. find
\(\arg\max_{y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle}} P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)\)
Why not a greedy search?
(Picking the most likely word one at a time.) Because maximizing each word locally does not maximize the joint probability \(P(y \mid x)\); common words win each step, which tends to produce a more verbose, less likely sentence.
Beam Search
- Set the beam width \(B = 3\) and keep the \(3\) most likely first words of the English output.
- For each of the \(B\) kept prefixes, consider every possible second word and keep the \(B\) most likely two-word prefixes overall.
- Repeat, one word at a time, until \(<EOS>\) is produced (see the sketch below).

If \(B = 1\), beam search reduces to greedy search.
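A toy beam search over log probabilities; `step_log_probs` is a hypothetical callback standing in for the decoder RNN:

```python
import numpy as np

def beam_search(step_log_probs, B=3, eos=0, max_len=20):
    """step_log_probs(prefix) returns log P(next word | x, prefix) for the vocab."""
    beams = [([], 0.0)]                        # (token prefix, summed log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = step_log_probs(prefix)      # shape: (vocab_size,)
            for w in np.argsort(logp)[-B:]:    # only top-B per beam can survive
                candidates.append((prefix + [int(w)], score + float(logp[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:   # keep B best prefixes overall
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

Keeping only the top \(B\) candidates at each step is what makes this cheaper than exact search, at the cost of possibly missing the true \(\arg\max\).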
Refinements to beam search
Length normalization
Each term \(P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})\) is much less than \(1\), so their product underflows toward \(0\); take \(\log\) and maximize the sum of log probabilities instead.
Even then, the objective tends to favor short sentences (fewer negative terms to add).
So normalize by the output length (\(\alpha\) is a hyperparameter, often around \(0.7\); \(\alpha = 1\) is full normalization, \(\alpha = 0\) none):
\(\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})\)
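The normalized objective from the formula above, as a sketch (`log_probs` is the list of per-word log probabilities of one candidate sentence):

```python
def normalized_score(log_probs, alpha=0.7):
    # (1 / T_y**alpha) * sum_t log P(y<t> | x, y<1>, ..., y<t-1>)
    return sum(log_probs) / (len(log_probs) ** alpha)
```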
Beam search discussion
- large \(B\) : better result, slower
- small \(B\) : worse result, faster
Error Analysis in Beam Search
Let \(y^*\) be a high-quality human translation and \(\hat y\) the algorithm's output. For each dev-set error, compare the model's probabilities (tallied in the sketch below):
- \(P(y^* | x) > P(\hat y | x)\) : beam search is at fault (it failed to find the sentence the model prefers; try a larger \(B\))
- \(P(y^* | x) \le P(\hat y | x)\) : the RNN model is at fault (it assigns the worse translation a higher probability)
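A sketch of the bookkeeping, assuming you have already scored both sentences under the same RNN:

```python
from collections import Counter

def attribute_errors(pairs):
    """pairs: iterable of (log P(y*|x), log P(y_hat|x)) over dev-set errors.
    Counts which component is at fault for each mistranslated example."""
    faults = Counter()
    for logp_star, logp_hat in pairs:
        faults["beam search" if logp_star > logp_hat else "RNN model"] += 1
    return faults
```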
BLEU (Bilingual Evaluation Understudy) Score
Given one or several good human reference translations, BLEU scores how closely the machine output matches them.
BLEU details
Combined score: \(BP \cdot \exp\left(\frac{1}{4} \sum_{n = 1}^{4} \log p_n\right)\), where \(p_n\) is the modified (clipped) precision on \(n\)-grams.
\(BP\) (brevity penalty): \(1\) if the output is longer than the reference, else \(\exp(1 - \text{reference\_length} / \text{output\_length})\);
it penalizes translations that are too short.
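A toy single-sentence BLEU with clipped n-gram counts and the brevity penalty (real implementations, e.g. corpus-level BLEU, differ in details such as smoothing; `candidate` is assumed non-empty):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, N=4):
    """candidate: list of words; references: list of lists of words."""
    log_p = 0.0
    for n in range(1, N + 1):
        cand = Counter(ngrams(candidate, n))
        # clip: each candidate n-gram is credited at most as many
        # times as it appears in any single reference
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_p += math.log(max(clipped / total, 1e-9)) / N  # floor avoids log(0)
    closest = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(closest) \
        else math.exp(1 - len(closest) / len(candidate))
    return bp * math.exp(log_p)
```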
Attention Model Intuition
With a plain encoder-decoder, the network must memorize the whole sentence in one fixed-length vector, so the BLEU score drops off for long sentences.
![bleu_table](https://cloud.tsinghua.edu.cn/thumbnail/72511aff189144a4a713/1024/bleu_table.png)
Instead, compute attention weights so that each output word is predicted from a context focused on the relevant part of the input.
![Attention_Model_Intuition](https://cloud.tsinghua.edu.cn/thumbnail/72511aff189144a4a713/1024/Attention_Model_Intuition.png)
Attention Model
Use a BiRNN (e.g. a BiLSTM or BiGRU) as the encoder.
![attention_model](https://cloud.tsinghua.edu.cn/thumbnail/72511aff189144a4a713/1024/attention_model.png)
Computing attention
Train a very small network (one hidden layer) to learn the alignment scores \(e^{\langle t, t' \rangle}\) from \(s^{\langle t-1 \rangle}\) and \(a^{\langle t' \rangle}\), then softmax them into weights \(\alpha^{\langle t, t' \rangle} = \frac{\exp(e^{\langle t, t' \rangle})}{\sum_{t'' = 1}^{T_x} \exp(e^{\langle t, t'' \rangle})}\).
The complexity is \(\mathcal O(T_x T_y)\) (quadratic cost), which is significant, though usually acceptable for sentence-length sequences.
![computing_attention](https://cloud.tsinghua.edu.cn/thumbnail/72511aff189144a4a713/1024/computing_attention.png)
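A NumPy sketch of one decoder step, with assumed shapes for the small scoring network's parameters `W` and `v`:

```python
import numpy as np

def attention_context(s_prev, a, W, v):
    """s_prev: (n_s,) previous decoder state; a: (T_x, n_a) encoder activations.
    W: (n_h, n_s + n_a) and v: (n_h,) parameterize the small one-hidden-layer
    network that scores each input position."""
    T_x = a.shape[0]
    e = np.empty(T_x)
    for t in range(T_x):                 # O(T_x) per output step -> O(T_x * T_y) total
        h = np.tanh(W @ np.concatenate([s_prev, a[t]]))
        e[t] = v @ h                     # scalar alignment score e^<t, t'>
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                 # softmax -> attention weights alpha^<t, t'>
    return alpha @ a                     # context vector: weighted sum of a^<t'>
```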
Speech Recognition - Audio Data
Speech recognition
\(x(\text{audio clip}) \to y(\text{transcript})\)
Attention model for speech recognition
Generate the transcript character by character with the same attention mechanism.
CTC cost for speech recognition
CTC (Connectionist Temporal Classification)
"ttt_h_eee___ ____qqq\(\dots\)" \(\rightarrow\) "the quick brown fox"
Basic rule: collapse repeated characters not separated by "blank"
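The collapse rule in code (a sketch; `_` stands for the blank symbol):

```python
def ctc_collapse(chars, blank="_"):
    """Merge repeated characters, then drop blanks."""
    out = []
    prev = None
    for c in chars:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

# ctc_collapse("ttt_h_eee___ ____qqq") -> "the q"
```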
Trigger Word Detection
Label the training audio so that, right after the trigger word is said, the target output becomes a run of \(1\)s (and \(0\) elsewhere); a run of 1s rather than a single 1 reduces the label imbalance.
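A label-construction sketch (the 50-step run of 1s is an assumption, matching common practice in the course's programming assignment):

```python
import numpy as np

def make_labels(T_y, trigger_end_steps, ones=50):
    """After each output time step where the trigger word ends,
    set the next `ones` targets to 1; everything else stays 0."""
    y = np.zeros(T_y)
    for t in trigger_end_steps:
        y[t + 1 : t + 1 + ones] = 1.0
    return y
```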