Sequence Model - Sequence Models & Attention Mechanism

Various Sequence To Sequence Architectures

Basic Models

Sequence to sequence model

(figure: seq2seq)

Image captioning

Use a CNN (e.g. AlexNet) to encode the image into a 4096-dimensional feature vector, then feed that vector into an RNN that generates the caption word by word.
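A minimal sketch of this encoder-decoder pattern in PyTorch; the GRU decoder, hidden size, and untrained AlexNet weights are illustrative assumptions, not the course's exact design:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        alexnet = models.alexnet(weights=None)  # untrained, for shape demo only
        # keep AlexNet up to its last 4096-d hidden layer (drop the class head)
        self.encoder = nn.Sequential(
            alexnet.features, alexnet.avgpool, nn.Flatten(),
            *list(alexnet.classifier.children())[:-1],
        )
        self.init_h = nn.Linear(4096, hidden_size)  # image vector -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, images, captions):
        feat = self.encoder(images)                  # (B, 4096)
        h0 = torch.tanh(self.init_h(feat)).unsqueeze(0)
        emb = self.embed(captions)                   # (B, T, hidden)
        out, _ = self.rnn(emb, h0)
        return self.out(out)                         # (B, T, vocab) logits
```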

Picking the Most Likely Sentence

Goal: translate a French sentence $x$ into the most likely English sentence $y$, i.e. find

$$\arg\max_{y^{<1>},\dots,y^{<T_y>}} P\left(y^{<1>},\dots,y^{<T_y>} \mid x\right)$$

  • Why not a greedy search (picking the most likely word one at a time)? Because the locally best next word need not lead to the globally most likely sentence: greedy search favors common words and tends to produce verbose, less likely translations.

  • Set the beam width $B=3$ and keep the 3 most likely first English words.
  • For each of those, consider every candidate second word, then keep the $B$ most likely two-word prefixes overall.

    (figure: beam_search)
  • Repeat until the candidates end in <EOS>.

If $B=1$, beam search reduces to greedy search.
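A minimal sketch of the loop, assuming a hypothetical `step_log_probs(prefix)` callable that returns $\log P(\text{word} \mid x, \text{prefix})$ for every vocabulary word:

```python
def beam_search(step_log_probs, eos, B=3, max_len=20):
    """Beam search sketch. step_log_probs(prefix) is an assumed callable
    returning {word: log P(word | x, prefix)} over the vocabulary."""
    beams = [([], 0.0)]          # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + [w], score + lp)
            for prefix, score in beams
            for w, lp in step_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:  # keep only the B best prefixes
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                         # every surviving beam hit <EOS>
            break
    return max(finished + beams, key=lambda c: c[1])
```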

Length normalization

$$\arg\max_{y} \prod_{t=1}^{T_y} P\left(y^{<t>} \mid x, y^{<1>},\dots,y^{<t-1>}\right)$$

Each factor $P$ is much less than 1, so the product quickly underflows toward 0; take logs instead:

$$\arg\max_{y} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x, y^{<1>},\dots,y^{<t-1>}\right)$$

Since every log term is negative, this objective still tends to prefer short sentences.

So normalize by the output length ($\alpha$ is a hyperparameter, commonly around 0.7):

$$\arg\max_{y} \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x, y^{<1>},\dots,y^{<t-1>}\right)$$
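A small numeric illustration (made-up log-probabilities): the longer candidate loses on the raw sum but wins once scores are divided by $T_y^{\alpha}$:

```python
def normalized_score(log_probs, alpha=0.7):
    """Length-normalized beam score (alpha=0.7 is a commonly used value)."""
    return sum(log_probs) / (len(log_probs) ** alpha)

short = [-1.0, -1.2]              # made-up per-word log-probabilities
long_ = [-0.6, -0.7, -0.8, -0.9]
print(sum(short), sum(long_))     # raw sums: short wins (-2.2 > -3.0)
print(normalized_score(short), normalized_score(long_))  # normalized: long wins
```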

Beam search discussion

  • large B : better result, slower
  • small B : worse result, faster

For error analysis, let $y^*$ be a high-quality human translation and $\hat{y}$ the algorithm's output.

  • $P(y^* \mid x) > P(\hat{y} \mid x)$: beam search is at fault (it failed to find the higher-probability sentence).
  • $P(y^* \mid x) \le P(\hat{y} \mid x)$: the RNN model is at fault (it rates the worse sentence higher).

BLEU (Bilingual Evaluation Understudy) Score

Given one or more good human reference translations, you can score the machine output against them. The modified (clipped) n-gram precision is

$$p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{Count}(\text{n-gram})}$$

BLEU details

Calculate the combined score as

$$\text{BLEU} = \text{BP} \cdot \exp\left(\frac{1}{4}\sum_{n=1}^{4} \log p_n\right)$$

where BP is the brevity penalty:

$$\text{BP} = \begin{cases} 1 & \text{if MT\_output\_length} > \text{reference\_output\_length} \\ \exp\left(1 - \text{reference\_output\_length}/\text{MT\_output\_length}\right) & \text{otherwise} \end{cases}$$

It penalizes translations that are too short.
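A self-contained sketch of the computation (it assumes every $p_n > 0$; real implementations smooth zero counts, so the toy demo below uses bigram BLEU):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n: each n-gram counts at most as often
    as it appears in any single reference."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, references, N=4):
    ref = min(references, key=lambda r: abs(len(r) - len(candidate)))
    bp = 1.0 if len(candidate) > len(ref) else math.exp(1 - len(ref) / len(candidate))
    log_p = sum(math.log(modified_precision(candidate, references, n))
                for n in range(1, N + 1))
    return bp * math.exp(log_p / N)

cand = "the cat is on the mat".split()
refs = ["the cat sits on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(cand, refs, N=2), 3))  # toy sentences: bigram BLEU = 0.775
```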

Attention Model Intuition

For long sentences it is hard for the encoder network to memorize the whole sentence in a single fixed-length vector, so translation quality degrades as sentence length grows.

(figure: bleu_table)

Instead, compute attention weights so that each output word is predicted from a weighted context over the input words.

(figure: Attention_Model_Intuition)

Attention Model

Use a BiRNN or BiLSTM as the encoder, and concatenate its forward and backward activations at each input time step $t'$:

$$a^{<t'>} = \left(\overrightarrow{a}^{<t'>}, \overleftarrow{a}^{<t'>}\right), \qquad \sum_{t'} \alpha^{<t,t'>} = 1, \qquad c^{<t>} = \sum_{t'} \alpha^{<t,t'>}\, a^{<t'>}$$

The attention weights $\alpha^{<t,t'>}$ sum to 1 over the input positions, and the context $c^{<t>}$ fed to the decoder at output step $t$ is their weighted sum of encoder activations.

(figure: attention_model)

Computing attention

$$\alpha^{<t,t'>} = \text{amount of "attention" } y^{<t>} \text{ should pay to } a^{<t'>} = \frac{\exp\left(e^{<t,t'>}\right)}{\sum_{t'=1}^{T_x} \exp\left(e^{<t,t'>}\right)}$$

Train a very small network, taking the previous decoder state $s^{<t-1>}$ and the encoder activation $a^{<t'>}$ as input, to learn the scoring function $e^{<t,t'>}$, since we don't know its form analytically.

The complexity is $O(T_x T_y)$, i.e. quadratic in sentence length, which is costly but acceptable for translation-length sentences.

(figure: computing_attention)
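A numpy sketch of one attention step; the one-layer scoring network and its weight vector `W` are illustrative assumptions:

```python
import numpy as np

def attention_step(s_prev, a, W):
    """One attention step (sketch). s_prev: (n_s,) previous decoder state;
    a: (Tx, n_a) concatenated BiRNN activations; W: assumed weights of a
    tiny one-layer network producing the scores e^{<t,t'>}."""
    e = np.array([np.tanh(np.concatenate([s_prev, a_t])) @ W for a_t in a])
    alpha = np.exp(e - e.max())   # softmax over the Tx input positions
    alpha /= alpha.sum()          # attention weights sum to 1
    c = alpha @ a                 # context vector c^{<t>}
    return c, alpha

# toy usage with random sizes
rng = np.random.default_rng(0)
s_prev, a = rng.normal(size=10), rng.normal(size=(7, 8))
c, alpha = attention_step(s_prev, a, W=rng.normal(size=18))
```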

Speech Recognition - Audio Data

Speech recognition

$x$ (audio clip) $\rightarrow$ $y$ (transcript)

Attention model for speech recognition

An attention model can generate the transcript character by character.

CTC cost for speech recognition

CTC (Connectionist Temporal Classification)

"ttt_h_eee___ ____qqq" "the quick brown fox"

Basic rule: collapse repeated characters not separated by "blank"
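A direct implementation of that collapse rule (note that blank and space are different characters):

```python
def ctc_collapse(seq, blank="_"):
    """Collapse repeated characters not separated by the blank symbol."""
    out = []
    prev = None
    for ch in seq:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee___ ____qqq"))  # -> "the q"
```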

Trigger Word Detection

In the label sequence, set the targets to 1 for a fixed number of steps right after the trigger word ends (and 0 everywhere else); outputting several 1s instead of a single 1 also balances the otherwise rare positive labels.
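A sketch of building such labels; the 1375 output steps and the 50-step window of 1s are illustrative values, not fixed by these notes:

```python
import numpy as np

def insert_ones(y, end_step, n_ones=50):
    """Set the n_ones labels right after the trigger word ends to 1."""
    y[end_step + 1 : end_step + 1 + n_ones] = 1
    return y

Ty = 1375                 # assumed number of output time steps
y = np.zeros(Ty)
y = insert_ones(y, end_step=700)
```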

