【Andrew Ng's Team NLP Course 4】Attention Models in NLP

Neural Machine Translation#

Outline

  • Introduction to Neural Machine Translation
  • Seq2Seq model and its shortcomings
  • Solution for the information bottleneck

Seq2Seq model#

  • Introduced by Google in 2014
  • Maps variable-length sequences to fixed-length memory
  • LSTMs and GRUs are typically used to overcome the
    vanishing gradient problem
image-20220226142611226

The information bottleneck#

image-20220226142747938

Because the encoder's hidden state has a fixed size, longer sequences become a bottleneck on their way into the decoder, so the model performs well on short sequences but poorly on long ones.

Seq2Seq shortcomings#

  • Variable-length sentences + fixed-length memory =

    image-20220226143029522
  • As sequence size increases, model performance decreases

One vector per word#

image-20220226143725584

Each vector carries the information for one word, rather than compressing everything into one large vector after processing.

This model, however, still has clear shortcomings in both memory and context; the question is how to build a model, efficient in time and memory, that can predict accurately from long sequences.

Solution: focus attention in the right place#

  • Prevent sequence overload by giving the model a way to focus on the likeliest words at each step
  • Do this by providing information specific to each input word
image-20220226144711270

Alignment#

Motivation for alignment

We want each input word to align with its corresponding output word.

Correctly aligned words are the goal:

  • Translating from one language to another

  • Word sense discovery and disambiguation

    For example, "bank" can be a financial institution or a riverbank; translating the word into another language and seeing how it is rendered there helps reveal which meaning is intended.

  • Achieve alignment with a system for retrieving information step by
    step and scoring it

Word alignment
image-20220226145711631

The English translation has more words than the German source, so alignment needs to capture the relationships between the words.

Which word to pay more attention to?

To align correctly, an extra layer is added so that the decoder can learn which inputs are more important for each prediction.

image-20220226150116625
Give some inputs more weight!
image-20220226150219599

Thicker lines indicate greater influence.

Calculating alignment
image-20220226150450672

Attention#

Outline

  • Concept of attention for information retrieval
  • Keys, Queries, and Values

Information retrieval#

Information retrieval analogy: suppose you are looking for your keys.
You ask your mom to help you find them.
She weighs the likely spots based on where the keys usually end up, and tells you the most probable place.
That is what attention is doing: it uses your query to look in the right places and find the keys.

Inside the attention layer#

image-20220227145235066

Attention#

Keys and queries form one matrix, with the words of one query (Q) as the columns and the words of the keys (K) as the rows. Value scores (V) are assigned based on the closeness of the match.

Attention = softmax(QKᵀ) V

image-20220227145648724
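A minimal NumPy sketch of this formula; the matrix shapes and toy numbers are illustrative assumptions, not the course's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention = softmax(Q K^T) V
    scores = Q @ K.T              # similarity of each query word to each key word
    weights = softmax(scores)     # each row sums to 1
    return weights @ V            # weighted combination of the value vectors

# Toy example: 2 query words, 3 key/value words, embedding size 4.
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(attention(Q, K, V).shape)   # (2, 4)
```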

Neural machine translation with attention#

image-20220227150320584

Flexible attention#

For languages with different grammar structures, attention still attends to the correct tokens relating them.

image-20220227150504369

Summary#

  • Attention is an added layer that lets a model focus on what's important
  • Queries, Values, and Keys are used for information retrieval inside the Attention layer
  • This flexible system finds matches even between languages with very different grammatical structures

Setup for machine translation#

Data in machine translation#

image-20220227151240949

Machine translation setup#

State-of-the-art uses pre-trained vectors

Otherwise, represent words with one-hot vectors to create the input

Keep track of index mappings with word2ind and ind2word dictionaries

Use start-of and end-of sequence tokens:

image-20220227151615157

Preparing to translate to German#

The 1 at the end is the end-of-sequence token, and the trailing 0s are padding added so that paired sequences have the same length.

image-20220227152048197 image-20220227155917629
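A small sketch of this setup in plain Python. The ids 0 for padding and 1 for end-of-sequence follow the figures above; the helper names and the example sentences are hypothetical illustrations:

```python
PAD, EOS = 0, 1  # reserved ids, matching the figures: 1 ends a sequence, 0 pads it

def build_vocab(sentences):
    # word2ind / ind2word dictionaries, reserving ids 0 and 1 for the special tokens
    words = sorted({w for s in sentences for w in s.split()})
    word2ind = {w: i + 2 for i, w in enumerate(words)}
    ind2word = {i: w for w, i in word2ind.items()}
    return word2ind, ind2word

def encode(sentence, word2ind, max_len):
    ids = [word2ind[w] for w in sentence.split()] + [EOS]
    return ids + [PAD] * (max_len - len(ids))   # pad with 0s up to a fixed length

word2ind, ind2word = build_vocab(["I am hungry", "Ich habe Hunger"])
print(encode("Ich habe Hunger", word2ind, max_len=8))
```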

Training an NMT model with attention#

  • Teacher forcing
  • Model for NMT with attention

How to know predictions are correct?#

Teacher forcing allows the model to "check its work" at each step.

In other words, its prediction is compared against the real output during training.

Result: faster, more accurate training

Teacher forcing: motivation#

image-20220227162659669

An error at one step makes the prediction at the next step even more wrong, so the prediction at every step needs to be checked.

image-20220227163559809

The predicted token (green box) is not used to predict the next one; instead, the actual target output is fed to the decoder as its input.
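A schematic sketch of this idea, assuming a hypothetical `decoder_step` function; the point is only that the ground-truth token, not the model's previous prediction, is fed in at each training step:

```python
def decode_with_teacher_forcing(decoder_step, state, target_ids):
    """target_ids: ground-truth output tokens, already shifted to start with <SOS>."""
    predictions = []
    for true_token in target_ids[:-1]:
        # Feed the *actual* previous token, not the model's own (possibly wrong) prediction.
        state, predicted_token = decoder_step(state, true_token)
        predictions.append(predicted_token)
    return predictions  # compared against target_ids[1:] in the loss

# Toy decoder that just echoes its input, only to make the sketch runnable.
toy_step = lambda state, token: (state, token)
print(decode_with_teacher_forcing(toy_step, state=None, target_ids=[1, 5, 7, 9]))
```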

Training NMT#

image-20220227165135354

The pre-attention decoder maps the prediction targets into a different vector space to produce the query vectors. Its input is the target tokens shifted right by one position, with a start-of-sentence token at the beginning of each sequence.

image-20220227165509074

The four layers on the right form the decoder.

Evaluation for machine translation#

BLEU Score#

Stands for Bilingual Evaluation Understudy

Evaluates the quality of machine-translated text by comparing "candidate" text to one or more "reference" translations.
Scores: the closer to 1, the better, and vice versa:

image-20220227170918414

e.g.

image-20220227171015419

How many words in the candidate column appear in the reference translations?

"I"appears at most once in both,so clip to one:

mw=1

(Sum over unique ngram counts in the candidate)(total # of words in candidate)

image-20220227171515177
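A sketch of the clipped unigram precision underlying BLEU; real BLEU also combines higher-order n-grams and a brevity penalty, and the example strings below are made up:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    cand_counts = Counter(candidate.split())
    # A word counts at most as many times as it appears in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

print(clipped_unigram_precision("I I I am I", ["I am hungry", "I am starving"]))  # 0.4
```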

The BLEU score is great, but...

Consider the following:

  • BLEU doesn't consider semantic meaning
  • BLEU doesn't consider sentence structure:
    "Ate I was hungry because!"

ROUGE#

Recall-Oriented Understudy for Gisting Evaluation
Evaluates quality of machine text
Measures precision and recall between generated text and human-created text

An example using unigrams:

The reference is the ideal output we hope the model produces.

image-20220227172624907

Recall = How much of the reference text is the system text capturing?
Precision = How much of the model text was relevant?

image-20220227172757597 image-20220227172918201
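A sketch of unigram (ROUGE-1) recall and precision following the definitions above; the overlap count is a simplification and the example sentences are made up:

```python
def rouge1(system, reference):
    sys_words, ref_words = system.split(), reference.split()
    overlap = sum(1 for w in ref_words if w in sys_words)  # simple (unclipped) overlap count
    recall = overlap / len(ref_words)      # how much of the reference the system captured
    precision = overlap / len(sys_words)   # how much of the system output was relevant
    return recall, precision

print(rouge1("the cat sat on the mat today", "the cat sat on the mat"))  # (1.0, ~0.857)
```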

Problems in ROUGE

  • Doesn't take themes or concepts into consideration (i.e., a low ROUGE score doesn't necessarily mean the translation is bad)

    image-20220227173030809

Summary#

  • BLEU score compares a "candidate" against "references" using an n-gram average
  • BLEU doesn't consider meaning or structure
  • ROUGE measures machine-generated text against an "ideal" reference

Sampling and decoding#

Outline

  • Random sampling
  • Temperature in sampling
  • Greedy decoding
  • Beam search
  • Minimum Bayes risk (MBR)

Greedy decoding#

Selects the most probable word at each step. But the best word at each step may not be the best for longer sequences...

image-20220227174332089
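A minimal sketch of greedy decoding over a step-by-step probability function; `next_word_probs` is a hypothetical stand-in for the trained decoder:

```python
import numpy as np

def greedy_decode(next_word_probs, start_token, eos_token, max_len=20):
    tokens = [start_token]
    for _ in range(max_len):
        probs = next_word_probs(tokens)        # distribution over the vocabulary
        tokens.append(int(np.argmax(probs)))   # always take the single most probable word
        if tokens[-1] == eos_token:
            break
    return tokens

# Toy model over a 4-word vocabulary that always favors token 3 (the EOS here).
toy_probs = lambda tokens: np.array([0.1, 0.2, 0.1, 0.6])
print(greedy_decode(toy_probs, start_token=0, eos_token=3))   # [0, 3]
```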

Random sampling#

image-20220227174520407

Often a little too random for accurate translation!
Solution: Assign more weight to more probable words, and less weight to less probable words.

Temperature#

In sampling, temperature is a parameter allowing for more or less randomness in predictions

Lower temperature setting = more confident, conservative network

Higher temperature setting = more excited, random network (and more mistakes)

image-20220227174817249
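A sketch of temperature-scaled sampling: the logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution (more conservative) and high temperatures flatten it (more random). The logit values are made-up numbers:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    scaled = np.asarray(logits) / temperature   # lower temperature -> sharper distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.2, rng=rng))  # almost always index 0
print(sample_with_temperature(logits, temperature=5.0, rng=rng))  # much more random
```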

Beam search decoding#

A broader, more exploratory decoding alternative

Selects multiple options for the best input based on conditional probability

The number of options depends on a predetermined beam width parameter B

Selects the B best alternatives at each time step

e.g.

image-20220227181034808
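A compact sketch of beam search with beam width B, assuming a hypothetical `next_word_log_probs` function that returns log-probabilities for the next token given a prefix:

```python
import numpy as np

def beam_search(next_word_log_probs, start_token, eos_token, beam_width=3, max_len=10):
    beams = [([start_token], 0.0)]             # each beam: (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_token:           # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, lp in enumerate(next_word_log_probs(seq)):
                candidates.append((seq + [token], score + lp))
        # Keep only the B best alternatives at this time step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy model: token 2 (used as the EOS here) is always the likeliest next word.
toy_log_probs = lambda seq: np.log([0.2, 0.3, 0.5])
print(beam_search(toy_log_probs, start_token=0, eos_token=2, beam_width=2))   # [0, 2]
```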

Problems with beam search

Since the model learns a distribution, that tends to carry more weight than single tokens

Can cause translation problems, e.g. in a speech corpus that hasn't been cleaned

image-20220227181205930 image-20220227181309888

Minimum Bayes Risk(MBR)#

Compares many samples against one another. To implement MBR:

  • Generate several random samples
  • Compare each sample against all the others and assign a similarity score (such as ROUGE!)
  • Select the sample with the highest similarity: the golden one

Example: MBR Sampling

To generate the scores for 4 samples:

  1. Calculate the similarity score between sample 1 and sample 2
  2. Calculate the similarity score between sample 1 and sample 3
  3. Calculate the similarity score between sample 1 and sample 4
  4. Average the scores of the first 3 steps (usually a weighted average)
  5. Repeat until all samples have overall scores (a sketch follows below)
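A sketch of this procedure; the word-overlap similarity is only a stand-in for a real metric such as ROUGE, and the sample sentences are made up:

```python
def overlap_similarity(a, b):
    # Stand-in for a real similarity metric such as ROUGE.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def mbr_select(samples):
    best, best_score = None, -1.0
    for i, s in enumerate(samples):
        # Average similarity of this sample against all the others.
        scores = [overlap_similarity(s, t) for j, t in enumerate(samples) if j != i]
        score = sum(scores) / len(scores)
        if score > best_score:
            best, best_score = s, score
    return best  # the "golden one"

samples = ["I am hungry", "I am so hungry", "I am starving", "he is full"]
print(mbr_select(samples))   # "I am hungry"
```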

Summary#

  • Beam search uses conditional probabilities and the beam width parameter
  • MBR (Minimum Bayes Risk) takes several samples and compares them against each other to find the golden one
  • Go forth to the coding assignment!

Transformer#

Outline

  • Issues with RNNs
  • Comparison with Transformers

Transformer vs. RNN#

Neural Machine Translation#

image-20220227184039191

Seq2Seq Architectures#

image-20220227184243572

Comparison#

Transformers are based on attention: no sequential computation is needed within a layer, gradients only have to propagate through a single step, and there is no vanishing gradient problem.

image-20220227184618458

Multi-headed attention#

self-attention

image-20220227184736706

Self-attention can be understood as the model attending to its own input: the Q, K, and V for each input each pass through their own dense layer.

Adding several parallel sets of self-attention gives what are called heads.

Multi-headed attention

image-20220227185425947

Positional Encoding#

Transformers also include a positional encoding stage that encodes each input's position in the sequence, since word order and position are important for any language.

image-20220227190409588
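A sketch of the sinusoidal positional encoding used in the original Transformer paper (assuming the standard sin/cos formulation; the dimensions are toy values):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# The encoding is added to the word embeddings so the model can tell positions apart.
print(positional_encoding(max_len=50, d_model=16).shape)   # (50, 16)
```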

Summary#

  • In RNNs parallel computing is difficult to implement
  • For long sequences in RNNs there is loss of information
  • In RNNs there is the problem of vanishing gradient
  • Transformers help with all of the above

Application#

Outline

  • Transformers applications in NLP
  • Some Transformers
  • Introduction to T5

Transformers applications in NLP#

image-20220227190827697

Some Transformers#

image-20220227190950138

T5: Text-To-Text Transfer Transformer#

image-20220227191724803

There is no need to train a separate model for each task; a single model can handle all of them.

image-20220227191956334

Summary#

  • Transformers are suitable for a wide range of NLP applications
  • GPT-2, BERT, and T5 are cutting-edge Transformers
  • T5 is a powerful multi-task transformer

Dot-product attention#

image-20220227212947204

Queries, Keys and Values#

image-20220227213528792

Concept of attention#

image-20220227213704719

Attention math#

image-20220227214020749

image-20220227214600928

Summary#

  • Dot-product attention is essential for the Transformer
  • The inputs to attention are queries, keys, and values
  • A softmax function makes attention more focused on the best keys
  • GPUs and TPUs are advisable for matrix multiplications

Causal Attention#

Outline

  • Ways of Attention
  • Overview of Causal Attention
  • Math behind causal attention

Three ways of attention#

image-20220227215942918

Causal attention#

  • Queries and keys are words from the same sentence
  • Queries should only be allowed to look at words that come before them

image-20220227220105873

Math#

image-20220227220215369

image-20220227220416623
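A sketch of causal (masked) self-attention in NumPy: positions after the current query get a large negative score before the softmax, so they receive essentially zero attention weight. It reuses the same softmax(QKᵀ)V form as before:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    scores = Q @ K.T
    n = scores.shape[0]
    # Mask out "future" positions: query i may only look at keys 0..i.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

x = np.random.randn(5, 8)                # 5 words from the same sentence
print(causal_attention(x, x, x).shape)   # (5, 8); queries and keys come from the same sentence
```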

Summary#

  • There are three main ways of attention: encoder/decoder, causal, and bi-directional
  • In causal attention, queries and keys come from the same sentence, and queries only attend to words that come before them

Multi-head attention#

  • Each head uses different linear transformations to represent words

  • Different heads can learn different relationships between words

    image-20220227220742152
image-20220227220911025

Concatenation#

image-20220227221134099

Math#

image-20220227221416675 image-20220227221501459
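A sketch of multi-head attention: the embedding dimension is split across heads, scaled dot-product attention is run per head, and the results are concatenated. The per-head linear transformations are simplified to random projection matrices, and the final output projection is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng=np.random.default_rng(0)):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Each head uses its own linear transformations for Q, K, and V.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)        # scaled dot-product attention per head
        head_outputs.append(softmax(scores) @ V)
    return np.concatenate(head_outputs, axis=-1)  # concatenate the heads back to d_model

X = np.random.randn(6, 16)                        # 6 words, embedding size 16
print(multi_head_attention(X, n_heads=4).shape)   # (6, 16)
```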

Summary#

  • Different heads can learn different relationships between words
  • Scaled dot-product attention is adequate for Multi-Head Attention
  • Multi-Headed models attend to information from different representations at different positions

Transformer decoder#

image-20220227222305258 image-20220227222443521 image-20220227222844956 image-20220227223427538

Summary#

  • Transformer decoder mainly consists of three layers
  • Decoder and feed-forward blocks are the core of this model
  • It also includes a module to calculate the cross-entropy loss

Transformer summarizer#

Outline

  • Overview of Transformer summarizer
  • Technical details for data processing
  • Inference with a Language Model

Overview#

image-20220227225132762

Data processing#

The 1 is the <EOS> token.

image-20220227225625387

Loss weights are used so that the loss focuses on the summary.

Cost function#

image-20220227225716270
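A sketch of the weighted cross-entropy idea: loss weights of 0 on the article tokens and 1 on the summary tokens, so only the summary part contributes to the cost. The shapes and token layout are illustrative assumptions:

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    """log_probs: (seq_len, vocab) log-probabilities; targets: (seq_len,) token ids;
    weights: 0 for article positions, 1 for summary positions."""
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return (token_losses * weights).sum() / max(weights.sum(), 1)

# Toy sequence: 4 article tokens followed by 3 summary tokens, uniform predictions.
vocab, seq_len = 10, 7
log_probs = np.log(np.full((seq_len, vocab), 1.0 / vocab))
targets = np.array([2, 5, 1, 7, 3, 4, 1])
weights = np.array([0, 0, 0, 0, 1, 1, 1])
print(weighted_cross_entropy(log_probs, targets, weights))   # ~2.303 = -log(1/10)
```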

Inference with a Language Model#

image-20220227225944252

Summary#

  • For summarization, a weighted loss function is optimized
  • The Transformer decoder summarizes by predicting the next word, one token at a time
  • The transformer uses tokenized versions of the input