[Andrew Ng Team's NLP Specialization, Course 4] Attention Models in NLP

Neural Machine Translation

Outline

  • Introduction to Neural Machine Translation
  • Seq2Seq model and its shortcomings
  • Solution for the information bottleneck

Seq2Seq model

  • Introduced by Google in 2014
  • Maps variable-length sequences to fixed-length memory
  • LSTMs and GRUs are typically used to overcome the
    vanishing gradient problem
image-20220226142611226

The information bottleneck

image-20220226142747938

Because the encoder's hidden state has a fixed size, long sequences become a bottleneck on their way into the decoder, so the model performs well on short sequences but poorly on long ones.

Seq2Seq shortcomings

  • Variable-length sentences + fixed-length memory =

    image-20220226143029522
  • As sequence size increases, model performance decreases

One vector per word

image-20220226143725584

Each word carries its own vector instead of all the content being compressed into one big vector.

However, this model still has clear drawbacks in terms of both memory and context. How do we build a model that is efficient in time and memory and can still predict accurately from long sequences?

Solution: focus attention in the right place

  • Prevent sequence overload by giving the model a way to focus on the likeliest words at each step
  • Do this by providing information specific to each input word
image-20220226144711270

Alignment

Motivation for alignment

We want each input word to align with its corresponding output word.

Correctly aligned words are the goal:

  • Translating from one language to another

  • Word sense discovery and disambiguation

    "bank" could mean a financial institution or a riverbank; translating the word into another language and seeing how it is rendered there helps pin down which sense is meant.

  • Achieve alignment with a system for retrieving information step by
    step and scoring it

Word alignment
image-20220226145711631

The English translation has more words than the German source, so alignment has to identify the relationships between the words.

Which word to pay more attention to?

To align correctly, the model needs an additional layer that learns which inputs are more important for each prediction.

image-20220226150116625
Give some inputs more weight!
image-20220226150219599

Thicker lines indicate greater influence.

Calculating alignment
image-20220226150450672

Attention

Outline

  • Concept of attention for information retrieval
  • Keys, Queries, and Values

Information retrieval

Information retrieval analogy: suppose you are looking for your keys.
You ask your mom to help you find them.
She weighs the likelihood of each place the keys usually end up and tells you the most likely spot.
That is what attention does: it uses your query to look in the right place and retrieve the keys.

Inside the attention layer

image-20220227145235066

Attention

Queries and keys form a single score matrix,

with the words of the queries (Q) as the rows and the words of the keys (K) as the columns; weights over the values (V) are assigned based on the closeness of each match.

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T)\,V \]

image-20220227145648724
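As a concrete illustration of the formula above, here is a minimal NumPy sketch of dot-product attention (the shapes and toy inputs are made up for the example; real systems batch this and usually scale the scores):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T                     # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # weighted sum of the values

# Toy example: 2 query words, 3 key/value words, embedding size 4.
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(attention(Q, K, V).shape)  # (2, 4)
```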

Neural machine translation with attention

image-20220227150320584

Flexible attention

For languages with different grammar structures, attention still finds the correct corresponding tokens between them

image-20220227150504369

Summary

  • Attention is an added layer that lets a model focus on what's important
  • Queries, Values, and Keys are used for information retrieval inside the Attention layer
  • This flexible system finds matches even between languages with very different grammatical structures

Setup for machine translation

Data in machine translation

image-20220227151240949

Machine translation setup

State-of-the-art systems use pre-trained vectors

Otherwise, represent words with one-hot vectors to create the input

Keep track of index mappings with word2ind and ind2word dictionaries

Use start-of-sequence and end-of-sequence tokens:

image-20220227151615157

Preparing to translate to German

Sequences end with token 1 (the end-of-sequence marker) followed by many 0s: zeros pad each sequence so the paired sequences have equal length.

image-20220227152048197 image-20220227155917629
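A minimal sketch of this setup, assuming the 0 = padding and 1 = <EOS> convention above; the tiny vocabulary and the `encode` helper are illustrative, not the course's actual tokenizer:

```python
PAD, EOS = 0, 1   # assumed ids: 0 pads, 1 marks end of sequence

sentences = ["I am hungry", "How are you today"]
vocab = sorted({w for s in sentences for w in s.split()})
word2ind = {w: i + 2 for i, w in enumerate(vocab)}   # reserve ids 0 and 1
ind2word = {i: w for w, i in word2ind.items()}

def encode(sentence, length):
    # Map words to ids, append <EOS>, then right-pad with zeros.
    ids = [word2ind[w] for w in sentence.split()] + [EOS]
    return ids + [PAD] * (length - len(ids))

max_len = max(len(s.split()) for s in sentences) + 1   # +1 for <EOS>
batch = [encode(s, max_len) for s in sentences]
print(batch)   # id sequences ending in 1 and padded with 0s
```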

Training a NMT attention

  • Teacher forcing
  • Model for NMT with attention

How to know predictions are correct?

Teacher forcing allows the model to "check its work" at each step

Or, put differently, to compare its prediction against the real output during training

Result: faster, more accurate training

Teacher forcing: motivation

image-20220227162659669

An error at this step makes the prediction at the next step even worse, which is why each step's prediction needs to be checked.

image-20220227163559809

The predicted token (green rectangle) is not used to predict the next one;

instead, the actual target output is fed to the decoder as its input.
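The idea can be sketched as a small decoding loop; `decoder_step` is a hypothetical function that takes the previous token and a hidden state and returns logits plus the new state:

```python
def decode_with_teacher_forcing(decoder_step, state, target_ids):
    """Feed the *ground-truth* previous token to the decoder at every step,
    rather than the token the model itself predicted at the previous step."""
    logits_per_step = []
    prev_token = target_ids[0]          # start-of-sequence token
    for true_token in target_ids[1:]:
        logits, state = decoder_step(prev_token, state)
        logits_per_step.append(logits)
        prev_token = true_token         # teacher forcing: use the real target token
        # (without teacher forcing we would set prev_token = argmax(logits))
    return logits_per_step
```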

Training NMT

image-20220227165135354

The pre-attention decoder transforms the prediction targets into a different vector space, namely the query vectors.

The pre-attention decoder's target tokens are shifted right by one position, and the start-of-sentence token marks the beginning of each sequence.

image-20220227165509074

The four layers on the right form the decoder.

Evaluation for machine translation

BLEU Score

Stands for Bilingual Evaluation Understudy

Evaluates the quality of machine-translated text by comparing a "candidate" text to one or more "reference" translations.
Scores: the closer to 1, the better, and vice versa:

image-20220227170918414

e.g.

image-20220227171015419

How many words in the candidate column appear in the reference translations?

"I"appears at most once in both,so clip to one:

\[m_w=1\\ \]

\[\frac{(Sum\ over\ unique\ n-gram\ counts\ in\ the\ candidate) }{(total\ \#\ of\ words\ in\ candidate)} \]

image-20220227171515177
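A small sketch of the clipped unigram count described above. Full BLEU also combines higher-order n-grams and a brevity penalty, which this toy version leaves out; the candidate and references are made-up examples:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    cand_counts = Counter(candidate.lower().split())
    clipped = 0
    for word, count in cand_counts.items():
        # Clip each word's count by the most times it appears in any one reference.
        max_in_refs = max(Counter(ref.lower().split())[word] for ref in references)
        clipped += min(count, max_in_refs)
    # (sum over unique unigram counts) / (total number of words in the candidate)
    return clipped / sum(cand_counts.values())

candidate = "I I am am on the mat"
references = ["I am on the mat", "the cat is on the mat"]
print(clipped_unigram_precision(candidate, references))  # 5/7 ≈ 0.71
```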

BLEU score is great, but..

Consider the following:

  • BLEU doesn't consider semantic meaning
  • BLEU doesn't consider sentence structure:
    "Ate I was hungry because!"

ROUGE

Recall-Oriented Understudy for Gisting Evaluation
Evaluates quality of machine text
Measures precision and recall between generated text and human-created text

An example using unigrams:

The reference is the ideal output we would like the model to produce

image-20220227172624907

Recall = How much of the reference text is the system text capturing?
Precision = How much of the model text was relevant?

image-20220227172757597 image-20220227172918201
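A minimal ROUGE-1 sketch matching the recall and precision definitions above (real ROUGE implementations also handle longer n-grams, stemming, and F-scores; the example sentences are made up):

```python
from collections import Counter

def rouge1(system, reference):
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Overlap = unigrams the system shares with the reference, with clipped counts.
    overlap = sum(min(c, ref_counts[w]) for w, c in sys_counts.items())
    recall = overlap / sum(ref_counts.values())      # share of the reference captured
    precision = overlap / sum(sys_counts.values())   # share of the system text that is relevant
    return recall, precision

print(rouge1("the cat sat on the mat", "the cat is on the mat"))  # (0.833..., 0.833...)
```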

Problems in ROUGE

  • Doesn't take themes or concepts into consideration (i.e., a low ROUGE score doesn't necessarily mean the translation is bad)

    image-20220227173030809

Summary

  • The BLEU score compares a "candidate" against "references" using an n-gram average
  • BLEU doesn't consider meaning or structure
  • ROUGE measures machine-generated text against an "ideal" reference

Sampling and decoding

Outline

  • Random sampling
  • Temperature in sampling
  • Greedy decoding
  • Beam search
  • Minimum Bayes risk (MBR)

Greedy decoding

Selects the most probable word at each step. But the best word at each step may not be the best for longer sequences...

image-20220227174332089

Random sampling

image-20220227174520407

Often a little too random for accurate translation!
Solution: Assign more weight to more probable words, and less weight to less probable words.

Temperature

In sampling, temperature is a parameter allowing for more or less randomness in predictions

Lower temperature setting = more confident, conservative network

Higher temperature setting = more excited, random network (and more mistakes)

image-20220227174817249
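A sketch of temperature sampling over a toy logit vector; the logits and the temperature values are illustrative:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Sample a token id from the logits. A temperature close to 0 approaches
    greedy decoding (always the most probable word); a temperature above 1
    flattens the distribution and makes the output more random."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - np.max(scaled))     # softmax with the max subtracted
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                            # toy logits over a 3-word vocabulary
print(sample_next_token(logits, temperature=0.5))   # more confident / conservative
print(sample_next_token(logits, temperature=1.5))   # more random (and more mistakes)
# Greedy decoding would simply be np.argmax(logits).
```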

Beam search decoding

A broader, more exploratory decoding alternative

Selects multiple options for the best output based on conditional probability

The number of options depends on a predetermined beam width parameter B

Keeps the B best alternatives at each time step

e.g.

image-20220227181034808
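A bare-bones beam search sketch. `next_log_probs` is a hypothetical function that returns log-probabilities over the vocabulary given a prefix, and the token ids for <SOS>/<EOS> are assumptions:

```python
import numpy as np

def beam_search(next_log_probs, beam_width, max_len, sos_id=0, eos_id=1):
    beams = [([sos_id], 0.0)]                       # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished sequences carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = next_log_probs(seq)
            for tok in np.argsort(log_probs)[-beam_width:]:   # expand with the best next tokens
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep only the B best alternatives at each time step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```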

Problems with beam search

Since the model learns a distribution, that tends to carry more weight than single tokens

Can cause translation problems, e.g. in a speech corpus that hasn't been cleaned

image-20220227181205930 image-20220227181309888

Minimum Bayes Risk(MBR)

Compares many samples against one another. To implement MBR:

  • Generate several random samples
  • Compare each sample against all the others and assign a similarity score (such as ROUGE!)
  • Select the sample with the highest similarity: the golden one

Example: MBR Sampling

To generate the scores for 4 samples:

  1. Calculate the similarity score between sample 1 and sample 2
  2. Calculate the similarity score between sample 1 and sample 3
  3. Calculate the similarity score between sample 1 and sample 4
  4. Average the scores of the first 3 steps (usually a weighted average)
  5. Repeat until all samples have overall scores (see the sketch below)
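A small sketch of that procedure; `similarity` can be any pairwise score, for example the ROUGE-1 recall from the sketch earlier in these notes:

```python
def mbr_select(samples, similarity):
    """Return the sample with the highest average similarity to all other samples."""
    best, best_score = None, float("-inf")
    for i, candidate in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        avg = sum(similarity(candidate, o) for o in others) / len(others)
        if avg > best_score:                 # the "golden one" so far
            best, best_score = candidate, avg
    return best

# e.g. mbr_select(translations, lambda a, b: rouge1(a, b)[0])
```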

Summary

  • Beam search uses conditional probabilities and the beam width parameter B
  • MBR (Minimum Bayes Risk) takes several samples and compares them against each other to find the golden one
  • Go forth to the coding assignment!

Transformer

Outline

  • Issues with RNNs
  • Comparison with Transformers

Transformer vs. RNN

Neural Machine Translation

image-20220227184039191

Seq2Seq Architectures

image-20220227184243572

vs.

Transformers are based on attention; no sequential computation is needed within a layer, and gradients only need to propagate a single step, so there is no vanishing-gradient problem.

image-20220227184618458

Multi-headed attention

self-attention

image-20220227184736706

In self-attention the queries, keys, and values all come from the same input; the Q, K, and V for each input each pass through their own dense layer.

Running a set of these self-attention computations in parallel is what gives the model its heads.

Multi-headed attention

image-20220227185425947

Positional Encoding

Transformers also include a positional-encoding stage that encodes each input's position in the sequence, because word order and position matter in every language.

image-20220227190409588
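A sketch of the standard sinusoidal positional encoding from "Attention Is All You Need"; the maximum length and model dimension below are arbitrary:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Even dimensions use sine, odd dimensions use cosine, at geometrically
    decreasing frequencies, so every position gets a unique pattern."""
    positions = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                                              # added to the word embeddings

print(positional_encoding(50, 16).shape)   # (50, 16)
```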

Summary

  • In RNNs parallel computing is difficult to implement
  • For long sequences in RNNs there is loss of information
  • In RNNs there is the problem of vanishing gradient
  • Transformers help with all of the above

Application

Outline

  • Transformers applications in NLP
  • Some Transformers
  • Introduction to T5

Transformers applications in NLP

image-20220227190827697

Some Transformers

image-20220227190950138

T5: Text-To-Text Transfer Transformer

image-20220227191724803

There is no need to train a separate model for each task; a single model can perform all of these tasks.

image-20220227191956334

Summary

  • Transformers are suitable for a wide range of NLP applications
  • GPT-2, BERT, and T5 are cutting-edge Transformers
  • T5 is a powerful multi-task transformer

Dot-product attention

image-20220227212947204

Queries, Keys and Values

image-20220227213528792

Concept of attention

image-20220227213704719

Attention math

image-20220227214020749

image-20220227214600928

Summary

  • Dot-product attention is essential for the Transformer
  • The inputs to attention are queries, keys, and values
  • A softmax function makes attention focus more on the best keys
  • GPUs and TPUs are advisable for the matrix multiplications

Causal Attention

Outline

  • Ways of Attention
  • Overview of Causal Attention
  • Math behind causal attention

Three ways of attention

image-20220227215942918

Causal attention

  • Queries and keys are words from the same sentence
  • Queries should only be allowed to look at words that come before them (see the masking sketch below)
image-20220227220105873
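A minimal NumPy sketch of that masking, assuming queries, keys, and values from the same sentence (so the score matrix is square); the large negative constant is a common trick to zero out future positions after the softmax:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Dot-product attention where position i may only attend to positions <= i."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot-product scores
    mask = np.triu(np.ones_like(scores), k=1)      # 1s above the diagonal = future words
    scores = scores + mask * -1e9                  # future positions get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.random.randn(5, 8)                          # 5 words from one sentence
print(causal_attention(X, X, X).shape)             # (5, 8)
```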

Math

image-20220227220215369 image-20220227220416623

Summary

  • There are three main ways of attention: encoder/decoder, causal, and bi-directional
  • In causal attention, queries and keys come from the same sentence, and queries only look at earlier words

Multi-head attention

  • Each head uses different linear transformations to represent words

  • Different heads can learn different relationships between words

    image-20220227220742152
image-20220227220911025

Concatenation

image-20220227221134099

Math

image-20220227221416675 image-20220227221501459
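Putting the pieces together, here is a hedged NumPy sketch of multi-head attention: each head uses its own slice of the projections, attends with scaled dot-product attention, and the head outputs are concatenated and projected. The random weight matrices stand in for learned parameters:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # linear transformations of the words
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of the projections
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo    # concatenation, then a final dense layer

seq_len, d_model, n_heads = 5, 8, 2
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)   # (5, 8)
```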

Summary

  • Different heads can learn different relationships between words
  • Scaled dot-product attention is adequate for multi-head attention
  • Multi-headed models attend to information from different representations at different positions

Transformer decoder

image-20220227222305258 image-20220227222443521 image-20220227222844956 image-20220227223427538
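A rough sketch of one decoder block built from the pieces above: causal self-attention followed by a position-wise feed-forward block, each wrapped in a residual connection. Layer normalization and dropout are omitted for brevity, and `causal_attention` refers to the earlier sketch:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: dense -> ReLU -> dense.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def decoder_block(x, Wq, Wk, Wv, ffn_params):
    x = x + causal_attention(x @ Wq, x @ Wk, x @ Wv)   # residual around self-attention
    x = x + feed_forward(x, *ffn_params)               # residual around the feed-forward block
    return x

# The full decoder stacks embeddings + positional encoding, N such blocks,
# and a final dense layer with a log-softmax over the vocabulary.
```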

Summary

  • The Transformer decoder mainly consists of three layers
  • Decoder blocks and feed-forward blocks are the core of this model
  • It also includes a module to calculate the cross-entropy loss

Transformer summarizer

Outline

  • Overview of Transformer summarizer
  • Technical details for data processing
  • Inference with a Language Model

Overview

image-20220227225132762

Data processing

Token 1 is the <EOS> marker

image-20220227225625387

Loss weights focus the loss on the summary

Cost function

image-20220227225716270
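A toy sketch of such a weighted cross-entropy: article positions get weight 0 and summary positions weight 1, so only the summary tokens contribute to the loss. All the numbers below are illustrative:

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    # log_probs: (seq_len, vocab), targets: (seq_len,), weights: (seq_len,)
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return np.sum(token_losses * weights) / np.sum(weights)

log_probs = np.log(np.full((4, 5), 0.2))      # uniform toy predictions over 5 words
targets = np.array([2, 3, 1, 4])
weights = np.array([0.0, 0.0, 1.0, 1.0])      # first two tokens: article; last two: summary
print(weighted_cross_entropy(log_probs, targets, weights))   # ≈ 1.609
```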

Inference with a Language Model

image-20220227225944252

Summary

  • For summarization, a weighted loss function is optimized
  • The Transformer decoder summarizes by predicting the next word
  • The Transformer uses tokenized versions of the input