[Andrew Ng Team's NLP Specialization, Course 4] Attention Models in NLP

Neural Machine Translation

Outline

  • Introduction to Neural Machine Translation
  • Seq2Seq model and its shortcomings
  • Solution for the information bottleneck

Seq2Seq model

  • Introduced by Google in 2014
  • Maps variable-length sequences to fixed-length memory
  • LSTMs and GRUs are typically used to overcome the
    vanishing gradient problem
image-20220226142611226

The information bottleneck

image-20220226142747938

Because the encoder's hidden state has a fixed size, long sequences become a bottleneck on their way into the decoder, so the model performs well on short sequences but poorly on long ones.

Seq2Seq shortcomings

  • Variable-length sentences + fixed-length memory =

    image-20220226143029522
  • As sequence size increases, model performance decreases

One vector per word

image-20220226143725584

Each word carries its own vector instead of all the content being compressed into one big vector.

However, this model still has clear drawbacks in terms of both memory and context. How do we build a model that is efficient in time and memory and can still predict accurately from long sequences?

Solution: focus attention in the right place

  • Prevent sequence overload by giving the model a way to focus on the likeliest words at each step
  • Do this by providing information specific to each input word
image-20220226144711270

Alignment

Motivation for alignment

We want each input word to align with its corresponding output word.

Correctly aligned words are the goal:

  • Translating from one language to another

  • Word sense discovery and disambiguation

    "bank" could mean a financial institution or a riverbank; translating the word into another language and seeing how it is rendered there helps pin down which sense is meant.

  • Achieve alignment with a system for retrieving information step by
    step and scoring it

Word alignment
image-20220226145711631

The English translation has more words than the German source, so alignment has to identify the relationships between the words.

Which word to pay more attention to?

To align correctly, the model needs an additional layer that learns which inputs are more important for each prediction.

image-20220226150116625
Give some inputs more weight!
image-20220226150219599

Thicker lines indicate greater influence.

Calculating alignment
image-20220226150450672

Attention

Outline

  • Concept of attention for information retrieval
  • Keys, Queries, and Values

Information retrieval

Information retrieval analogy: suppose you are looking for your keys.
You ask your mom to help you find them.
She weighs the likelihood of each place the keys usually end up and tells you the most likely spot.
That is what attention does: it uses your query to look in the right place and retrieve the keys.

Inside the attention layer

image-20220227145235066

Attention

Queries and keys form a single score matrix,

with the words of the queries (Q) as the rows and the words of the keys (K) as the columns; weights over the values (V) are assigned based on the closeness of each match.

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T)\,V \]

image-20220227145648724
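As a concrete illustration of the formula above, here is a minimal NumPy sketch of dot-product attention (the shapes and toy inputs are made up for the example; real systems batch this and usually scale the scores):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    scores = Q @ K.T                     # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # weighted sum of the values

# Toy example: 2 query words, 3 key/value words, embedding size 4.
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(attention(Q, K, V).shape)  # (2, 4)
```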

Neural machine translation with attention

image-20220227150320584

Flexible attention

For languages with different grammar structures, attention still finds the correct corresponding tokens between them

image-20220227150504369

Summary

  • Attention is an added layer that lets a model focus on what's important
  • Queries, Values, and Keys are used for information retrieval inside the Attention layer
  • This flexible system finds matches even between languages with very different grammatical structures

Setup for machine translation

Data in machine translation

image-20220227151240949

Machine translation setup

State-of-the-art systems use pre-trained vectors

Otherwise, represent words with one-hot vectors to create the input

Keep track of index mappings with word2ind and ind2word dictionaries

Use start-of-sequence and end-of-sequence tokens:

image-20220227151615157

Preparing to translate to German

Sequences end with token 1 (the end-of-sequence marker) followed by many 0s: zeros pad each sequence so the paired sequences have equal length.

image-20220227152048197 image-20220227155917629
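A minimal sketch of this setup, assuming the 0 = padding and 1 = <EOS> convention above; the tiny vocabulary and the `encode` helper are illustrative, not the course's actual tokenizer:

```python
PAD, EOS = 0, 1   # assumed ids: 0 pads, 1 marks end of sequence

sentences = ["I am hungry", "How are you today"]
vocab = sorted({w for s in sentences for w in s.split()})
word2ind = {w: i + 2 for i, w in enumerate(vocab)}   # reserve ids 0 and 1
ind2word = {i: w for w, i in word2ind.items()}

def encode(sentence, length):
    # Map words to ids, append <EOS>, then right-pad with zeros.
    ids = [word2ind[w] for w in sentence.split()] + [EOS]
    return ids + [PAD] * (length - len(ids))

max_len = max(len(s.split()) for s in sentences) + 1   # +1 for <EOS>
batch = [encode(s, max_len) for s in sentences]
print(batch)   # id sequences ending in 1 and padded with 0s
```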

Training a NMT attention

  • Teacher forcing
  • Model for NMT with attention

How to know predictions are correct?

Teacher forcing allows the model to "check its work" at each step

Or, put differently, to compare its prediction against the real output during training

Result: faster, more accurate training

Teacher forcing: motivation

image-20220227162659669

An error at this step makes the prediction at the next step even worse, which is why each step's prediction needs to be checked.

image-20220227163559809

The predicted token (green rectangle) is not used to predict the next one;

instead, the actual target output is fed to the decoder as its input.
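The idea can be sketched as a small decoding loop; `decoder_step` is a hypothetical function that takes the previous token and a hidden state and returns logits plus the new state:

```python
def decode_with_teacher_forcing(decoder_step, state, target_ids):
    """Feed the *ground-truth* previous token to the decoder at every step,
    rather than the token the model itself predicted at the previous step."""
    logits_per_step = []
    prev_token = target_ids[0]          # start-of-sequence token
    for true_token in target_ids[1:]:
        logits, state = decoder_step(prev_token, state)
        logits_per_step.append(logits)
        prev_token = true_token         # teacher forcing: use the real target token
        # (without teacher forcing we would set prev_token = argmax(logits))
    return logits_per_step
```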

Training NMT

image-20220227165135354

The pre-attention decoder transforms the prediction targets into a different vector space, namely the query vectors.

The pre-attention decoder's target tokens are shifted right by one position, and the start-of-sentence token marks the beginning of each sequence.

image-20220227165509074

The four layers on the right form the decoder.

Evaluation for machine translation

BLEU Score

Stands for Bilingual Evaluation Understudy

Evaluates the quality of machine-translated text by comparing a "candidate" text to one or more "reference" translations.
Scores: the closer to 1, the better, and vice versa:

image-20220227170918414

e.g.

image-20220227171015419

How many words in the candidate column appear in the reference translations?

"I"appears at most once in both,so clip to one:

\[m_w=1\\ \]

\[\frac{(Sum\ over\ unique\ n-gram\ counts\ in\ the\ candidate) }{(total\ \#\ of\ words\ in\ candidate)} \]

image-20220227171515177
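A small sketch of the clipped unigram count described above. Full BLEU also combines higher-order n-grams and a brevity penalty, which this toy version leaves out; the candidate and references are made-up examples:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    cand_counts = Counter(candidate.lower().split())
    clipped = 0
    for word, count in cand_counts.items():
        # Clip each word's count by the most times it appears in any one reference.
        max_in_refs = max(Counter(ref.lower().split())[word] for ref in references)
        clipped += min(count, max_in_refs)
    # (sum over unique unigram counts) / (total number of words in the candidate)
    return clipped / sum(cand_counts.values())

candidate = "I I am am on the mat"
references = ["I am on the mat", "the cat is on the mat"]
print(clipped_unigram_precision(candidate, references))  # 5/7 ≈ 0.71
```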

BLEU score is great, but..

Consider the following:

  • BLEU doesn't consider semantic meaning
  • BLEU doesn't consider sentence structure:
    "Ate I was hungry because!"

ROUGE

Recall-Oriented Understudy for Gisting Evaluation
Evaluates quality of machine text
Measures precision and recall between generated text and human-created text

An example using unigrams:

The reference is the ideal output we would like the model to produce

image-20220227172624907

Recall = How much of the reference text is the system text capturing?
Precision = How much of the model text was relevant?

image-20220227172757597 image-20220227172918201
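A minimal ROUGE-1 sketch matching the recall and precision definitions above (real ROUGE implementations also handle longer n-grams, stemming, and F-scores; the example sentences are made up):

```python
from collections import Counter

def rouge1(system, reference):
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Overlap = unigrams the system shares with the reference, with clipped counts.
    overlap = sum(min(c, ref_counts[w]) for w, c in sys_counts.items())
    recall = overlap / sum(ref_counts.values())      # share of the reference captured
    precision = overlap / sum(sys_counts.values())   # share of the system text that is relevant
    return recall, precision

print(rouge1("the cat sat on the mat", "the cat is on the mat"))  # (0.833..., 0.833...)
```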

Problems in ROUGE

  • Doesn't take themes or concepts into consideration (i.e., a low ROUGE score doesn't necessarily mean the translation is bad)

    image-20220227173030809

Summary

  • The BLEU score compares a "candidate" against "references" using an n-gram average
  • BLEU doesn't consider meaning or structure
  • ROUGE measures machine-generated text against an "ideal" reference

Sampling and decoding

Outline

  • Random sampling
  • Temperature in sampling
  • Greedy decoding
  • Beam search
  • Minimum Bayes risk (MBR)

Greedy decoding

Selects the most probable word at each step. But the best word at each step may not be the best for longer sequences...

image-20220227174332089

Random sampling

image-20220227174520407

Often a little too random for accurate translation!
Solution: Assign more weight to more probable words, and less weight to less probable words.

Temperature

In sampling, temperature is a parameter allowing for more or less randomness in predictions

Lower temperature setting = more confident, conservative network

Higher temperature setting = more excited, random network (and more mistakes)

image-20220227174817249
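A sketch of temperature sampling over a toy logit vector; the logits and the temperature values are illustrative:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    """Sample a token id from the logits. A temperature close to 0 approaches
    greedy decoding (always the most probable word); a temperature above 1
    flattens the distribution and makes the output more random."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - np.max(scaled))     # softmax with the max subtracted
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]                            # toy logits over a 3-word vocabulary
print(sample_next_token(logits, temperature=0.5))   # more confident / conservative
print(sample_next_token(logits, temperature=1.5))   # more random (and more mistakes)
# Greedy decoding would simply be np.argmax(logits).
```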

Beam search decoding

A broader, more exploratory decoding alternative

Selects multiple options for the best output based on conditional probability

The number of options depends on a predetermined beam width parameter B

Keeps the B best alternatives at each time step

e.g.

image-20220227181034808
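A bare-bones beam search sketch. `next_log_probs` is a hypothetical function that returns log-probabilities over the vocabulary given a prefix, and the token ids for <SOS>/<EOS> are assumptions:

```python
import numpy as np

def beam_search(next_log_probs, beam_width, max_len, sos_id=0, eos_id=1):
    beams = [([sos_id], 0.0)]                       # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished sequences carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = next_log_probs(seq)
            for tok in np.argsort(log_probs)[-beam_width:]:   # expand with the best next tokens
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep only the B best alternatives at each time step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```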

Problems with beam search

Since the model learns a distribution, that tends to carry more weight than single tokens

Can cause translation problems, e.g. in a speech corpus that hasn't been cleaned

image-20220227181205930 image-20220227181309888

Minimum Bayes Risk(MBR)

Compares many samples against one another. To implement MBR:

  • Generate several random samples
  • Compare each sample against all the others and assign a similarity score (such as ROUGE!)
  • Select the sample with the highest similarity: the golden one

Example: MBR Sampling

To generate the scores for 4 samples:

  1. Calculate the similarity score between sample 1 and sample 2
  2. Calculate the similarity score between sample 1 and sample 3
  3. Calculate the similarity score between sample 1 and sample 4
  4. Average the scores of the first 3 steps (usually a weighted average)
  5. Repeat until all samples have overall scores (see the sketch below)
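A small sketch of that procedure; `similarity` can be any pairwise score, for example the ROUGE-1 recall from the sketch earlier in these notes:

```python
def mbr_select(samples, similarity):
    """Return the sample with the highest average similarity to all other samples."""
    best, best_score = None, float("-inf")
    for i, candidate in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        avg = sum(similarity(candidate, o) for o in others) / len(others)
        if avg > best_score:                 # the "golden one" so far
            best, best_score = candidate, avg
    return best

# e.g. mbr_select(translations, lambda a, b: rouge1(a, b)[0])
```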

Summary

  • Beam search uses conditional probabilities and the beam width parameter B
  • MBR (Minimum Bayes Risk) takes several samples and compares them against each other to find the golden one
  • Go forth to the coding assignment!

Transformer

Outline

  • Issues with RNNs
  • Comparison with Transformers

Transformer vs. RNN

Neural Machine Translation

image-20220227184039191

Seq2Seq Architectures

image-20220227184243572

vs.

Transformers are based on attention; no sequential computation is needed within a layer, and gradients only need to propagate a single step, so there is no vanishing-gradient problem.

image-20220227184618458

Multi-headed attention

self-attention

image-20220227184736706

In self-attention the queries, keys, and values all come from the same input; the Q, K, and V for each input each pass through their own dense layer.

Running a set of these self-attention computations in parallel is what gives the model its heads.

Multi-headed attention

image-20220227185425947

Positional Encoding

Transformers also include a positional-encoding stage that encodes each input's position in the sequence, because word order and position matter in every language.

image-20220227190409588
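A sketch of the standard sinusoidal positional encoding from "Attention Is All You Need"; the maximum length and model dimension below are arbitrary:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Even dimensions use sine, odd dimensions use cosine, at geometrically
    decreasing frequencies, so every position gets a unique pattern."""
    positions = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                                              # added to the word embeddings

print(positional_encoding(50, 16).shape)   # (50, 16)
```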

Summary

  • In RNNs parallel computing is difficult to implement
  • For long sequences in RNNs there is loss of information
  • In RNNs there is the problem of vanishing gradient
  • Transformers help with all of the above

Application

Outline

  • Transformers applications in NLP
  • Some Transformers
  • Introduction to T5

Transformers applications in NLP

image-20220227190827697

Some Transformers

image-20220227190950138

T5: Text-To-Text Transfer Transformer

image-20220227191724803

There is no need to train a separate model for each task; a single model can perform all of these tasks.

image-20220227191956334

Summary

  • Transformers are suitable for a wide range of NLP applications
  • GPT-2, BERT, and T5 are cutting-edge Transformers
  • T5 is a powerful multi-task transformer

Dot-product attention

image-20220227212947204

Queries, Keys and Values

image-20220227213528792

Concept of attention

image-20220227213704719

Attention math

image-20220227214020749

image-20220227214600928

Summary

  • Dot-product attention is essential for the Transformer
  • The inputs to attention are queries, keys, and values
  • A softmax function makes attention focus more on the best keys
  • GPUs and TPUs are advisable for the matrix multiplications

Causal Attention

Outline

  • Ways of Attention
  • Overview of Causal Attention
  • Math behind causal attention

Three ways of attention

image-20220227215942918

Causal attention

  • Queries and keys are words from the same sentence
  • Queries should only be allowed to look at words that come before them (see the masking sketch below)
image-20220227220105873
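A minimal NumPy sketch of that masking, assuming queries, keys, and values from the same sentence (so the score matrix is square); the large negative constant is a common trick to zero out future positions after the softmax:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Dot-product attention where position i may only attend to positions <= i."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot-product scores
    mask = np.triu(np.ones_like(scores), k=1)      # 1s above the diagonal = future words
    scores = scores + mask * -1e9                  # future positions get ~zero attention weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.random.randn(5, 8)                          # 5 words from one sentence
print(causal_attention(X, X, X).shape)             # (5, 8)
```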

Math

image-20220227220215369 image-20220227220416623

Summary

  • There are three main ways of attention: encoder/decoder, causal, and bi-directional
  • In causal attention, queries and keys come from the same sentence, and queries only look at earlier words

Multi-head attention

  • Each head uses different linear transformations to represent words

  • Different heads can learn different relationships between words

    image-20220227220742152
image-20220227220911025

Concatenation

image-20220227221134099

Math

image-20220227221416675 image-20220227221501459
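Putting the pieces together, here is a hedged NumPy sketch of multi-head attention: each head uses its own slice of the projections, attends with scaled dot-product attention, and the head outputs are concatenated and projected. The random weight matrices stand in for learned parameters:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # linear transformations of the words
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice of the projections
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo    # concatenation, then a final dense layer

seq_len, d_model, n_heads = 5, 8, 2
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)   # (5, 8)
```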

Summary

  • Different heads can learn different relationships between words
  • Scaled dot-product attention is adequate for multi-head attention
  • Multi-headed models attend to information from different representations at different positions

Transformer decoder

image-20220227222305258 image-20220227222443521 image-20220227222844956 image-20220227223427538
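A rough sketch of one decoder block built from the pieces above: causal self-attention followed by a position-wise feed-forward block, each wrapped in a residual connection. Layer normalization and dropout are omitted for brevity, and `causal_attention` refers to the earlier sketch:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: dense -> ReLU -> dense.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def decoder_block(x, Wq, Wk, Wv, ffn_params):
    x = x + causal_attention(x @ Wq, x @ Wk, x @ Wv)   # residual around self-attention
    x = x + feed_forward(x, *ffn_params)               # residual around the feed-forward block
    return x

# The full decoder stacks embeddings + positional encoding, N such blocks,
# and a final dense layer with a log-softmax over the vocabulary.
```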

Summary

  • The Transformer decoder mainly consists of three layers
  • Decoder blocks and feed-forward blocks are the core of this model
  • It also includes a module to calculate the cross-entropy loss

Transformer summarizer

Outline

  • Overview of Transformer summarizer
  • Technical details for data processing
  • Inference with a Language Model

Overview

image-20220227225132762

Data processing

Token 1 is the <EOS> marker

image-20220227225625387

Loss weights focus the loss on the summary

Cost function

image-20220227225716270
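A toy sketch of such a weighted cross-entropy: article positions get weight 0 and summary positions weight 1, so only the summary tokens contribute to the loss. All the numbers below are illustrative:

```python
import numpy as np

def weighted_cross_entropy(log_probs, targets, weights):
    # log_probs: (seq_len, vocab), targets: (seq_len,), weights: (seq_len,)
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return np.sum(token_losses * weights) / np.sum(weights)

log_probs = np.log(np.full((4, 5), 0.2))      # uniform toy predictions over 5 words
targets = np.array([2, 3, 1, 4])
weights = np.array([0.0, 0.0, 1.0, 1.0])      # first two tokens: article; last two: summary
print(weighted_cross_entropy(log_probs, targets, weights))   # ≈ 1.609
```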

Inference with a Language Model

image-20220227225944252

Summary

  • For summarization, a weighted loss function is optimized
  • The Transformer decoder summarizes by predicting the next word
  • The Transformer uses tokenized versions of the input