[Andrew Ng Team NLP Course 3, Part 2] LSTM, NER, Siamese Network
LSTM
Outline
- RNNs and vanishing/exploding gradients
- Solutions
RNNs
- Advantages
  - Captures dependencies within a short range
  - Takes up less RAM than other n-gram models
- Disadvantages
  - Struggles with longer sequences
  - Prone to vanishing or exploding gradients
Solving for vanishing or exploding gradients
- Identity RNN with ReLU activation: e.g. -1 -> 0; the ReLU replaces negative values with 0, keeping things close to the identity matrix
  \[\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ \end{bmatrix} \]
- Gradient clipping: e.g. 32 -> 25; any value larger than 25 is clipped to 25, limiting the magnitude of the gradient (a small sketch follows this list)
- Skip connections
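A minimal NumPy sketch of element-wise gradient clipping; the threshold of 25 only mirrors the example above and is not a recommended value (clipping by norm is another common variant):

```python
import numpy as np

def clip_gradient(grad, threshold=25.0):
    """Element-wise clipping: any entry outside [-threshold, threshold] is clipped."""
    return np.clip(grad, -threshold, threshold)

grad = np.array([[32.0, -3.0], [7.0, -41.0]])
print(clip_gradient(grad))  # 32 -> 25 and -41 -> -25; the other entries are unchanged
```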
Introduction
Outline
- Meet the Long short-term memory unit!
- LSTM architecture
- Applications
LSTMs: a memorable solution
- Learns when to remember and when to forget
- Basic anatomy:
  - A cell state
  - A hidden state with three gates
  - Loops back again at the end of each time step
- Gates allow gradients to flow unchanged
LSTMs: Based on previous understanding
Cell state = before the conversation
Before the friend calls, your thoughts have nothing to do with them.
Forget gate = beginning of the conversation
When you pick up, you set unrelated thoughts aside and keep whatever you want to hold on to.
Input gate = thinking of a response
While the call is going on, you take in new information from your friend while thinking about what to say next.
Output gate = responding
When you decide what to say next.
Updated cell state = after the conversation
This repeats until you hang up; your memory has been updated several times.
LSTM Basic Structure
![image-20220225164035498](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001230751-1941422098.png)
Applications of LSTMs
![image-20220225164428453](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001231037-611024460.png)
Summary
- LSTMs offer a solution to vanishing gradients
- Typical LSTMs have a cell and three gates:
- Forget gate
- Input gate
- Output gate
LSTM architecture
Cell State, Hidden State
- Cell State: acts as the memory
- Hidden State: what the predictions are made from
![image-20220225175647632](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001231321-684026112.png)
The Forget Gate
Decides what to keep and what to throw away. The sigmoid function squashes values to between 0 and 1: values close to 0 should be discarded, values close to 1 are kept.
![image-20220225180326198](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001231534-2075847235.png)
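In the usual LSTM notation (which may differ slightly from the course's symbols), the forget gate applies a sigmoid to the previous hidden state concatenated with the current input:

\[ f_t = \sigma\left(W_f\,[h_{t-1}; x_t] + b_f\right) \]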
The Input Gate
Updates the state using two layers: a sigmoid layer and a tanh layer.
- sigmoid: takes the previous hidden state and the current input and chooses which values to update; values are squashed to between 0 and 1, and the closer to 1, the more important
- tanh: also takes the previous hidden state and the current input; values are squashed to between -1 and 1, which helps regulate the flow of information through the network
Finally the two outputs are multiplied together to give the gate's output.
![image-20220225180915665](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001231763-1251508631.png)
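In the same notation, the sigmoid layer, the tanh candidate, and their element-wise product are:

\[ i_t = \sigma\left(W_i\,[h_{t-1}; x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C\,[h_{t-1}; x_t] + b_C\right), \qquad i_t \odot \tilde{C}_t \]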
Calculating the Cell State
![image-20220225181224007](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001231996-558716148.png)
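Written out, the update shown in the figure combines the forget and input gates, where \(\odot\) denotes element-wise multiplication:

\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]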
The Output Gate
Decides what the next hidden state will be.
The previous hidden state and the current input go through a sigmoid layer.
The freshly updated cell state goes through a tanh layer.
The two results are then multiplied to give the new hidden state h.
![image-20220225181854987](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001232225-1546902173.png)
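In formulas, the output gate and the resulting hidden state are:

\[ o_t = \sigma\left(W_o\,[h_{t-1}; x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t) \]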
Summary
- LSTMs use a series of gates to decide which information to keep:
  - Forget gate decides what to keep
  - Input gate decides what to add
  - Output gate decides what the next hidden state will be
- One time step is completed after updating the states
Named Entity Recognition (NER)
Introduction
What is Named Entity Recognition?
- Locates and extracts predefined entities from text
- Places, organizations, names, time and dates
Types of Entities
![image-20220225184529869](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001232477-1743010251.png)
![image-20220225184609571](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001232861-1727948881.png)
Example of a labeled sentence
![image-20220225185115648](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001233114-561285267.png)
Applications of NER systems
- Search engine efficiency
- Recommendation engines
- Customer service
- Automatic trading
Training NERs
Data Processing
Outline
- Convert words and entity classes into arrays
- Token padding
- Create a data generator
Processing data for NERs
- Assign each class a number: every entity class gets a unique integer
- Assign each word a number: every word gets a unique integer, and each word is labeled with the number of its entity class
Token padding
For LSTMs, all sequences need to be the same size.
- Set sequence length to a certain number
- Use the <PAD> token to fill empty spaces (see the sketch after this list)
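A minimal Python sketch of the two preprocessing steps and the padding described above; the vocabulary, tag map, and sentence are made-up placeholders, not the assignment's actual data:

```python
# Hypothetical vocabulary and tag map; in practice they are built from the training corpus.
vocab = {"<PAD>": 0, "Sharon": 1, "flew": 2, "to": 3, "Miami": 4}
tag_map = {"O": 0, "B-per": 1, "B-geo": 2}
PAD_ID = vocab["<PAD>"]

def encode_and_pad(tokens, tags, seq_len):
    """Map words and entity tags to integers, then pad both sequences to seq_len."""
    x = [vocab[w] for w in tokens] + [PAD_ID] * (seq_len - len(tokens))
    y = [tag_map[t] for t in tags] + [PAD_ID] * (seq_len - len(tags))
    return x, y

x, y = encode_and_pad(["Sharon", "flew", "to", "Miami"],
                      ["B-per", "O", "O", "B-geo"], seq_len=6)
# x = [1, 2, 3, 4, 0, 0]; the padded positions are masked out later when scoring.
```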
Training the NER
- Create a tensor for each input and its corresponding number
- Put them in a batch (typical sizes: 64, 128, 256, 512, ...)
- Feed it into an LSTM unit
- Run the output through a dense layer
- Predict using a log softmax over K classes
![image-20220225194909968](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001233579-1464269947.png)
Layers in Trax
from trax import layers as tl

# vocab_size, d_model and n_classes are placeholders; set them from your data
model = tl.Serial(
    tl.Embedding(vocab_size, d_model),
    tl.LSTM(d_model),
    tl.Dense(n_classes),
    tl.LogSoftmax()
)
Summary
- Convert words and entities into same-length numerical arrays
- Train in batches for faster processing
- Run the output through a final layer and activation
Computing Accuracy
Evaluating the model
- Pass test set through the model
- Get arg max across the prediction array
- Mask padded tokens
- Compare outputs against test labels
![image-20220225200349881](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001233793-637085867.png)
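A minimal NumPy sketch of the evaluation steps listed above (the shapes and the pad id of 0 are assumptions for illustration):

```python
import numpy as np

def masked_accuracy(log_probs, labels, token_ids, pad_id=0):
    """log_probs: (batch, seq_len, n_classes); labels, token_ids: (batch, seq_len)."""
    preds = np.argmax(log_probs, axis=-1)  # arg max across the prediction array
    mask = (token_ids != pad_id)           # ignore <PAD> positions
    correct = (preds == labels) & mask
    return correct.sum() / mask.sum()
```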
Summary
- If you padded tokens, remember to mask them when computing accuracy
- Coding assignment!
Siamese Networks
Introduction
A network built from two identical subnetworks whose outputs are merged at the end.
Question Duplicates
Compares the meaning of two word sequences, not just the words themselves.
What do Siamese Networks learn?
![image-20220225212946951](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001234236-400124015.png)
Siamese Networks in NLP
![image-20220225213043984](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001234609-699059699.png)
Architecture
![image-20220225213520499](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001234847-1540117408.png)
![image-20220225213624840](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001235084-313403015.png)
Cost function
Loss function
The question "How old are you?" is used as the anchor against which other questions are compared.
A question with a meaning similar to the anchor is a positive question; one without is a negative question.
For similar questions the similarity is close to 1; for dissimilar questions it is close to -1.
![image-20220225214348598](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001235315-137429650.png)
![image-20220225214437628](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001235560-1032688190.png)
Triplets
Triplets
![image-20220225214617386](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001235791-552153947.png)
If the model gets a positive loss value, it updates its weights and improves. If it gets a negative loss value, that is like telling the model it did well and reinforcing its weights, so we never feed it a negative value: whenever the loss is < 0 we use 0 instead.
![image-20220225215157742](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001236018-1580941983.png)
Simple loss: diff = s(A, N) - s(A, P)
Non-linearity: loss = max(diff, 0)
Triplet Loss
The margin shifts the function to the left. For example, with alpha = 0.2: if the gap between the similarities is small, say diff = -0.1, adding alpha makes the result > 0, so the model can still learn from the example.
Alpha margin: loss = max(diff + alpha, 0)
![image-20220225221411505](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001236196-1598076061.png)
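A small sketch that puts the three variants together; s_ap = s(A, P) and s_an = s(A, N) are assumed to be precomputed cosine similarities, and alpha = 0.2 mirrors the example above:

```python
def triplet_loss(s_ap, s_an, alpha=0.2):
    """max(0, s(A, N) - s(A, P) + alpha): zero once the positive beats the negative by at least alpha."""
    diff = s_an - s_ap           # simple loss
    return max(diff + alpha, 0)  # non-linearity plus the alpha margin

print(triplet_loss(s_ap=0.9, s_an=0.3))  # 0   -> nothing left to learn
print(triplet_loss(s_ap=0.5, s_an=0.4))  # 0.1 -> a "hard" example still contributes
```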
Triplet Selection
When the model correctly predicts that (A, P) is more similar than (A, N), the loss is 0.
At that point nothing more can be learned from such triplets. Training is more effective if, instead of picking triplets at random, we pick triplets the model gets wrong: so-called hard triplets, where the similarity of (A, N) is very close to, but still less than, the similarity of (A, P).
![image-20220225224024607](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001236414-245588591.png)
Compute the cost
Introduction
![image-20220225225512508](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001236670-167158939.png)
![image-20220225225855011](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001236948-1736879357.png)
d_model is the embedding dimension, which equals the number of columns (5 here); batch_size is the number of rows (4 here).
![image-20220225230450763](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001237365-1554995888.png)
The green diagonal holds the similarities of duplicate questions, which should be larger than the off-diagonal values.
The orange entries are the similarities of non-duplicate questions.
![image-20220225230946228](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001237572-113308433.png)
mean negative: the mean of the off-diagonal values in each row; for the first row, every value except the 0.9 on the diagonal.
closest negative: the off-diagonal value in each row that is closest to (but less than) the value on the diagonal; 0.3 in the first row. The negative example with similarity 0.3 contributes the most to learning.
![image-20220225233726844](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001237775-367915800.png)
mean_neg: training against the mean reduces noise (the noise is assumed to average to about 0, so the mean of several noisy values is usually close to 0).
closest_neg: the smallest gap between the positive cosine similarity and a negative example's cosine similarity; when that gap is small relative to alpha, the loss is larger, and focusing training on the examples that produce higher loss lets the model update its weights faster.
The two losses are then added together.
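A minimal NumPy sketch of this cost; the similarity matrix, batch size, and alpha below are made up for illustration and are not the assignment's values:

```python
import numpy as np

alpha = 0.25                        # margin; chosen only for illustration
sim = np.array([[0.9, 0.3, 0.1],    # hypothetical cosine similarities: row i compares question i
                [0.2, 0.8, 0.4],    # of batch 1 with every question of batch 2
                [0.3, 0.1, 0.7]])
batch_size = sim.shape[0]

sim_ap = np.diag(sim)                                    # duplicates sit on the diagonal
off_diag = sim * (1.0 - np.eye(batch_size))
mean_neg = np.sum(off_diag, axis=1) / (batch_size - 1)   # mean of off-diagonal values per row

# closest negative: largest off-diagonal value that is still below the diagonal value
masked = np.where((np.eye(batch_size) == 1) | (sim >= sim_ap.reshape(-1, 1)), -2.0, sim)
closest_neg = np.max(masked, axis=1)

loss1 = np.maximum(mean_neg - sim_ap + alpha, 0)
loss2 = np.maximum(closest_neg - sim_ap + alpha, 0)
cost = np.sum(loss1 + loss2)        # the two losses are added, then summed over the batch
```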
Hard Negative Mining
One Shot learning
Classification vs One Shot Learning
![image-20220225235608098](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001237986-1446446360.png)
For example, deciding whether a poem was written by Lucas: with classification, adding this set of poems as a new class turns a K-class problem into a K+1-class problem, which requires retraining.
One-shot learning is not classification: the model is trained on Lucas's poems and then computes the similarity between the two classes.
Need for retraining: with one-shot learning there is none; for instance, when a new signature shows up at a bank, the model does not have to be retrained.
A threshold on the similarity decides whether two items belong to the same class.
![image-20220226000112401](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001238220-437694434.png)
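The threshold test itself is only a couple of lines; here is a hedged NumPy sketch in which the threshold tau = 0.7 is just a placeholder to be tuned:

```python
import numpy as np

def is_same_class(v1, v2, tau=0.7):
    """Return True when the cosine similarity of two embedding vectors exceeds the threshold."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos > tau
```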
Training / Testing
Dataset
![image-20220226000310103](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001238516-1072788546.png)
Prepare Batches
Within a batch, no two questions are duplicates of each other, but each element is a duplicate of the element at the same position in the other batch.
![image-20220226000545957](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001238857-1430567200.png)
Siamese Model
The two subnetworks share the same parameters, so only one set of weights is trained.
![image-20220226000803439](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001239127-392303232.png)
Create a subnetwork (sketched below):
- Embedding
- LSTM
- Vectors
- Cosine Similarity
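A hedged Trax sketch of the subnetwork listed above; layer arguments such as vocab_size and d_model are placeholders, and the exact assignment code may differ (for example in how the output vectors are normalized):

```python
from trax import layers as tl
from trax.fastmath import numpy as fastnp

def normalize(x):
    """Scale each vector to unit length so that a dot product equals cosine similarity."""
    return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))

def Siamese(vocab_size, d_model=128):
    branch = tl.Serial(
        tl.Embedding(vocab_size, d_model),  # word ids -> embedding vectors
        tl.LSTM(d_model),                   # LSTM over the question
        tl.Mean(axis=1),                    # average over time steps -> one vector per question
        tl.Fn('Normalize', normalize),      # unit-length output vectors
    )
    return tl.Parallel(branch, branch)      # the same branch twice -> shared weights
```

Passing the same branch object to tl.Parallel twice is what gives the two towers shared weights, matching the note above that only one set of weights is trained.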
Testing
![image-20220226000938596](https://img2022.cnblogs.com/blog/1586717/202202/1586717-20220226001239331-2142988991.png)