(Continuously updated; currently looking for a job)
1. Sequence to Sequence Learning with Neural Networks (2014, Google Research)
However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.
This paper is essentially a foundational work, and the experience described above is generally regarded as correct: reversing the source sentence makes it much easier for backpropagation (run with SGD/Adam, etc.) to "establish communication" between the source and target sentences, which substantially improves overall performance. A minimal sketch of the trick follows.
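A minimal sketch of the source-reversal trick, assuming sentences are already tokenized into lists of strings (the example words are made up); only the source side is reversed, the target stays in order.

```python
def make_training_pair(src_tokens, tgt_tokens):
    """Reverse the source sentence so its first words sit right next to
    the first words of the target, shrinking the minimal time lag."""
    return list(reversed(src_tokens)), list(tgt_tokens)

src = ["the", "cat", "sat"]                 # illustrative source sentence
tgt = ["le", "chat", "s'est", "assis"]      # illustrative target sentence
rev_src, tgt_out = make_training_pair(src, tgt)
print(rev_src)   # ['sat', 'cat', 'the']
```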
A much larger hidden state makes the model better, which is why they chose a deep LSTM rather than a shallow one...
We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to represent a sentence.
Experimental detail: 4 layers × 1000 cells × 2 (hidden state h plus cell state c per layer) = 8000 numbers per sentence, as sketched below.
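A rough sizing check in PyTorch (my own choice of framework, not the paper's); the layer, embedding, and vocabulary sizes come from the quote above, and the only point is that the final encoder state indeed holds 2 × 4 × 1000 = 8000 numbers per sentence.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB = 160_000, 80_000           # input / output vocabularies from the paper
EMB, HID, LAYERS = 1000, 1000, 4

embed = nn.Embedding(SRC_VOCAB, EMB)
encoder = nn.LSTM(EMB, HID, num_layers=LAYERS)   # deep 4-layer LSTM, 1000 cells per layer

src = torch.randint(0, SRC_VOCAB, (12, 1))       # one dummy source sentence: 12 tokens, (seq, batch)
_, (h, c) = encoder(embed(src))                  # final hidden and cell states summarize the sentence

print(h.shape, c.shape)          # torch.Size([4, 1, 1000]) each: 4 layers x 1000 units
print(h.numel() + c.numel())     # 8000 real numbers represent the sentence
```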
- Initialization: all of the LSTM's parameters were drawn from a uniform distribution between -0.08 and 0.08 (the "8/100" above).
- Training: plain SGD with a fixed learning rate of 0.7; after 5 epochs the learning rate was halved every half epoch, for 7.5 epochs in total. Today this schedule would likely be replaced by Adam or a warmup/decay scheme (see the training sketch after this list).
- Gradient clipping: although LSTMs tend not to suffer from the vanishing gradient problem, they can have exploding gradients, so the authors enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold. For each training batch, compute s = ‖g‖₂, where g is the gradient divided by 128 (the batch size); if s > 5, set g = 5g/s.
- A minibatch of 128 randomly chosen training sentences would contain many short sentences and a few long ones, wasting computation on padding, so the authors made sure that all sentences within a minibatch were roughly the same length (length bucketing), which the paper reports as a 2x speedup (see the bucketing sketch after this list).
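A minimal training-loop sketch of those details, assuming a PyTorch `model` whose `model(src, tgt)` call returns per-token logits and a `loader` yielding batches of 128 sentence pairs (both are my assumptions, not the paper's code); the uniform init in [-0.08, 0.08], the fixed 0.7 learning rate with halving after epoch 5, the 7.5-epoch budget, and the norm-5 clipping mirror the bullets above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

def train(model: nn.Module, loader):
    # Initialize every parameter uniformly in [-0.08, 0.08]
    for p in model.parameters():
        nn.init.uniform_(p, -0.08, 0.08)

    opt = torch.optim.SGD(model.parameters(), lr=0.7)    # plain SGD, no momentum
    loss_fn = nn.CrossEntropyLoss()
    steps_per_epoch = len(loader)
    step = 0

    for _ in range(8):                                   # enough passes to cover 7.5 epochs
        for src, tgt in loader:
            epoch = step / steps_per_epoch
            if epoch >= 7.5:                             # stop after 7.5 epochs in total
                return
            if epoch >= 5.0:                             # after 5 epochs, halve lr every half epoch
                halvings = int((epoch - 5.0) * 2) + 1
                for group in opt.param_groups:
                    group["lr"] = 0.7 * 0.5 ** halvings

            opt.zero_grad()
            logits = model(src, tgt)                     # (batch * steps, vocab); shape is assumed
            loss = loss_fn(logits, tgt.reshape(-1))      # mean reduction ~ dividing the gradient by 128
            loss.backward()
            # s = ||g||_2; if s > 5, rescale g to 5g/s (hard constraint on the gradient norm)
            clip_grad_norm_(model.parameters(), max_norm=5.0)
            opt.step()
            step += 1
```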
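And a small sketch of the length-bucketing idea from the last bullet; sorting by source length is just one simple way to make minibatch lengths roughly uniform, and `training_pairs` / `train_step` are placeholders.

```python
from typing import List, Tuple

def length_bucketed_batches(pairs: List[Tuple[List[str], List[str]]],
                            batch_size: int = 128):
    """Yield minibatches in which all sentence pairs have roughly the
    same source length, so little computation is wasted on padding."""
    ordered = sorted(pairs, key=lambda p: len(p[0]))     # sort by source length
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Usage (training_pairs and train_step are placeholders):
# for batch in length_bucketed_batches(training_pairs):
#     train_step(batch)
```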