Link of the Paper: https://arxiv.org/abs/1609.06647
A Related Paper: Show and Tell: A Neural Image Caption Generator (Link of the Paper: https://arxiv.org/abs/1411.4555)
Main Points (Improvements Over the CVPR 2015 Model):
- Image Model Improvement: GoogLeNet (22 layers) -> Batch Normalization model.
- Image Model Fine-Tuning: the image model must be fine-tuned only after the LSTM parameters have settled on a good language model.
- Scheduled Sampling: training moves gradually from a fully guided scheme, which feeds the true previous word, to a less guided scheme that mostly feeds the model-generated word instead.
- Ensembling
- Beam Size Reduction: the best beam size turned out to be small: 3.
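
To make the scheduled sampling and beam size points above more concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the linear decay schedule, and the toy next-word table are all assumptions made for illustration. It shows how the previous word can be chosen at random between the true word and the model's own word, and how a beam search with a small beam (3) keeps only a few candidate captions at each step.

```python
import math
import random


def choose_previous_word(true_word, model_word, p_true, rng):
    """Scheduled sampling: feed the true previous word with probability
    p_true, otherwise feed the word the model itself generated."""
    return true_word if rng.random() < p_true else model_word


def linear_decay(step, total_steps, floor=0.0):
    """Hypothetical schedule (an assumption, not taken from the paper):
    p_true falls linearly from 1.0 down to `floor` over training."""
    return max(floor, 1.0 - step / total_steps)


def beam_search(next_log_probs, start, end, beam_size=3, max_len=20):
    """Keep the `beam_size` best partial captions at every step.

    next_log_probs(prefix) must return a dict that maps each possible
    next token to its log-probability given the prefix.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            # Captions that emit the end token leave the beam.
            (finished if seq[-1] == end else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished captions if max_len was hit
    return max(finished, key=lambda c: c[0])[1]


def toy_model(prefix):
    """Toy stand-in for the LSTM: next-word log-probabilities that
    depend only on the last token (hypothetical numbers)."""
    table = {
        "<s>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a": {"dog": math.log(0.5), "cat": math.log(0.5)},
        "the": {"dog": math.log(0.9), "cat": math.log(0.1)},
        "dog": {"</s>": 0.0},
        "cat": {"</s>": 0.0},
    }
    return table[prefix[-1]]


if __name__ == "__main__":
    print(beam_search(toy_model, "<s>", "</s>", beam_size=1))
    print(beam_search(toy_model, "<s>", "</s>", beam_size=3))
```

In this toy table, greedy decoding (beam size 1) commits to "a" and ends with a caption of probability 0.3, while beam size 3 recovers "the dog" with probability 0.36. During training, `choose_previous_word` would be called at each time step with `p_true` taken from a schedule such as `linear_decay`.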