# Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
Maintained by WANG Yue (yuewang@cse.cuhk.edu.hk). Last update on 2020/03/26.
Source: https://github.com/yuewang-cuhk/awesome-vision-language-pretraining-papers
## Table of Contents

- Image-based VL-PTMs
- Video-based VL-PTMs
- Other Resources
## Image-based VL-PTMs

### Representation Learning

- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]
- VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08 [code]
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
- Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020 [code] (VLP)
- UNITER: Learning Universal Image-text Representations, arXiv 2019/09 [code]
### Task-specific

- VCR: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019 [code] (B2T2)
- TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020 [code] (M4C)
- VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, arXiv 2019/12 [code] (VisDial-BERT)
- VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020 [code] (PREVALENT)
- Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01
- Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03
### Other Analysis

- Multi-task learning: 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]
- Social bias in VL embeddings: Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02 [code]
## Video-based VL-PTMs

- VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019
- Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06 (CBT)
- UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02
## Other Resources
- Two recent surveys on pre-trained language models
  - Pre-trained Models for Natural Language Processing: A Survey, arXiv 2020/03
  - A Survey on Contextual Embeddings, arXiv 2020/03
- Other surveys on multimodal research
  - Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, arXiv 2019
  - Deep Multimodal Representation Learning: A Survey, arXiv 2019
  - Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2018
  - A Comprehensive Survey of Deep Learning for Image Captioning, ACM Computing Surveys 2018
- Other repositories with relevant reading lists