Transformer in Computer Vision


2020-12-03 19:18:25

Survey 1: A Survey on Visual Transformer, Kai Han, et al. [Paper]

Survey 2: Transformers in Vision: A Survey, Salman Khan, et al. [Paper]

[NEW] Survey 3: A Survey of Visual Transformers, Yang Liu, et al. [Paper]

[NEW] Survey 4: Video Transformers: A Survey, Javier Selva, et al. [Paper]

 

 

 

  

1. Attention Is All You Need[J]. NIPS 2017. [Paper] [Code]

 

2. End-to-End Object Detection with Transformers[J]. arXiv preprint arXiv:2005.12872, 2020. [Paper] [Code]  

 

3. RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder, NIPS 2020, [Paper] [Code]

 

4. End-to-End Object Detection with Adaptive Clustering Transformer[J]. arXiv preprint arXiv:2011.09315, 2020. [Paper]

 

5. UP-DETR: Unsupervised Pre-training for Object Detection with Transformers[J]. arXiv preprint arXiv:2011.09094, 2020. [Paper]

 

6. Rethinking Transformer-based Set Prediction for Object Detection[J]. arXiv preprint arXiv:2011.10881, 2020. [Paper]

 

7. Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper] [Code]

 

8. ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis [Paper]

 

9. End-to-end Lane Shape Prediction with Transformers [Paper]

 

10. End-to-End Video Instance Segmentation with Transformers [Paper]  

 

11. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[J]. arXiv preprint arXiv:2010.11929, 2020. [Paper] [Code]

 

12. Pre-Trained Image Processing Transformer [Paper]

 

13. Few-shot Sequence Learning with Transformers, Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc’Aurelio Ranzato, Arthur Szlam [Paper]

 

14. SceneFormer: Indoor Scene Generation with Transformers [Paper]

 

15. PCT: Point Cloud Transformer, Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R. Martin, Shi-Min Hu [Paper] [Code]

 

16. Point Transformer, Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun [Paper] [Code]

 

17. Point Transformer, Nico Engel, Vasileios Belagiannis, and Klaus Dietmayer [Paper] [Code]

 

18. A Generalization of Transformer Networks to Graphs, Vijay Prakash Dwivedi, Xavier Bresson [Paper]

 

19. End-to-End Human Pose and Mesh Reconstruction with Transformers [Paper]  

 

20. Taming Transformers for High-Resolution Image Synthesis [Paper] [Project]

 

21. 3D Object Detection with Pointformer, Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, Gao Huang [Paper]

 

22. Training data-efficient image transformers & distillation through attention, [Paper] [Code]

 

23. TransPose: Towards Explainable Human Pose Estimation by Transformer, Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang, [Paper] [Code]

 

24. TransTrack: Multiple-Object Tracking with Transformer, [Paper] [Code]

 

25. SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks [Paper] [Code]

 

26. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [Paper] [Code]

 

27. TrackFormer: Multi-Object Tracking with Transformers [Paper]

 

28. Trear: Transformer-based RGB-D Egocentric Action Recognition [Paper]

 

29. General Multi-label Image Classification with Transformers [Paper]

 

30. Feature Pyramid Transformer [Paper]

 

31. End-to-end Lane Shape Prediction with Transformers [Paper]

 

32. Bottleneck Transformers for Visual Recognition [Paper]

 

33. DEFT: Detection Embeddings for Tracking, Mohamed Chaabane, Peter Zhang, J. Ross Beveridge, and Stephen O’Hara, [Paper]

 

34. RoI Tanh-polar Transformer Network for Face Parsing in the Wild, Yiming Lin, Jie Shen, Yujiang Wang, Maja Pantic, [Paper]

 

35. An Image is Worth 16x16 Words, What is a Video Worth? [Paper]

 

36. Vision Transformers for Dense Prediction, [Paper]

 

 

 

 

Blogs: 

1. 《How Transformers work in deep learning and NLP: an intuitive introduction》[link]

2. 《Transformers From Scratch》[link]

3. 《A Deep Dive Into the Transformer Architecture – The Development of Transformer Models》[link]

4. 《A Survey on Transformer Models in Machine Learning》[link]

5. 《Deep Learning for Natural Language Processing - YouTube》[link]

6. 《The Illustrated Transformer》[link]

7. 《The Transformer Family》[link]

8. 《The Annotated Transformer》[link]

9. Transformers [github]

 

 

 

Pre-training for Joint Computer Vision and Natural Language: 

 

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

 

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

 

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

 

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

 

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

 

Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

 

UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

 

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

 

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

 

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, arXiv 2020/04, ECCV 2020

 

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

 

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, arXiv 2020/06

 

DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

 

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

 

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

 

Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

 

LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

 

Task-specific

 

VCR: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

 

TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

 

VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

 

VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

 

VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

 

Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

 

Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

 

Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

 

Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission.

 

Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020.

 

Other Analysis

 

Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

 

Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

 

In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining

 

In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

 

Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

 

Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

 

Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

 

Video-based VL-PTMs

 

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

 

Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

 

M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

 

BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

 

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI2020 DSTC8 workshop

 

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

 

ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

 

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

 

Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

 

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

 

Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

 

Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

 

Speech-based VL-PTMs

 

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models, arXiv 2019/06

 

Understanding Semantics from Speech Through Pre-training, arXiv 2019/09

 

SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering, arXiv 2019/10

 

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, arXiv 2019/10

 

Effectiveness of self-supervised pre-training for speech recognition, arXiv 2019/11

 

Other Transformer-based multimodal networks

 

Multi-Modality Cross Attention Network for Image and Sentence Matching, CVPR 2020

 

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

 

History for Visual Dialog: Do we really need it?, ACL 2020

 

Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

 

Other Resources

posted @ 2020-12-03 19:45 AHU-WangXiao