A-Survey-Vision-Transformer-2022

# A Survey on Vision Transformer #paper


1. paper-info

1.1 Metadata

  • Author:: [[Kai Han]], [[Yunhe Wang]], [[Hanting Chen]], [[Xinghao Chen]], [[Jianyuan Guo]], [[Zhenhua Liu]], [[Yehui Tang]], [[An Xiao]], [[Chunjing Xu]], [[Yixing Xu]], [[Zhaohui Yang]], [[Yiman Zhang]], [[Dacheng Tao]]
  • Affiliation::
  • Keywords:: #DeepLearning , #Transformer , #Survey
  • Journal:: [[IEEE Transactions on Pattern Analysis and Machine Intelligence]]
  • Date:: [[2022]]
  • Status:: #Done

1.2 Abstract

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks. In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent networks. Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community. In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing. We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer. Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.
Keywords: Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video

1.3 Introduction

  • The overall trajectory of deep learning: CNN -> RNN -> Transformer
  • Origin of the Transformer: it was first applied in the NLP field, where it achieved major success.
  • The success of the Transformer in CV, which is the focus of this summary.

Inspired by the major success of transformer architectures in the field of NLP, researchers have recently applied transformer to computer vision (CV) tasks.

  • 文章主要内容:

In this paper, we focus on providing a comprehensive overview of the recent advances in vision transformers and discuss the potential directions for further improvement.

  • Categorization (by application scenarios)

Figure 1-1 Representative works of vision transformers
Main categories: `backbone network`, `high/mid-level vision`, `low-level vision`, and `video processing`. The development timeline of the `vision transformer` is shown in Figure 1-2.

Figure 1-2 Timeline of vision transformer development

1.4 Paper Organization


Figure 1-3 Overview of the paper's structure

2. Formulation of Transformer


Figure 2-1 Basic structure of the Transformer
The Transformer contains an `Encoder` and a `Decoder`; each encoder block is composed of a `self-attention layer` and a `feed-forward neural network`, and each decoder block additionally contains an `encoder-decoder attention layer`.
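
As a quick reference, here is a minimal sketch of such an encoder-decoder stack using PyTorch's built-in `nn.Transformer`; the sizes (`d_model=512`, 8 heads, 6 layers) follow the original Transformer defaults and are only illustrative.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder stack mirroring Figure 2-1 (illustrative sizes only).
d_model, nhead, num_layers = 512, 8, 6
model = nn.Transformer(
    d_model=d_model,
    nhead=nhead,                    # multi-head self-attention
    num_encoder_layers=num_layers,  # each layer: self-attention + feed-forward
    num_decoder_layers=num_layers,  # each layer adds encoder-decoder attention
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.randn(2, 10, d_model)  # (batch, source length, d_model)
tgt = torch.randn(2, 7, d_model)   # (batch, target length, d_model)
out = model(src, tgt)              # -> (2, 7, 512)
```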

2.1 General Formulation of Self-Attention

2.2 Scaled Dot-product Self-Attention

  • Multi-Head Attention
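
A minimal sketch of how these two pieces fit together, assuming PyTorch tensors and illustrative shapes: scaled dot-product attention computes `softmax(QK^T / sqrt(d_k)) V`, and multi-head attention runs several such attentions in parallel on lower-dimensional projections and concatenates the results.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)                # attention weights sum to 1 over the keys
    return weights @ v                                 # weighted sum of the values

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_model) projection matrices
    b, n, d_model = x.shape
    d_k = d_model // num_heads
    def split(t):  # (b, n, d_model) -> (b, heads, n, d_k)
        return t.view(b, n, num_heads, d_k).transpose(1, 2)
    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    out = scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(b, n, d_model)   # concatenate the heads
    return out @ w_o
```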

2.3 Other Key Concepts in Transformer

  • Residual Connection in the Encoder and Decoder
  • Feed-Forward Network
  • Final Layer in the Decoder
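
A rough sketch of how these pieces combine into one encoder block (a residual connection plus LayerNorm wrapped around both the attention sub-layer and the position-wise feed-forward network); the module sizes are illustrative, not taken from the survey.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: self-attention and FFN, each wrapped in a residual + LayerNorm."""
    def __init__(self, d_model=512, nhead=8, dim_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(                      # position-wise feed-forward network
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])      # residual connection around attention
        x = self.norm2(x + self.ffn(x))                # residual connection around the FFN
        return x
```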

3. Revisiting Transformers for NLP


Figure 3-1 Transformer-based large language models

3.1 BERT and its variants

3.2 Generative Pre-trained Transformer models

  • GPT
  • GPT2
  • GPT3
  • BioNLP Domain

The Transformer has far-reaching applications and can be extended to domains beyond NLP.

The rapid development of transformer-based models on a variety of NLP-related tasks demonstrates its structural superiority and versatility, opening up the possibility that it will become a universal module applied in many AI fields other than just NLP. The following part of this survey focuses on the applications of transformer in a wide range of computer vision tasks that have emerged over the past two years.


4. Vision Transformer

4.1 Backbone for Representation Learning


Figure 4-1 CNN + Transformer

Figure 4-2 Test results of various models on ImageNet

Figure 4-3 FLOPs comparison of representative CNN and vision transformer models
`FLOPS` (all uppercase) stands for `floating point operations per second`, i.e., the number of floating-point operations executed per second; it measures hardware speed. `FLOPs` (lowercase s) stands for `floating point operations` (the s marks the plural), i.e., the total number of floating-point operations; it measures the computational complexity of an algorithm or model.
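
A back-of-the-envelope sketch of how FLOPs are typically estimated for the two kinds of blocks compared in Figure 4-3; the counting convention (one multiply-add counted as 2 FLOPs) and the example sizes are assumptions, and exact numbers vary between papers.

```python
def conv2d_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a k x k convolution on an h x w x c_in feature map."""
    return 2 * h * w * c_out * c_in * k * k

def self_attention_flops(n, d):
    """Approximate FLOPs of single-head self-attention over n tokens of width d:
    Q/K/V/output projections (4*n*d*d) plus the two n x n attention matmuls."""
    return 2 * (4 * n * d * d + 2 * n * n * d)

# Example: a 14x14 token grid with d=768 (ViT-Base-like) vs a 3x3 conv with 256 channels.
print(self_attention_flops(14 * 14, 768))  # ~1.0 GFLOPs per attention layer
print(conv2d_flops(56, 56, 256, 256, 3))   # ~3.7 GFLOPs
```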

Figure 4-4 Throughput comparison of representative CNN and vision transformer models
4.1.1 Pure Transformer
  • ViT

Figure 4-5 The framework of the Vision Transformer
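A minimal sketch of the ViT front end shown in Figure 4-5: the image is split into fixed-size patches, each patch is linearly projected, a learnable class token is prepended, and position embeddings are added before the sequence enters a standard Transformer encoder. The sizes below are ViT-Base-like but only illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly embed them (ViT-style front end)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual trick for "split into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))

    def forward(self, x):                              # x: (batch, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)    # -> (batch, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend the class token
        return x + self.pos_embed                      # add position embeddings
```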
  • Variants of ViT: TNT, Twins and CAT, Region ViT, DeepViT, KVT, XCiT

4.1.2 Transformer with Convolution
Since the `transformer` lacks local information while handling local information is exactly what a `CNN` is good at, the two can be combined.
  • CPVT: proposes a `conditional positional encoding` (CPE).

> CPVT proposed a conditional positional encoding (CPE) scheme, which is conditioned on the local neighborhood of input tokens and adaptable to arbitrary input sizes, to leverage convolutions for fine-level feature encoding.
  • CvT, CeiT, LocalViT, CMT
  • LeViT: proposes a hybrid neural network that enables fast image classification.
  • BoTNet: replaces the convolutions in the final bottleneck blocks of ResNet with self-attention.
  • Visformer: reflects the differences between CNNs and Transformers.
4.1.3 Self-supervised Representation Learning
  • Generative Based Approach
    iGPT
  • Contrastive Learning Based Approach
    Contrastive learning is the most popular self-supervised learning approach in computer vision.
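
For reference, a small sketch of the InfoNCE-style loss that most contrastive self-supervised methods are built around (two augmented views of each image, with the rest of the batch acting as negatives); the `temperature` value and batch layout are illustrative assumptions, not a specific method from the survey.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same images.
    Each sample's positive is its counterpart in the other view; all other samples
    in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # cosine similarities between views
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```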

4.2 High/Mid-level Vision

4.2.1 Generic Object Detection

Figure 4-6 Transformer-based object detection frameworks

Figure 4-7 Performance comparison of the frameworks

Transformer-based Set Prediction for Detection

  • DETR

Figure 4-8 The overall architecture of DETR
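DETR treats detection as set prediction: a fixed set of object queries is decoded in parallel and matched one-to-one to the ground-truth boxes with the Hungarian algorithm before the loss is computed. The sketch below shows only that matching step with a simplified cost (class probability plus an L1 box term; the real DETR cost also adds a generalized IoU term), so it is illustrative rather than the paper's implementation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """pred_probs: (num_queries, num_classes), pred_boxes: (num_queries, 4),
    gt_labels: (num_gt,), gt_boxes: (num_gt, 4). Returns matched (query, gt) index pairs."""
    # Cost: negative class probability of the true label + L1 distance between boxes.
    cost_class = -pred_probs[:, gt_labels]                     # (num_queries, num_gt)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)         # (num_queries, num_gt)
    cost = (cost_class + 5.0 * cost_bbox).detach().cpu().numpy()
    query_idx, gt_idx = linear_sum_assignment(cost)            # optimal one-to-one assignment
    return list(zip(query_idx.tolist(), gt_idx.tolist()))
```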
  • Transformer-based Backbone for Detection

  • Pre-training for Transformer-based Object Detection

4.2.2 Segmentation
  • Transformer for Panoptic Segmentation
  • Transformer for Instance Segmentation
  • Transformer for Semantic Segmentation
  • Transformer for Medical Image Segmentation
4.2.3 Pose Estimation

Transformer for Hand Pose Estimation
Transformer for Human Pose Estimation

4.2.5 Other Tasks
  • Pedestrian Detection
  • Lane Detection
  • Scene Graph
  • Tracking
  • Re-Identification
  • Point Cloud Learning

4.3 Low-level Vision

4.3.1 Image Generation

Figure 4-9 Diagram of the Taming Transformer architecture

4.3.2 Image Processing

Figure 4-10 Diagram of the IPT architecture

4.4 Video Processing

4.4.1 High-level Video Processing
  • Video Action Recognition
  • Video Retrieval
  • Video Object Detection
  • Multi-task Learning
4.4.2 Low-level Video Processing
  • Frame/Video Synthesis
  • Video Inpainting

5. Conclusions and Discussions

5.1 Challenges

  • Model specialization: models designed specifically for CV tasks
  • Generalization ability and robustness of the models
  • Interpretability of the models
  • Efficiency of the models

5.2 Future Prospects

  • The trade-off between efficiency and performance
  • Handling multiple tasks with one model
  • Choosing between CNN and Transformer models
