Paper Notes 1 — AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

INTRODUCTION

Self-attention-based architectures have become the model of choice in natural language processing.

In computer vision, however, convolutional architectures remain the mainstream approach.

Inspired by the successes in NLP, some works try to combine CNN-like architectures with self-attention.

Others replace the convolutions entirely.

The authors split the image into patches and feed the sequence of linear embeddings of these patches as input to a Transformer.
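A minimal sketch of this patch-to-sequence step in PyTorch (the default sizes below, e.g. 16x16 patches and embedding dimension 768, follow the common ViT-Base/16 setting; the class and argument names are my own):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each P x P patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D) sequence of patch embeddings
        return x
```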

ViT does not generalize well when trained on insufficient amounts of data. However, ViT approaches or beats the state of the art on multiple image recognition benchmarks when pre-trained at sufficient scale.

RELATED WORK

The naive application of self-attention to images requires that each pixel attend to every other pixel. The approximations roughly fall into two routes: one applies self-attention only in local neighborhoods around each query pixel, the other uses Sparse Transformers to scale global self-attention to images.

The authors explore image recognition at larger scales than the standard ImageNet dataset, using two datasets: ImageNet-21k and JFT-300M.

METHOD

1. Similar to BERT's [class] token, the authors prepend a learnable embedding to the sequence of embedded patches. The classification head consists of an MLP with one hidden layer at pre-training time and a single linear layer at fine-tuning time.

The authors use standard learnable 1D position embeddings rather than 2D-aware position embeddings to retain positional information, since the latter did not bring significant gains. The resulting sequence of embedding vectors then serves as input to the Transformer encoder.
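Building on the patch-embedding sketch above, a rough illustration of prepending the [class] token and adding learnable 1D position embeddings (the initialization details are assumptions):

```python
import torch
import torch.nn as nn

class ViTInput(nn.Module):
    """Prepend a learnable [class] token and add learnable 1D position embeddings."""
    def __init__(self, num_patches, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One position embedding per patch plus one for the [class] token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_tokens):                  # (B, N, D)
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)        # (B, 1, D)
        x = torch.cat([cls, patch_tokens], dim=1)     # (B, N + 1, D)
        return x + self.pos_embed                     # input to the Transformer encoder
```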

Hybrid architecture. The authors propose that the input sequence can also be obtained by flattening the spatial dimensions of a CNN feature map and projecting it to the Transformer dimension.
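A hedged sketch of this hybrid variant, where the "patches" are the spatial positions of a CNN feature map projected to the Transformer dimension (the ResNet-50 backbone here is only a stand-in, not the exact BiT-style backbone used in the paper):

```python
import torch
import torch.nn as nn
import torchvision

# Stand-in CNN backbone: any network producing a (B, C, H', W') feature map works.
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-2])

embed_dim = 768
proj = nn.LazyLinear(embed_dim)            # projects the CNN channels C to the Transformer dimension D

x = torch.randn(2, 3, 224, 224)
feat = backbone(x)                         # (B, C, H', W'), here (2, 2048, 7, 7)
tokens = feat.flatten(2).transpose(1, 2)   # (B, H'*W', C): each spatial position is one "patch"
tokens = proj(tokens)                      # (B, H'*W', D): sequence fed to the Transformer
```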

Inductive bias. The authors note that the Vision Transformer has much less image-specific inductive bias than CNNs. In ViT, only the MLP layers are local and translationally equivariant, while the self-attention layers are global.

2. Fine-tuning at higher resolution.

First, for a new downstream task, the authors remove the pre-trained prediction head and attach a zero-initialized \(D\times K\) feedforward layer, where \(K\) is the number of downstream classes. Second, fine-tuning is often done at higher resolution than pre-training; keeping the patch size the same yields a longer sequence, so the pre-trained position embeddings no longer match. They therefore perform 2D interpolation of the pre-trained position embeddings according to their location in the original image.
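A minimal sketch of this 2D interpolation, assuming a square grid of patch position embeddings and keeping the [class] token's embedding unchanged (this follows the idea described above rather than the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid_size):
    """pos_embed: (1, 1 + N, D) with N = old_grid_size**2 patch positions."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    D = patch_pos.shape[-1]
    # Reshape to a 2D grid, interpolate in 2D, then flatten back to a sequence.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid_size, new_grid_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, D)
    return torch.cat([cls_pos, patch_pos], dim=1)

# Example: a 224px image with 16px patches gives a 14x14 grid at pre-training;
# fine-tuning at 384px gives a 24x24 grid.
new_pos = interpolate_pos_embed(torch.randn(1, 1 + 14 * 14, 768), new_grid_size=24)
```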

EXPERIMENTS

When the computational cost of pre-training is taken into account, ViT performs very favourably.

First

1. Datasets. They use ImageNet-21k and JFT (18k classes) as the pre-training datasets. They also transfer the pre-trained models to several benchmark tasks: CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102 and the VTAB suite.

2. Model variants. The authors use "B" (Base), "L" (Large) and "H" (Huge) to indicate the model size, with the patch size appended to the name (e.g. ViT-L/16), as summarized below. For the ResNet baselines, they also replace the Batch Normalization layers with Group Normalization and use standardized convolutions.
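As a quick reference, the variant sizes (as I recall them from the paper's Table 1; treat the exact numbers as approximate) can be written as a small config dict:

```python
# ViT variant hyperparameters, roughly as reported in the paper's Table 1.
VIT_VARIANTS = {
    "ViT-Base":  dict(layers=12, hidden=768,  mlp=3072, heads=12),  # ~86M params
    "ViT-Large": dict(layers=24, hidden=1024, mlp=4096, heads=16),  # ~307M params
    "ViT-Huge":  dict(layers=32, hidden=1280, mlp=5120, heads=16),  # ~632M params
}
```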

3.Training & Fine-tuning. They use Adam, linear learning rate warmup and decay in training period and SGD with momentum in fine-tuning period.

4. Metrics. The authors report two metrics on the downstream datasets: few-shot accuracy and fine-tuning accuracy. Few-shot accuracy is obtained by fitting a regularized least-squares linear model on frozen representations, so it can be evaluated cheaply on the fly where full fine-tuning would be too costly.
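A hedged sketch of such a closed-form few-shot evaluation (the function name, the regularization strength `lam`, and the data handling are my own illustrative choices; only the frozen-features-plus-regularized-least-squares idea comes from the paper):

```python
import numpy as np

def few_shot_accuracy(train_feats, train_labels, test_feats, test_labels,
                      num_classes, lam=1e-3):
    """Closed-form regularized least-squares 'linear probe' on frozen features."""
    # Map labels to {-1, +1}^K target vectors.
    Y = -np.ones((len(train_labels), num_classes))
    Y[np.arange(len(train_labels)), train_labels] = 1.0
    X = train_feats                                   # (n, D) frozen representations
    # Ridge-regression solution: W = (X^T X + lam * I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    preds = (test_feats @ W).argmax(axis=1)
    return (preds == test_labels).mean()
```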

Second

The authors compare against two models that are state of the art on ImageNet and the other reported datasets: BiT, a supervised model, and Noisy Student, a semi-supervised model.

Through the experiments, the authors verify that ViT outperforms the ResNet-based baselines on all datasets.

(Accuracy is the reported metric.)

Third

The authors run two experiments: pre-training ViT models on datasets of increasing size, and pre-training on random subsets of 9M, 30M, and 90M images as well as the full JFT dataset. They conclude that ViT performs well when pre-trained on larger datasets: learning the relevant patterns directly from the data is sufficient, and even beneficial, at larger scale.

The authors therefore note that analyzing the few-shot properties of ViT is an exciting direction for future work.

Fourth

1. Projection. The first layer of ViT linearly projects the flattened patches into a lower-dimensional space, after which the learned position embeddings are added to the patch representations.

2. Position embeddings. After the projection, the learned position embeddings encode distance within the image: spatially closer patches tend to have more similar embeddings.

3. Attention distance. Mean attention distance is analogous to receptive field size in CNNs. The authors find that even in the lower layers some heads attend to most of the image, showing that ViT can integrate information globally.
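A rough sketch of how a mean attention distance could be computed from one head's attention weights over the patch grid (my own illustration of the idea, not the authors' code):

```python
import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (N, N) attention weights from patch queries to patch keys
    (with the [class] token removed); returns the average attended pixel distance."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float() * patch_size   # (N, 2) patch centers in pixels
    # Pairwise Euclidean distances between every query patch and key patch.
    dists = torch.cdist(coords, coords)                   # (N, N)
    # Weight each distance by its attention probability and average over queries.
    return (attn * dists).sum(dim=-1).mean().item()

# Example with uniform attention over a 14x14 grid (224px image, 16px patches).
attn = torch.full((196, 196), 1.0 / 196)
print(mean_attention_distance(attn, grid_size=14))
```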

CONCLUSION

ViT does not merely introduce self-attention into computer vision; it interprets an image as a sequence of patches and processes it with a standard Transformer encoder as used in NLP.

The authors say many challenges remain. One is to apply ViT to other computer vision tasks such as detection and segmentation.

Another is to continue exploring self-supervised pre-training.
