Attention Mechanisms!


I. Survey | Attention Mechanisms in Computer Vision

Reposted from: Survey | Attention Mechanisms in Computer Vision
Paper: arXiv 2111.07624 (reference [1] below)
Repository: https://github.com/MenghaoGuo/Awesome-Vision-Attentions

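On November 16, 2021, the Jittor team at Tsinghua University, together with Prof. Ming-Ming Cheng's group at Nankai University and Prof. Ralph R. Martin of Cardiff University, posted a survey on attention mechanisms in computer vision on arXiv [1]. The survey systematically reviews work on attention mechanisms in computer vision, and the authors created a repository dedicated to collecting papers on the topic.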

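The first author of the survey is Meng-Hao Guo, a PhD student of Prof. Shi-Min Hu. He is also the first author of the Point Cloud Transformer (PCT) paper published by the Jittor team. The other authors include Associate Professor Song-Hai Zhang and Dr. Tai-Jiang Mu of Tsinghua University, as well as several PhD students from Tsinghua and Nankai.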

Part 1 Research Background

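The human visual system naturally and efficiently finds the important regions in a complex scene. Inspired by this, attention mechanisms were introduced into computer vision systems. They have since achieved great success across vision tasks such as image recognition, object detection, semantic segmentation, action recognition, image generation, and 3D vision.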

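However, when researchers study attention mechanisms for a particular task, they tend to focus on the task itself and overlook that attention is a research direction in its own right: an attempt to have computer vision systems imitate the human visual system.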

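The survey tries to connect the attention mechanisms used in different vision tasks into a single whole from two angles. Starting from the attention mechanism itself, it systematically summarizes the field and proposes potential directions for future research.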

Part 2 What Is an Attention Mechanism?

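An attention mechanism can be understood as a computer vision system imitating the human visual system's ability to focus on key regions quickly and efficiently. When humans face a complex scene, we quickly focus on the key regions and process them. For a vision system, this process can be abstracted as: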

\[\text { Attention }=f(g(x), x) \]

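Here g(x) denotes the step that processes the input features and generates attention, and f(g(x), x) denotes the step that processes the input features using that attention. Take two concrete examples, self-attention and SENet. For self-attention, the formula above becomes: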

\[\begin{aligned} Q, K, V &=\operatorname{Linear}(x) \\ g(x) &=\operatorname{Softmax}(Q K^{\top}) \\ f(g(x), x) &=g(x) V \end{aligned} \]
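The self-attention case can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the survey's code: the function names `g` and `f` mirror the formula, the token count and feature size are hypothetical, the three linear maps are random bias-free matrices, and the product of Q and K is taken as Q Kᵀ so that the shapes line up.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def g(x, w_q, w_k):
    """Generate attention: an (N, N) map with one row of weights per token."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T, axis=-1)


def f(attn, x, w_v):
    """Apply attention: each output token is a weighted sum of the values."""
    return attn @ (x @ w_v)


rng = np.random.default_rng(0)
n_tokens, dim = 4, 8  # hypothetical sizes, chosen only for illustration
x = rng.standard_normal((n_tokens, dim))
# Assumption: Linear(x) is modeled as three bias-free random projections.
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))

attn = g(x, w_q, w_k)  # (4, 4); every row sums to 1
out = f(attn, x, w_v)  # (4, 8); same shape as the input
```

Practical implementations usually also scale the logits and use several heads; both are left out here to stay close to the formula.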

For SENet, the formula becomes:

\[\begin{aligned} g(x) &=\operatorname{Sigmoid}(\operatorname{MLP}(\operatorname{GAP}(x))) \end{aligned} \]

\[f(g(x), x)=g(x) x \]
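The SENet case maps onto the same g/f split. The sketch below is a hedged illustration rather than the reference implementation: the MLP is assumed to be two bias-free layers with a ReLU in between, and the channel count and reduction ratio are hypothetical values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_g(x, w1, w2):
    """g(x) = Sigmoid(MLP(GAP(x))): one weight per channel."""
    squeezed = x.mean(axis=(1, 2))           # GAP: (C, H, W) -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # assumed MLP layer 1 + ReLU: (C // r,)
    return sigmoid(w2 @ hidden)              # assumed MLP layer 2 + sigmoid: (C,)


def se_f(weights, x):
    """f(g(x), x) = g(x) x: rescale every channel by its weight."""
    return weights[:, None, None] * x


rng = np.random.default_rng(0)
channels, height, width, reduction = 8, 5, 5, 4  # hypothetical sizes
x = rng.standard_normal((channels, height, width))
w1 = rng.standard_normal((channels // reduction, channels))
w2 = rng.standard_normal((channels, channels // reduction))

weights = se_g(x, w1, w2)  # (8,)
out = se_f(weights, x)     # (8, 5, 5)
```

Unlike self-attention, g here returns a single vector of channel weights, and f is a plain broadcasted multiplication.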

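The survey then makes each attention mechanism concrete in this way, that is, it spells out the g step and the f step for each one. This is the survey's first unifying angle on attention mechanisms: a unified definition.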

Part 3 The Development of Attention Mechanisms in Vision

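Figure 1 shows how attention mechanisms in vision have developed.

Attention mechanisms in vision can be roughly divided into four phases. The first begins with RAM [4]; these methods all use an RNN to generate attention. The second begins with STN [5]; these methods explicitly predict the important regions, and other representative work includes DCNs [6, 7]. The third begins with SENet [3]; these methods implicitly predict the important parts, and other representative work includes CBAM [8]. The fourth covers methods based on self-attention, with representative work such as Non-Local [2] and ViT [9]. Figure 2 gives a taxonomy tree of these methods.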

Part 4 A Taxonomy of Visual Attention Mechanisms

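The authors classify visual attention mechanisms by the attention method itself rather than by application, which gives attention research a unified perspective.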

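As Figure 3 shows, the authors group attention by the dimension it acts on into four basic types: channel attention, spatial attention, temporal attention, and branch attention. There are also two combined types: channel-spatial attention and spatial-temporal attention.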

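Each type carries a different meaning. Channel attention, for example, selects the important channels. Because different channels in a deep feature map often represent different objects, it amounts to deciding what (which object) to focus on, that is, what to attend.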

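Similarly, spatial attention corresponds to where to attend, temporal attention to when to attend, and branch attention to which to attend. See the paper for the specific attention mechanisms.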
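One way to picture the four basic types is by the axis along which the attention weights vary. The sketch below is an assumption-laden illustration, not something taken from the survey: the feature shape (T, C, H, W) is hypothetical, the weights are random placeholders standing in for a learned g(x), and the two "branches" are made-up transforms of the same input.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 3, 4, 5, 5  # hypothetical video feature: frames, channels, height, width
x = rng.standard_normal((T, C, H, W))


def placeholder_weights(shape):
    """Stand-in for g(x): random weights in [0, 1) with the given shape."""
    return rng.uniform(size=shape)


# what to attend: one weight per channel, shared over time and space
channel_out = placeholder_weights((1, C, 1, 1)) * x
# where to attend: one weight per spatial position, shared over time and channels
spatial_out = placeholder_weights((1, 1, H, W)) * x
# when to attend: one weight per frame
temporal_out = placeholder_weights((T, 1, 1, 1)) * x

# which to attend: softmax weights over the outputs of parallel branches
branches = np.stack([x, np.tanh(x)])  # two hypothetical branches: (2, T, C, H, W)
logits = rng.standard_normal(branches.shape[0])
branch_w = np.exp(logits - logits.max())
branch_w = branch_w / branch_w.sum()
branch_out = np.tensordot(branch_w, branches, axes=1)  # weighted sum: (T, C, H, W)
```

The combined types stack these ideas. For instance, a channel-spatial module could apply a channel-shaped weight followed by a spatially shaped one.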

Part 5 Future Research Directions

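The survey also proposes seven potential research directions for attention mechanisms:

  1. Necessary and sufficient conditions for attention mechanisms
  2. More general attention modules
  3. Interpretability of attention mechanisms
  4. Sparse activation in attention mechanisms
  5. Pre-trained models based on attention mechanisms
  6. Optimization methods suited to attention mechanisms
  7. Deploying models that use attention mechanisms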

References

  1. M.-H. Guo, T.-X. Xu, J.-J. Liu, Z.-N. Liu, P.-T. Jiang, T.-J. Mu, S.-H. Zhang, R. R. Martin, M.-M. Cheng and S.-M. Hu, Attention Mechanisms in Computer Vision: A Survey, arXiv 2111.07624.
  2. X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, CVPR 2018, 7794-7803.
  3. J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks, IEEE TPAMI, 2020, Vol. 42, No. 8, 2011-2023.
  4. V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, Recurrent models of visual attention, NeurIPS 2014, 2204-2212.
  5. M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, NeurIPS 2015, 2017-2025.
  6. J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, ICCV 2017, 764-773.
  7. X. Zhu, H. Hu, S. Lin, J. Dai, Deformable convnets v2: More deformable, better results, CVPR 2019, 9308-9316.
  8. S. Woo, J. Park, J. Lee, and I. S. Kweon, CBAM: convolutional block attention module, ECCV 2018, 3-19.
  9. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, ICLR, 2021, 1-21.

II. Awesome-Vision-Attentions

Channel attention

  • Squeeze-and-Excitation Networks (CVPR 2018) pdf, (PAMI2019 version) pdf 🔥
  • Image super-resolution using very deep residual channel attention networks (ECCV 2018) pdf 🔥
  • Context encoding for semantic segmentation (CVPR 2018) pdf 🔥
  • Spatio-temporal channel correlation networks for action classification (ECCV 2018) pdf
  • Global second-order pooling convolutional networks (CVPR 2019) pdf
  • Srm: A style-based recalibration module for convolutional neural networks (ICCV 2019) pdf
  • You look twice: Gaternet for dynamic filter selection in cnns (CVPR 2019) pdf
  • Second-order attention network for single image super-resolution (CVPR 2019) pdf 🔥
  • DIANet: Dense-and-Implicit Attention Network (AAAI 2020) pdf
  • Spsequencenet: Semantic segmentation network on 4d point clouds (CVPR 2020) pdf
  • Ecanet: Efficient channel attention for deep convolutional neural networks (CVPR 2020) pdf 🔥
  • Gated channel transformation for visual recognition (CVPR 2020) pdf
  • Fcanet: Frequency channel attention networks (ICCV 2021) pdf

Spatial attention

  • Recurrent models of visual attention (NeurIPS 2014) pdf 🔥
  • Show, attend and tell: Neural image caption generation with visual attention (PMLR 2015) pdf 🔥
  • Draw: A recurrent neural network for image generation (ICML 2015) pdf 🔥
  • Spatial transformer networks (NeurIPS 2015) pdf 🔥
  • Multiple object recognition with visual attention (ICLR 2015) pdf 🔥
  • Action recognition using visual attention (arXiv 2015) pdf 🔥
  • Videolstm convolves, attends and flows for action recognition (arXiv 2016) pdf 🔥
  • Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition (CVPR 2017) pdf 🔥
  • Learning multi-attention convolutional neural network for fine-grained image recognition (ICCV 2017) pdf 🔥
  • Diversified visual attention networks for fine-grained object classification (TMM 2017) pdf 🔥
  • High-Order Attention Models for Visual Question Answering (NeurIPS 2017) pdf
  • Attentional pooling for action recognition (NeurIPS 2017) pdf 🔥
  • Non-local neural networks (CVPR 2018) pdf 🔥
  • Attentional shapecontextnet for point cloud recognition (CVPR 2018) pdf
  • Relation networks for object detection (CVPR 2018) pdf 🔥
  • a2-nets: Double attention networks (NeurIPS 2018) pdf 🔥
  • Attention-aware compositional network for person re-identification (CVPR 2018) pdf 🔥
  • Tell me where to look: Guided attention inference network (CVPR 2018) pdf 🔥
  • Pedestrian alignment network for large-scale person re-identification (TCSVT 2018) pdf 🔥
  • Learn to pay attention (ICLR 2018) pdf 🔥
  • Attention U-Net: Learning Where to Look for the Pancreas (MIDL 2018) pdf 🔥
  • Psanet: Point-wise spatial attention network for scene parsing (ECCV 2018) pdf 🔥
  • Self attention generative adversarial networks (ICML 2019) pdf 🔥
  • Attentional pointnet for 3d-object detection in point clouds (CVPRW 2019) pdf
  • Co-occurrent features in semantic segmentation (CVPR 2019) pdf
  • Factor Graph Attention (CVPR 2019) pdf
  • Attention augmented convolutional networks (ICCV 2019) pdf 🔥
  • Local relation networks for image recognition (ICCV 2019) pdf
  • Latentgnn: Learning efficient nonlocal relations for visual recognition (ICML 2019) pdf
  • Graph-based global reasoning networks (CVPR 2019) pdf 🔥
  • Gcnet: Non-local networks meet squeeze-excitation networks and beyond (ICCVW 2019) pdf 🔥
  • Asymmetric non-local neural networks for semantic segmentation (ICCV 2019) pdf 🔥
  • Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition (CVPR 2019) pdf
  • Second-order non-local attention networks for person re-identification (ICCV 2019) pdf 🔥
  • End-to-end comparative attention networks for person re-identification (ICCV 2019) pdf 🔥
  • Modeling point clouds with self-attention and gumbel subset sampling (CVPR 2019) pdf
  • Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification (arXiv 2019) pdf
  • L2g autoencoder: Understanding point clouds by local-to-global reconstruction with hierarchical self-attention (arXiv 2019) pdf
  • Generative pretraining from pixels (PMLR 2020) pdf
  • Exploring self-attention for image recognition (CVPR 2020) pdf
  • Cf-sis: Semantic-instance segmentation of 3d point clouds by context fusion with self attention (ACM MM 2020) pdf
  • Disentangled non-local neural networks (ECCV 2020) pdf
  • Relation-aware global attention for person re-identification (CVPR 2020) pdf
  • Segmentation transformer: Object-contextual representations for semantic segmentation (ECCV 2020) pdf 🔥
  • Spatial pyramid based graph reasoning for semantic segmentation (CVPR 2020) pdf
  • Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation (CVPR 2020) pdf
  • End-to-end object detection with transformers (ECCV 2020) pdf 🔥
  • Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling (CVPR 2020) pdf
  • Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers (CVPR 2021) pdf
  • An image is worth 16x16 words: Transformers for image recognition at scale (ICLR 2021) pdf 🔥
  • Is Attention Better Than Matrix Decomposition? (ICLR 2021) pdf
  • Ocnet: Object context network for scene parsing (IJCV 2021) pdf 🔥
  • Point transformer (ICCV 2021) pdf
  • PCT: Point Cloud Transformer (CVMJ 2021) pdf
  • Pre-trained image processing transformer (CVPR 2021) pdf
  • An empirical study of training self-supervised vision transformers (ICCV 2021) pdf
  • Segformer: Simple and efficient design for semantic segmentation with transformers (arxiv 2021) pdf
  • Beit: Bert pre-training of image transformers (arxiv 2021) pdf
  • Beyond Self-attention: External attention using two linear layers for visual tasks (arxiv 2021) pdf
  • Query2label: A simple transformer way to multi-label classification (arxiv 2021) pdf
  • Transformer in transformer (arxiv 2021) pdf

Temporal attention

  • Jointly attentive spatial-temporal pooling networks for video-based person re-identification (ICCV 2017) pdf 🔥
  • Video person reidentification with competitive snippet-similarity aggregation and co-attentive snippet embedding (CVPR 2018) pdf
  • Scan: Self-and-collaborative attention network for video person re-identification (TIP 2019) pdf

Branch attention

  • Training very deep networks (NeurIPS 2015) pdf 🔥
  • Selective kernel networks (CVPR 2019) pdf 🔥
  • CondConv: Conditionally Parameterized Convolutions for Efficient Inference (NeurIPS 2019) pdf
  • Dynamic convolution: Attention over convolution kernels (CVPR 2020) pdf
  • ResNest: Split-attention networks (arXiv 2020) pdf 🔥

Channel & Spatial attention

  • Residual attention network for image classification (CVPR 2017) pdf 🔥
  • SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning (CVPR 2017) pdf 🔥
  • CBAM: convolutional block attention module (ECCV 2018) pdf 🔥
  • Harmonious attention network for person re-identification (CVPR 2018) pdf 🔥
  • Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks (TMI 2018) pdf
  • Mancs: A multi-task attentional network with curriculum sampling for person re-identification (ECCV 2018) pdf 🔥
  • Bam: Bottleneck attention module (BMVC 2018) pdf 🔥
  • Pvnet: A joint convolutional network of point cloud and multi-view for 3d shape recognition (ACM MM 2018) pdf
  • Learning what and where to attend (ICLR 2019) pdf
  • Dual attention network for scene segmentation (CVPR 2019) pdf 🔥
  • Abd-net: Attentive but diverse person re-identification (ICCV 2019) pdf
  • Mixed high-order attention network for person re-identification (ICCV 2019) pdf
  • Mlcvnet: Multi-level context votenet for 3d object detection (CVPR 2020) pdf
  • Improving convolutional networks with self-calibrated convolutions (CVPR 2020) pdf
  • Relation-aware global attention for person re-identification (CVPR 2020) pdf
  • Strip Pooling: Rethinking spatial pooling for scene parsing (CVPR 2020) pdf
  • Rotate to attend: Convolutional triplet attention module (WACV 2021) pdf
  • Coordinate attention for efficient mobile network design (CVPR 2021) pdf
  • Simam: A simple, parameter-free attention module for convolutional neural networks (ICML 2021) pdf

Spatial & Temporal attention

  • An end-to-end spatio-temporal attention model for human action recognition from skeleton data (AAAI 2017) pdf 🔥
  • Diversity regularized spatiotemporal attention for video-based person re-identification (arXiv 2018) 🔥
  • Interpretable spatio-temporal attention for video action recognition (ICCVW 2019) pdf
  • A Simple Baseline for Audio-Visual Scene-Aware Dialog (CVPR 2019) pdf
  • Hierarchical lstms with adaptive attention for visual captioning (TPAMI 2020) pdf
  • Stat: Spatial-temporal attention mechanism for video captioning (TMM 2020) pdf
  • Gta: Global temporal attention for video action understanding (arXiv 2020) pdf
  • Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification (CVPR 2020) pdf
  • Read: Reciprocal attention discriminator for image-to-video re-identification (ECCV 2020) pdf
  • Decoupled spatial-temporal transformer for video inpainting (arXiv 2021) pdf
  • Towards Coherent Visual Storytelling with Ordered Image Attention (arXiv 2021) pdf

III. External-Attention-pytorch





Posted 2022-05-06 12:48 by 梁君牧