On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention

Comparison with the LSTM-based approach:

[Figure: architecture comparison. (a) SAR (Li et al., 2019): a ResNet backbone produces a 2D feature map, a height-pooling + LSTM encoder collapses it into a global feature, and an LSTM decoder reads it out. (b) SATRN (Ours): a shallow CNN feeds a Transformer encoder and decoder that operate directly on the 2D feature map.]

Compared with the original Transformer, the main differences lie in the encoder, which consists of three parts:

1. Shallow CNN, used to keep the computational cost under control

More specifically, the shallow CNN block consists of two convolution layers with 3×3 kernels, each followed by a max pooling layer with a 2×2 kernel of stride 2. The resulting 1/4 reduction factor provided a good balance in the computation-performance trade-off in the authors' preliminary studies: if the spatial dimensions are reduced further, performance drops heavily; if they are reduced less, the computational burden of the later self-attention blocks increases sharply.
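A minimal PyTorch sketch of such a block (not the authors' code; the channel widths and the BatchNorm/ReLU placement are my assumptions):

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    """Two 3x3 convolutions, each followed by 2x2 max pooling with stride 2,
    giving a 1/4 reduction of both spatial dimensions."""
    def __init__(self, in_channels=3, out_channels=512):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 2, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels // 2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),      # -> H/2, W/2
            nn.Conv2d(out_channels // 2, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),      # -> H/4, W/4
        )

    def forward(self, x):              # x: (B, 3, H, W)
        return self.block(x)           # (B, out_channels, H/4, W/4)
```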

2. Adaptive 2D positional encoding (A2DPE)

The paper notes that the Transformer's positional encoding module may not work as-is in the vision setting, yet positional information remains essential, especially since the paper targets recognition of text with arbitrary shapes. The authors therefore make the positional encoding learnable and adaptive; the goal is to let the model adjust, per input, how strongly height and width positions are encoded:

$$\mathbf{p}_{hw} = \alpha(\mathbf{E})\, \mathbf{p}^{\mathrm{sinu}}_h + \beta(\mathbf{E})\, \mathbf{p}^{\mathrm{sinu}}_w \tag{4}$$

where $\mathbf{p}^{\mathrm{sinu}}$ is the standard sinusoidal positional encoding,

$$p^{\mathrm{sinu}}_{p,2i} = \sin\!\left(p / 10000^{2i/D}\right), \qquad p^{\mathrm{sinu}}_{p,2i+1} = \cos\!\left(p / 10000^{2i/D}\right) \tag{5}$$

and the two scales are predicted from the feature map $\mathbf{E}$:

$$\alpha(\mathbf{E}) = \mathrm{sigmoid}\!\left(\max\!\left(0,\, g(\mathbf{E})\, W^h_1\right) W^h_2\right) \tag{7}$$

$$\beta(\mathbf{E}) = \mathrm{sigmoid}\!\left(\max\!\left(0,\, g(\mathbf{E})\, W^w_1\right) W^w_2\right) \tag{8}$$

Here E is the convolutional feature map of the image and g is a pooling operation; the pooled feature is passed through linear layers to produce α and β, which then scale the positional encodings along the image height h and width w (computed in the same way as the Transformer's sinusoidal positional encoding).

The predicted α and β directly scale the height and width positional encodings, controlling the relative ratio between the horizontal and vertical axes so as to express spatial diversity. By learning to infer α and β from the input, A2DPE allows the model to adapt the positional scale along the height and width directions.
We visualize random input images from three groups with different predicted aspect ratios, as a by-product of A2DPE. Figure 7 shows examples grouped by the ratio α/β. The low aspect ratio group, as expected, contains mostly horizontal samples, and the high aspect ratio group contains mostly vertical samples. By dynamically adjusting the grid spacing, A2DPE reduces the representation burden on the other modules, leading to a performance boost.
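A rough PyTorch sketch of A2DPE along the lines of Eqs. (4)-(8), assuming g(·) is global average pooling and that α and β come from small two-layer MLPs with sigmoid outputs (the hidden size is my choice, not from the paper):

```python
import torch
import torch.nn as nn

def sinusoidal_encoding(length, dim):
    """Standard Transformer sinusoidal encoding, shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)               # (dim/2,)
    angles = pos / torch.pow(10000.0, i / dim)                     # (L, dim/2)
    enc = torch.zeros(length, dim)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

class A2DPE(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        # small MLPs predicting the scalar scales alpha and beta from the pooled feature
        self.alpha_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1), nn.Sigmoid())
        self.beta_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, E):                              # E: (B, D, H, W) feature map
        B, D, H, W = E.shape
        g = E.mean(dim=(2, 3))                         # global average pooling -> (B, D)
        alpha = self.alpha_mlp(g).view(B, 1, 1, 1)     # Eq. (7)
        beta = self.beta_mlp(g).view(B, 1, 1, 1)       # Eq. (8)
        p_h = sinusoidal_encoding(H, D).to(E.device)   # (H, D), Eq. (5)
        p_w = sinusoidal_encoding(W, D).to(E.device)   # (W, D)
        p_h = p_h.transpose(0, 1).reshape(1, D, H, 1)  # broadcast over width
        p_w = p_w.transpose(0, 1).reshape(1, D, 1, W)  # broadcast over height
        return E + alpha * p_h + beta * p_w            # Eq. (4)
```

For a mostly horizontal image the model can push the width scale up and the height scale down (and vice versa for vertical text), which is exactly the aspect-ratio behaviour described above.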

3. Locality-aware feedforward layer

For good STR performance, a model should not only utilize long-range dependencies but also exploit the local vicinity around individual characters.

The authors argue that the Transformer's self-attention excels at modeling long-range relations but handles local relations less well, so they replace the feedforward layer of option (a) with the separable convolution of option (c) in Figure 4 to strengthen the interaction between neighboring features.

 

[Figure 4: Feedforward architecture options applied after the self-attention layer, all operating on the 512-d feature map: (a) Fully-connected: 1×1 conv, 2048 → 1×1 conv, 512; (b) Convolution: 3×3 conv, 2048 → 3×3 conv, 512; (c) Separable: 1×1 conv, 2048 → 3×3 depthwise conv, 2048 → 1×1 conv, 512.]

Features at neighboring positions of the 512-d feature map now interact through convolution; this is a way of fusing the local features familiar from CV into the Transformer, and it plausibly helps.
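A sketch of the separable option (c), using the 512 → 2048 → 512 widths from Figure 4; the padding choice and ReLU placement are assumptions on my part:

```python
import torch.nn as nn

class LocalityAwareFeedforward(nn.Module):
    """Option (c): 1x1 conv -> 3x3 depthwise conv -> 1x1 conv on the 2D feature map."""
    def __init__(self, dim=512, expansion=2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Conv2d(dim, expansion, kernel_size=1),            # 1x1 conv, 512 -> 2048
            nn.ReLU(inplace=True),
            nn.Conv2d(expansion, expansion, kernel_size=3,
                      padding=1, groups=expansion),              # 3x3 depthwise conv, 2048 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(expansion, dim, kernel_size=1),            # 1x1 conv, 2048 -> 512
        )

    def forward(self, x):          # x: (B, 512, H, W)
        return self.ff(x)          # same spatial size; residual connection added outside
```

The depthwise 3×3 layer is what gives the feedforward block its local receptive field while keeping the cost close to the original point-wise version.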
