[Paper] Deformable DETR
Deformable DETR: Deformable Transformers for End-to-End Object Detection
3 Revisiting Transformers and DETR#
There are two known issues with Transformers.
- One is that Transformers need long training schedules before convergence. At initialization, the linear projections follow distributions with mean 0 and variance 1, so the attention weights come out nearly uniform (roughly 1/N_k for N_k key elements), i.e. all equally small. This leads to ambiguous gradients for the input features; thus, long training schedules are required before the attention weights can learn to focus on specific keys.
- The computational and memory complexity of multi-head attention can be very high with numerous query and key elements. The multi-head attention module suffers from a quadratic complexity growth with the feature map size (a rough count is given below).
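To make the second point concrete, here is the standard accounting behind the quadratic claim (this is ordinary self-attention arithmetic, not a formula quoted from the note above):

```latex
% Self-attention cost over an H x W feature map with C channels:
% the numbers of queries and keys are both N_q = N_k = HW.
\mathcal{O}(N_q N_k C) = \mathcal{O}\bigl((HW)^2 C\bigr) = \mathcal{O}(H^2 W^2 C)
```

So doubling the spatial resolution multiplies the cost by roughly 16, which is why high-resolution feature maps are impractical for DETR's encoder.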
DETR: built upon the Transformer encoder-decoder architecture, combined with a set-based Hungarian loss that forces unique predictions for each ground-truth bounding box via bipartite matching.
It also has its own issues.
- DETR has relatively low performance in detecting small objects. Modern object detectors use high-resolution feature maps to better detect small objects. However, high-resolution feature maps would lead to an unacceptable complexity for the self-attention module in the Transformer encoder of DETR.
- DETR requires many more training epochs to converge.
4 Method#
4.1 Deformable Transformers for End-to-End Object Detection#
Deformable Attention Module.
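- The paper defines (single-scale) deformable attention roughly as below; this is my transcription, so treat the exact index conventions as approximate.

```latex
% Deformable attention for query feature z_q with reference point p_q over feature map x:
\mathrm{DeformAttn}(z_q, p_q, x)
  = \sum_{m=1}^{M} W_m \Bigl[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x\bigl(p_q + \Delta p_{mqk}\bigr) \Bigr]
```

Here M is the number of attention heads, K the number of sampled keys (K ≪ HW), Δp_{mqk} the sampling offsets and A_{mqk} the attention weights (normalized so that they sum to 1 over k), both predicted from z_q via linear projections; x(p_q + Δp_{mqk}) is evaluated by bilinear interpolation because the offsets are fractional.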
Multi-scale Deformable Attention Module.
- Our proposed deformable attention module can be naturally extended for multi-scale feature maps.
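A minimal, single-head PyTorch sketch of the multi-scale sampling idea follows. It is not the paper's official (CUDA) implementation; the class name, tensor shapes, and offset-normalization convention here are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMSDeformAttn(nn.Module):
    """Single-head multi-scale deformable attention (illustrative sketch)."""
    def __init__(self, d_model=256, n_levels=4, n_points=4):
        super().__init__()
        self.n_levels, self.n_points = n_levels, n_points
        # Each query predicts K sampling offsets and K attention weights per feature level.
        self.sampling_offsets = nn.Linear(d_model, n_levels * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_levels * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, reference_points, value_maps):
        # query:            (B, Nq, C) query features
        # reference_points: (B, Nq, 2), normalized to [0, 1]
        # value_maps:       list of L feature maps, each (B, C, H_l, W_l)
        B, Nq, C = query.shape
        offsets = self.sampling_offsets(query).view(B, Nq, self.n_levels, self.n_points, 2)
        weights = self.attention_weights(query).view(B, Nq, self.n_levels * self.n_points)
        weights = weights.softmax(-1).view(B, Nq, self.n_levels, self.n_points)

        sampled = []
        for l, feat in enumerate(value_maps):
            _, _, H, W = feat.shape
            v = self.value_proj(feat.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
            # Offsets are in pixels of this level; convert sampling locations to [-1, 1] for grid_sample.
            scale = torch.tensor([W, H], dtype=feat.dtype, device=feat.device)
            loc = reference_points[:, :, None, :] + offsets[:, :, l] / scale   # (B, Nq, K, 2)
            grid = 2 * loc - 1
            sampled.append(F.grid_sample(v, grid, align_corners=False))        # (B, C, Nq, K)
        sampled = torch.stack(sampled, dim=3)                                   # (B, C, Nq, L, K)
        out = (sampled * weights[:, None]).sum(dim=(3, 4))                      # (B, C, Nq)
        return self.output_proj(out.transpose(1, 2))                            # (B, Nq, C)
```

The point the sketch illustrates: each query touches only L × K sampled locations (e.g. 4 × 4 = 16) instead of every pixel at every scale, which is what removes the quadratic cost.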
Deformable Transformer Encoder.
- Both the input and output of the encoder are multi-scale feature maps with the same resolutions.
- For each query pixel, the reference point is itself. To identify which feature level each query pixel lies in, we add a scale-level embedding to the feature representation (a rough sketch follows below).
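A rough sketch of those two bullets; the helper name and shapes are hypothetical, and the official repository organizes this differently.

```python
import torch
import torch.nn as nn

def make_reference_points(spatial_shapes, device="cpu"):
    # For encoder queries the reference point is the pixel itself: each position's
    # own center coordinates, normalized to [0, 1] within its feature level.
    points = []
    for H, W in spatial_shapes:
        ys, xs = torch.meshgrid(
            torch.linspace(0.5, H - 0.5, H, device=device) / H,
            torch.linspace(0.5, W - 0.5, W, device=device) / W,
            indexing="ij",
        )
        points.append(torch.stack([xs, ys], dim=-1).reshape(-1, 2))
    return torch.cat(points, dim=0)                 # (sum_l H_l * W_l, 2)

# One learned vector per feature level, broadcast-added to that level's flattened
# features so the attention module can tell the levels apart:
level_embed = nn.Parameter(torch.zeros(4, 256))
# e.g. flat_feats_l = flat_feats_l + level_embed[l]   # (B, H_l * W_l, 256)
```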
Deformable Transformer Decoder.
- For each object query, the 2-d normalized coordinate of the reference point p̂_q is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.
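In code form, that sentence amounts to something like the following (illustrative names, assuming d_model=256 and 300 object queries):

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 300
query_embed = nn.Embedding(num_queries, d_model)   # object query embeddings
ref_point_head = nn.Linear(d_model, 2)             # learnable linear projection

# 2-d normalized reference points in [0, 1], one per object query
reference_points = ref_point_head(query_embed.weight).sigmoid()   # (300, 2)
```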
4.2 Additional Improvements And Variants For Deformable Detr#
Iterative Bounding Box Refinement.
- Here, each decoder layer refines the bounding boxes based on the predictions from the previous layer.
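Roughly, writing b̂_q^d for the normalized box predicted for query q by decoder layer d, the refinement rule is the following (my transcription; the paper's appendix applies it per coordinate, and as I recall it blocks gradients through the previous layer's box for stability):

```latex
% Each decoder layer d predicts a delta on top of the previous layer's box,
% in inverse-sigmoid space, so coordinates stay normalized in [0, 1]:
\hat{b}_q^{\,d} = \sigma\bigl(\Delta b_q^{\,d} + \sigma^{-1}(\hat{b}_q^{\,d-1})\bigr), \qquad d = 1, \dots, D
```

Here σ and σ^{-1} are the sigmoid and inverse sigmoid, and b̂_q^0 is the initial box derived from the reference point.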
Two-Stage Deformable DETR.
- We explore a variant of Deformable DETR for generating region proposals as the first stage. The generated region proposals will be fed into the decoder as object queries for further refinement.
Other Notes#
Deformable DETR borrows the idea of deformable convolution (DCN) and proposes a deformable attention mechanism: each feature pixel no longer interacts with every other feature pixel, but only with a small set of other pixels obtained by sampling, and the positions of these sampling points are learnable. This is a local and sparse, and therefore efficient, attention mechanism, which addresses DETR's slow convergence and the limit on the feature resolution it can handle. Compared with DETR in detail, the main differences are:
1. Multi-scale features;
2. A new scale-level embedding, used to distinguish the different feature levels (a consequence of point 1);
3. Multi-scale deformable attention replaces the self-attention in the encoder and the cross-attention in the decoder;
4. Reference points are introduced, which to some extent act as a prior;
5. Two "upgraded" variants are built on top: the iterative bounding box refinement strategy and the two-stage mode;
6. The regression branch of the detection head predicts bbox offsets rather than absolute coordinate values (see the sketch after this list).
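A hedged sketch of point 6; the names and the exact treatment of width/height are my assumptions, and this is the same mechanism that the iterative refinement formula above applies layer by layer.

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

d_model = 256
bbox_head = nn.Linear(d_model, 4)   # predicts offsets (dx, dy, dw, dh), not absolute boxes

def predict_boxes(decoder_output, reference_points):
    # decoder_output:   (Nq, 256) per-query decoder features
    # reference_points: (Nq, 2) normalized (cx, cy) reference points
    delta = bbox_head(decoder_output)
    # Center offsets are added to the reference point in inverse-sigmoid space,
    # so the final coordinates stay normalized in [0, 1].
    cxcy = (delta[:, :2] + inverse_sigmoid(reference_points)).sigmoid()
    wh = delta[:, 2:].sigmoid()
    return torch.cat([cxcy, wh], dim=-1)   # normalized (cx, cy, w, h)
```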
Author: traviscui
Source: https://www.cnblogs.com/traviscui/p/16436196.html
License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.