We present a new model named Stacked-DETR (SDETR), which inherits the main ideas of the canonical DETR. We improve DETR in two directions: simplifying the cost of training and introducing a stacked architecture to enhance performance. For the former, we focus on the inside of the attention block and propose the QKVA grid, a new perspective for describing the attention process. With it, we can take a step further in understanding how attention works on image problems and what effect multi-head attention has. These two ideas contribute to the design of a single-head encoder layer. For the latter, SDETR achieves a clear improvement (+1.1 AP, +3.4 APs) over DETR. In particular, on small objects SDETR achieves better results than the optimized Faster R-CNN baseline, which was a shortcoming of DETR. Our changes are based on the code of DETR. Training code and pretrained models are available at https://github.com/shengwenyuan/sdetr.
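The single-head encoder layer mentioned above builds on standard scaled dot-product attention restricted to one head. As a rough illustration (not the paper's implementation; the QKVA grid itself is defined in the paper and not reproduced here), a minimal NumPy sketch of single-head attention might look like:

```python
import numpy as np

def single_head_attention(x, Wq, Wk, Wv):
    """Scaled dot-product attention with a single head.

    x: (n, d) token features; Wq, Wk, Wv: (d, d) projection matrices.
    Returns (n, d) attended features.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n, n) grid of pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # weighted sum of values

# Toy usage: 4 tokens of dimension 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = single_head_attention(x, Wq, Wk, Wv)
print(out.shape)
```

With a single head, the full model dimension is used for one attention map, rather than being split across several smaller heads as in multi-head attention.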