In this paper, we present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer by layer. Using box coordinates not only helps leverage explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer by layer in a cascade manner. As a result, it achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting, e.g., AP 45.7\% with a ResNet50-DC5 backbone trained for 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our method. Code is available at \url{https://github.com/SlongLiu/DAB-DETR}.
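The core idea — using 4D anchor box coordinates directly as decoder queries — can be sketched in a few lines. The following is a minimal, illustrative sketch, not the authors' implementation: the function names (`sine_embed`, `anchor_box_query`) and parameter choices are our own assumptions, and only the DETR-style sinusoidal encoding of coordinates is taken from the paper's description.

```python
import math

def sine_embed(coord, dim=128, temperature=10000.0):
    """DETR-style sinusoidal embedding of one normalized coordinate.

    Illustrative sketch only: names and defaults are assumptions,
    not the authors' code.
    """
    emb = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        emb.append(math.sin(2 * math.pi * coord / freq))
        emb.append(math.cos(2 * math.pi * coord / freq))
    return emb

def anchor_box_query(box, dim=128):
    """Build a positional query directly from a 4D anchor (x, y, w, h),
    each normalized to [0, 1].

    Concatenating the four coordinate embeddings yields a 4*dim query
    carrying an explicit positional prior; each decoder layer can then
    refine the anchor by predicting a box offset (the layer-by-layer
    dynamic update described in the abstract).
    """
    x, y, w, h = box
    return (sine_embed(x, dim) + sine_embed(y, dim)
            + sine_embed(w, dim) + sine_embed(h, dim))

# Conceptually, the w/h entries let the model rescale the x- and y-parts
# of the positional attention map, so wide boxes attend over wider
# x-ranges -- the "soft ROI pooling" interpretation of queries.
q = anchor_box_query((0.5, 0.5, 0.2, 0.3))
print(len(q))  # 512 = 4 coordinates * 128 dims each
```

In the full model these embeddings would be projected by an MLP and added into decoder cross-attention; the sketch above only shows how a box becomes a query vector.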