This paper addresses the problem of unstable matching across decoder layers in DEtection TRansformers (DETR). We point out that this instability is caused by a multi-optimization path problem, which is made prominent by DETR's one-to-one matching design. To address it, we show that the most important design choice is to use positional metrics (such as IoU), and only positional metrics, to supervise the classification scores of positive examples. Following this principle, we propose two simple yet effective modifications that integrate positional metrics into DETR's classification loss and matching cost, named position-supervised loss and position-modulated cost, respectively. We verify our methods on several DETR variants, where they show consistent improvements over the baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on the COCO detection benchmark with a ResNet-50 backbone under the 12-epoch and 24-epoch training settings, respectively, setting a new record under the same settings. We also achieve 63.8 AP on COCO detection test-dev with a Swin-Large backbone. Our code will be made available at https://github.com/IDEA-Research/Stable-DINO.
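As a rough illustration of the principle stated above, the sketch below supervises a positive query's classification score with its IoU instead of a hard label of 1, and rescales the score by IoU inside the classification term of the matching cost. The function names and the exact functional forms (binary cross-entropy with an IoU soft target, and a product of score and IoU in the cost) are assumptions chosen for illustration. They are not taken from the abstract and may differ from the formulation in the paper or in the released code.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy between a probability and a (soft) target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def position_supervised_loss(prob, iou, is_positive):
    # Assumption: positives are trained toward their IoU with the matched
    # ground-truth box; negatives are trained toward 0.
    target = iou if is_positive else 0.0
    return bce(prob, target)


def position_modulated_cost(prob, iou):
    # Assumption: the classification term of the matching cost is evaluated
    # on the score rescaled by IoU, so a confident but poorly localized
    # prediction is expensive to match.
    return bce(prob * iou, 1.0)


gt = (0.0, 0.0, 2.0, 2.0)
well_localized = (0.0, 0.0, 2.0, 1.8)
poorly_localized = (1.0, 0.0, 3.0, 2.0)

for name, box in [("well", well_localized), ("poorly", poorly_localized)]:
    iou = box_iou(box, gt)
    print(name, round(iou, 3),
          round(position_supervised_loss(0.9, iou, True), 3),
          round(position_modulated_cost(0.9, iou), 3))
```

With the same classification score of 0.9, the poorly localized box receives both a larger loss and a larger matching cost in this sketch, which is the coupling between classification and localization quality that the abstract describes.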