In this paper, we present a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds noised ground-truth bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the difficulty of bipartite graph matching and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR yields a remarkable improvement ($+1.9$ AP) under the same setting and achieves the best result (AP $43.4$ and $48.6$ with $12$ and $50$ epochs of training, respectively) among DETR-like methods with a ResNet-$50$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with $50\%$ of the training epochs. Code is available at \url{https://github.com/FengLi-ust/DN-DETR}.
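To illustrate the idea, below is a minimal sketch (not the released implementation) of how ground-truth boxes might be perturbed before being fed to the decoder as denoising queries; it assumes PyTorch and boxes in normalized center-size format, and the function name and noise-ratio parameters are hypothetical.
\begin{verbatim}
import torch

def noise_gt_boxes(gt_boxes: torch.Tensor,
                   center_noise: float = 0.4,
                   scale_noise: float = 0.4) -> torch.Tensor:
    """Perturb ground-truth boxes; the decoder is trained to
    reconstruct the originals.

    gt_boxes: (num_boxes, 4) tensor of normalized (cx, cy, w, h).
    center_noise / scale_noise: hypothetical noise ratios
    relative to the box size.
    """
    cx, cy, w, h = gt_boxes.unbind(-1)
    # Shift centers by a random fraction of the box width/height.
    cx = cx + (torch.rand_like(cx) * 2 - 1) * center_noise * w
    cy = cy + (torch.rand_like(cy) * 2 - 1) * center_noise * h
    # Rescale width/height by a random factor around 1.
    w = w * (1 + (torch.rand_like(w) * 2 - 1) * scale_noise)
    h = h * (1 + (torch.rand_like(h) * 2 - 1) * scale_noise)
    noised = torch.stack([cx, cy, w, h], dim=-1).clamp(0.0, 1.0)
    # These noised boxes are fed into the decoder as extra
    # (denoising) queries alongside the ordinary learnable queries.
    return noised
\end{verbatim}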