Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer from significant performance drops on small-size datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method that provides richer supervision and improves data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small-size and sample-rich datasets. Code will be made publicly available at \url{https://github.com/encounter1997/DE-DETRs}.
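The abstract does not spell out the cross-attention modification, so the following is a minimal sketch of the underlying idea, assuming a DETR-style decoder in PyTorch/torchvision: each object query attends over features sampled from a local image region (here via RoIAlign around a per-query box estimate) instead of the full flattened feature map. The function name `local_key_value`, the use of RoIAlign, and the `roi_size` parameter are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torchvision.ops as ops


def local_key_value(features, boxes, roi_size=4):
    """Build sparse key/value sequences from local image areas (sketch).

    features: (B, C, H, W) backbone feature map
    boxes:    list of B tensors, each (N, 4) in (x1, y1, x2, y2)
              feature-map coordinates, one box per object query
    returns:  (B*N, roi_size*roi_size, C) per-query key/value sequence
    """
    # RoIAlign crops a fixed-size grid of features inside each box, so a
    # query attends over a small local region rather than the whole image.
    rois = ops.roi_align(features, boxes, output_size=roi_size,
                         spatial_scale=1.0, aligned=True)
    # (B*N, C, roi_size, roi_size) -> (B*N, roi_size*roi_size, C)
    return rois.flatten(2).transpose(1, 2)
```

Under this sketch, each query's cross-attention runs over its own `roi_size * roi_size` tokens instead of the `H * W` tokens of the full feature map, which is what makes the feature sampling sparse and local.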
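Likewise, a hedged sketch of what a label augmentation scheme for DETR-style training could look like: repeating each ground-truth annotation so that bipartite matching can assign several queries to the same object, yielding more positive supervision per image. The `repeats` factor and the `'boxes'`/`'labels'` dict layout follow the DETR codebase convention but are assumptions here, not the paper's exact recipe.

```python
def augment_labels(targets, repeats=2):
    """Repeat each ground-truth annotation `repeats` times (sketch).

    targets: list of dicts with 'boxes' (N, 4) and 'labels' (N,) tensors,
             as in the DETR codebase. Hungarian matching on the repeated
    targets lets multiple queries receive positive supervision per object.
    """
    return [
        {"boxes": t["boxes"].repeat(repeats, 1),
         "labels": t["labels"].repeat(repeats)}
        for t in targets
    ]
```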