Earlier object detectors make predictions from dense grid points or a large number of preset anchors, and most of them are trained with one-to-many label assignment strategies. In contrast, recent query-based object detectors rely on a sparse set of learnable queries and a series of decoder layers, and apply one-to-one label assignment independently at each layer for deep supervision during training. Despite the great success of query-based object detection, this one-to-one label assignment strategy requires detectors to have strong fine-grained discrimination and modeling capacity. To address this problem, we propose a new query-based object detector with cross-stage interaction, named StageInteractor. In the forward pass, we improve modeling ability efficiently by reusing dynamic operators with lightweight adapters. For label assignment, a cross-stage label assigner is applied after the one-to-one label assignment: the training target class labels are gathered across stages and then reallocated to the proper predictions at each decoder layer. On the MS COCO benchmark, our model improves on the baseline by 2.2 AP and achieves 44.8 AP with a ResNet-50 backbone, 100 queries, and 12 training epochs. With longer training and 300 queries, StageInteractor achieves 51.1 AP and 52.2 AP with ResNeXt-101-DCN and Swin-S backbones, respectively.
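The gather-and-reallocate step of the cross-stage label assigner can be illustrated with a small sketch. The abstract does not state the criterion used to decide which predictions are "proper", so the IoU threshold below is an assumption, as are the function name `cross_stage_assign`, its parameters, and the choice to always keep each stage's own one-to-one match. The per-stage one-to-one matches are taken as given inputs rather than computed here.

```python
from typing import List, Optional, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cross_stage_assign(
    matches: List[List[Optional[int]]],  # matches[s][q]: GT index or None
    pred_boxes: List[List[Box]],         # pred_boxes[s][q]: box of query q at stage s
    gt_boxes: List[Box],
    gt_labels: List[int],
    iou_thresh: float = 0.5,             # assumed criterion, not from the abstract
) -> List[List[Set[int]]]:
    """Return, for each stage and query, a set of target class labels."""
    num_queries = len(matches[0])

    # Gather: collect every GT index matched to a query at any stage.
    gathered: List[Set[int]] = [set() for _ in range(num_queries)]
    for stage in matches:
        for q, g in enumerate(stage):
            if g is not None:
                gathered[q].add(g)

    # Reallocate: at each stage, a query receives a gathered label only if
    # its prediction at that stage overlaps the GT box well enough.
    targets: List[List[Set[int]]] = []
    for s, stage in enumerate(matches):
        stage_targets: List[Set[int]] = []
        for q in range(num_queries):
            labels: Set[int] = set()
            if stage[q] is not None:
                # Assumption: the stage's own one-to-one match is always kept.
                labels.add(gt_labels[stage[q]])
            for g in gathered[q]:
                if iou(pred_boxes[s][q], gt_boxes[g]) >= iou_thresh:
                    labels.add(gt_labels[g])
            stage_targets.append(labels)
        targets.append(stage_targets)
    return targets


# Toy example: one ground-truth box (class 3), two queries, two stages.
gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
gt_labels = [3]
matches = [[0, None], [None, 0]]  # the matched query switches between stages
pred_boxes = [
    [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 9.0, 10.0)],    # stage 0
    [(50.0, 50.0, 60.0, 60.0), (0.0, 0.0, 10.0, 10.0)],  # stage 1
]
targets = cross_stage_assign(matches, pred_boxes, gt_boxes, gt_labels)
```

In this toy case, query 1 at stage 0 is unmatched by the one-to-one assignment but still receives class 3, because it is matched at stage 1 and its stage-0 box overlaps the ground truth. Query 0 at stage 1 receives no label, since its box has drifted away.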