Multi-scale features have proven highly effective for object detection, but they often come with huge or even prohibitive extra computation costs, especially for recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, which is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA significantly boosts the performance of multiple Transformer-based object detectors with only slight computational overhead.
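To make the sparse sampling step concrete, below is a minimal PyTorch sketch of the scale-adaptive sampling idea described above: given a few keypoint locations derived from prior detection predictions, features are gathered from each level of a feature pyramid by bilinear interpolation and concatenated into a small set of extra tokens. This is a sketch under assumed shapes and naming conventions, not the authors' implementation; the function name, the normalized-coordinate convention, and the downstream use of the tokens are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sample_sparse_multiscale_features(feature_pyramid, sample_points):
    """Hypothetical helper illustrating sparse multi-scale sampling.

    feature_pyramid: list of [B, C, H_l, W_l] maps at different scales.
    sample_points:   [B, N, 2] locations in normalized [0, 1] image coordinates,
                     e.g. a few keypoints derived from prior box predictions.
    Returns:         [B, N, num_levels * C] sparse multi-scale tokens.
    """
    # grid_sample expects (x, y) coordinates in [-1, 1]
    grid = sample_points * 2.0 - 1.0      # [B, N, 2]
    grid = grid.unsqueeze(2)              # [B, N, 1, 2]

    tokens = []
    for feat in feature_pyramid:
        # Bilinear sampling only touches the N requested locations per level,
        # so the cost scales with N rather than with the full H x W feature map.
        sampled = F.grid_sample(feat, grid, mode="bilinear", align_corners=False)
        tokens.append(sampled.squeeze(-1).permute(0, 2, 1))  # [B, N, C]

    return torch.cat(tokens, dim=-1)      # [B, N, num_levels * C]


if __name__ == "__main__":
    B, C, N = 2, 256, 50
    pyramid = [torch.randn(B, C, s, s) for s in (100, 50, 25)]
    points = torch.rand(B, N, 2)          # stand-in for predicted keypoint locations
    print(sample_sparse_multiscale_features(pyramid, points).shape)  # torch.Size([2, 50, 768])
```

In the paradigm described in the abstract, tokens produced this way would be fed back into the encoder-decoder pipeline so that the encoded features are iteratively refined by subsequent prediction rounds; the sketch only covers the sampling step itself.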