Vision transformers (ViTs) are changing the landscape of object detection. A natural way to use ViTs in detection is to replace the CNN-based backbone with a transformer-based one. This is straightforward and effective, but it comes at the price of a considerable computational burden during inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that takes an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, which for the first time achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level, anchor-based dense prediction problem by centering on two entry points: 1) eliminating the training-inefficient decoder and leveraging two strong encoders to preserve the accuracy of single-level feature map prediction; 2) exploring low-level semantic features for the detection task with limited computational resources. In particular, guided by a well-conceived ablation study, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with a 28% reduction in computation cost and more than $10\times$ fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains an AP gain of over 5.5% while cutting computation cost by 70%.
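The encoder-only, single-level, anchor-based dense prediction formulation can be illustrated with a minimal Python sketch. It assumes a conventional anchor design: a fixed set of anchors centered on every cell of one feature map, plus R-CNN-style box-delta decoding. The function names `single_level_anchors` and `decode`, the stride, the anchor sizes, and the aspect ratios are hypothetical choices for illustration, not DFFT's actual configuration. A dense head would predict class scores and four regression deltas for every anchor produced here.

```python
import math
from typing import List, Tuple

# (x1, y1, x2, y2) in input-image pixels.
Box = Tuple[float, float, float, float]


def single_level_anchors(feat_h: int, feat_w: int, stride: int,
                         sizes: List[float], ratios: List[float]) -> List[Box]:
    """Place len(sizes) * len(ratios) anchors at every cell of ONE feature map.

    Hypothetical helper: the anchor layout is an assumption, not DFFT's code.
    """
    anchors: List[Box] = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Anchor center = center of the feature-map cell, mapped to image pixels.
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    # ratio = h / w; the anchor area stays equal to size ** 2.
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def decode(anchor: Box, deltas: Tuple[float, float, float, float]) -> Box:
    """Apply predicted (dx, dy, dw, dh) deltas to an anchor (assumed R-CNN-style encoding)."""
    x1, y1, x2, y2 = anchor
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    dx, dy, dw, dh = deltas
    # Shift the center proportionally to the anchor size; scale width/height in log space.
    ncx, ncy = cx + dx * w, cy + dy * h
    nw, nh = w * math.exp(dw), h * math.exp(dh)
    return (ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2)


if __name__ == "__main__":
    # Hypothetical setting: an 800x1216 image seen through a single stride-32 map (25x38 cells).
    anchors = single_level_anchors(25, 38, 32, [64.0, 128.0, 256.0], [0.5, 1.0, 2.0])
    print(len(anchors))  # 25 * 38 * 9 anchors, all on one level
```

With 3 sizes and 3 ratios, this hypothetical stride-32 map yields 25 × 38 × 9 = 8,550 anchors from a single level. A multi-level detector such as RetinaNet instead repeats this placement over every level of a feature pyramid, which is one source of the extra computation a single-level design avoids.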