Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective but comes at the price of a considerable computation burden at inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that demands an extra-long training time to converge. As a result, transformer-based object detection has not prevailed in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector that, for the first time, achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level, anchor-based dense prediction problem by centering on two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics, guided by a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with a 28% computation cost reduction and more than $10\times$ fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% of the computation cost.
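To make the "encoder-only, single-level, anchor-based dense prediction" formulation concrete, the sketch below shows a minimal head that predicts class scores and box offsets for every anchor on one feature map. This is a hypothetical illustration of the general technique, not the authors' released implementation; the channel count, anchor count, and class count are assumptions.

```python
# Minimal, hypothetical sketch of an encoder-only, single-level, anchor-based
# dense prediction head (illustrative only; not the DFFT implementation).
import torch
import torch.nn as nn


class DensePredictionHead(nn.Module):
    """Predict per-anchor class logits and box deltas on a single-level feature map."""

    def __init__(self, in_channels=256, num_anchors=9, num_classes=80):
        super().__init__()
        # One classification branch and one regression branch, applied densely.
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.reg_head = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, feat):
        # feat: (B, C, H, W), the single-level feature map produced by the encoders.
        cls_logits = self.cls_head(feat)   # (B, A * num_classes, H, W)
        box_deltas = self.reg_head(feat)   # (B, A * 4, H, W)
        return cls_logits, box_deltas


# Usage with a dummy single-level feature map (e.g., a stride-32 encoder output).
feat = torch.randn(1, 256, 20, 20)
cls_logits, box_deltas = DensePredictionHead()(feat)
```

Because all predictions come from dense convolutions over one feature level, no decoder or learned object queries are involved, which is the property the abstract attributes to the decoder-free design.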