We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
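To make the "simple feature pyramid from a single-scale feature map" idea concrete, here is a minimal NumPy sketch. It is a hypothetical illustration only: ViTDet builds the pyramid with (de)convolution layers on the 1/16-scale ViT output, whereas this sketch uses nearest-neighbor upsampling and strided max pooling, and the scale set and function name are assumptions, not the paper's implementation.

```python
import numpy as np

def simple_feature_pyramid(feat, scales=(4.0, 2.0, 1.0, 0.5)):
    """Build multi-scale maps from a single-scale (e.g. 1/16) feature map.

    Hypothetical sketch: upsampling by nearest-neighbor repetition and
    downsampling by strided max pooling stand in for the learned
    (de)convolutions used in ViTDet.
    """
    h, w, c = feat.shape
    pyramid = []
    for s in scales:
        if s >= 1.0:
            k = int(s)
            # Upsample: repeat each spatial cell k times along both axes.
            out = feat.repeat(k, axis=0).repeat(k, axis=1)
        else:
            k = int(round(1.0 / s))
            # Downsample: non-overlapping k x k max pooling.
            out = feat[:h - h % k, :w - w % k].reshape(
                h // k, k, w // k, k, c).max(axis=(1, 3))
        pyramid.append(out)
    return pyramid

# A 14 x 14 map (224 x 224 input at patch size 16, ViT-B width 768)
feat = np.random.rand(14, 14, 768)
maps = simple_feature_pyramid(feat)
# Yields maps of spatial sizes 56x56, 28x28, 14x14, and 7x7.
```

The key point the sketch illustrates is that all pyramid levels are derived from one backbone output, so the backbone itself stays plain and single-scale, in contrast to FPN, which taps multiple stages of a hierarchical backbone.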