We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code will be made available.
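The "simple feature pyramid from a single-scale feature map" can be sketched as follows. This is a hypothetical illustration of the data flow only: ViTDet uses learned deconvolution and convolution layers to produce the pyramid levels, whereas this sketch substitutes nearest-neighbor up-sampling and strided down-sampling, and the shapes and stride set {4, 8, 16, 32} are assumptions based on the standard ViT output stride of 16.

```python
import numpy as np

def simple_feature_pyramid(feat, scales=(4, 2, 1, 0.5)):
    """Illustrative sketch: build multi-scale maps from a single
    stride-16 ViT feature map. The paper uses learned (de)convolutions;
    nearest-neighbor resizing here only shows the overall data flow."""
    pyramid = []
    for s in scales:
        if s >= 1:
            k = int(s)
            # up-sample by repeating pixels (stride 16 -> 16 / s)
            pyramid.append(feat.repeat(k, axis=0).repeat(k, axis=1))
        else:
            # down-sample by striding (stride 16 -> 32)
            step = int(round(1 / s))
            pyramid.append(feat[::step, ::step])
    return pyramid

# ViT output for a 1024x1024 image: 64x64 tokens, 256 channels (assumed)
feat = np.zeros((64, 64, 256), dtype=np.float32)
pyr = simple_feature_pyramid(feat)
print([p.shape[:2] for p in pyr])  # spatial sizes for strides 4, 8, 16, 32
```

In contrast to FPN, no lateral or top-down connections between backbone stages are involved: every pyramid level is derived from the same last-layer feature map.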