Object detection is a central downstream task used to test whether pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP by up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.
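To make the architectural point concrete, below is a minimal PyTorch sketch of one way a plain, single-scale ViT can feed Mask R-CNN's multi-scale detection head: the stride-16 patch-token map from the last transformer block is expanded into a four-level pyramid with deconvolutions and pooling. The module name `SimpleViTFeaturePyramid` and all dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SimpleViTFeaturePyramid(nn.Module):
    """Hypothetical sketch: adapt a plain ViT's single-scale output for
    an FPN-style detection head by building strides 4/8/16/32 from the
    stride-16 patch-token map (names and dims are illustrative)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # stride 16 -> 4: two 2x upsampling deconvolutions
        self.to_stride4 = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
        )
        # stride 16 -> 8: one 2x upsampling deconvolution
        self.to_stride8 = nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2)
        # stride 16 -> 32: 2x downsampling
        self.to_stride32 = nn.MaxPool2d(kernel_size=2)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> dict:
        # tokens: (B, h*w, C) patch embeddings from the last ViT block
        # (class token already removed), where (h, w) is the token grid.
        B, N, C = tokens.shape
        assert N == h * w, "token count must match the h*w patch grid"
        feat16 = tokens.transpose(1, 2).reshape(B, C, h, w)  # stride-16 map
        return {
            "p2": self.to_stride4(feat16),   # stride 4
            "p3": self.to_stride8(feat16),   # stride 8
            "p4": feat16,                    # stride 16
            "p5": self.to_stride32(feat16),  # stride 32
        }

# Example: a 1024x1024 image with 16x16 patches yields a 64x64 token grid.
tokens = torch.randn(2, 64 * 64, 768)
feats = SimpleViTFeaturePyramid()(tokens, 64, 64)
print({name: tuple(f.shape) for name, f in feats.items()})
```

These levels can then be consumed by a standard FPN/Mask R-CNN head unchanged; the point of the sketch is only that no hierarchical backbone is required to produce the multi-scale features the detector expects.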