This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have become highly mature and robust, the recipes for ViT are yet to be built, especially in self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components of training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and that it can be hidden by apparently good results. We reveal that these results are in fact partial failures, and that they improve when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations across various aspects. We discuss the currently positive evidence as well as the remaining challenges and open questions. We hope this work will provide useful data points and experience for future research.
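To make the MoCo v3 setup concrete, below is a minimal PyTorch-style sketch of its symmetrized contrastive objective, adapted from the pseudocode in the paper. The encoder modules, momentum coefficient `m`, and temperature `tau` are illustrative assumptions rather than the exact training configuration; in the paper, the base encoder `f_q` additionally carries a prediction head that the momentum encoder `f_k` lacks, which this sketch leaves inside the opaque modules.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    # InfoNCE over in-batch pairs: positives lie on the diagonal.
    # tau=0.2 is an assumed value, not necessarily the paper's setting.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                       # [N, N] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # The 2 * tau scaling follows the MoCo v3 pseudocode.
    return 2 * tau * F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.99):
    # Exponential moving average of the base encoder's parameters.
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_v3_step(f_q, f_k, x1, x2):
    # x1, x2: two augmented views of the same minibatch.
    q1, q2 = f_q(x1), f_q(x2)                      # queries from the base encoder
    with torch.no_grad():
        k1, k2 = f_k(x1), f_k(x2)                  # keys from the momentum encoder
    # Symmetrized: each view's queries are matched to the other view's keys.
    return contrastive_loss(q1, k2) + contrastive_loss(q2, k1)
```

A training step would backpropagate the returned loss through `f_q` only, then call `momentum_update(f_q, f_k)`; the keys carry no gradient, which is what makes the momentum encoder a stable target.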