In this paper, we question whether self-supervised learning provides new properties to Vision Transformers (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. Our study also underlines the importance of the momentum encoder, multi-crop training, and the use of small patches with ViTs. We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels. We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
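The self-distillation idea the abstract names can be sketched as follows: a student network is trained to match the output distribution of a momentum (EMA) teacher, where the teacher's targets are centered and sharpened with a low temperature. This is a minimal NumPy sketch under stated assumptions; the function names, temperatures, and momentum value are illustrative, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    # Teacher targets: centered (to avoid collapse to a uniform output)
    # and sharpened with a low temperature; no gradient flows through them.
    p_teacher = softmax(teacher_out - center, t_teacher)
    log_p_student = np.log(softmax(student_out, t_student))
    # Cross-entropy between teacher and student distributions.
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, m=0.996):
    # The teacher is an exponential moving average of the student,
    # i.e. the "momentum encoder" the abstract refers to.
    return [m * t + (1.0 - m) * s
            for t, s in zip(teacher_params, student_params)]
```

In the full method, several crops of the same image are encoded (multi-crop training), the loss is summed over student/teacher crop pairs, and only the student receives gradients.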