We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance on downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attention to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also causes self-attention to collapse into homogeneity across all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, while MIM utilizes the high-frequency signals. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Based on these analyses, we find that CL and MIM can complement each other, and we observe that even the simplest harmonization of the two helps leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.
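As an illustration of finding (1), the sketch below shows one simple way such self-attention homogeneity could be quantified: the average pairwise cosine similarity between the attention distributions of different query tokens in a layer. This is a hypothetical measure for intuition only, not the metric used in the paper or the released code; the tensor shapes and the `attention_homogeneity` helper are assumptions.

```python
# Hedged sketch (not the authors' code): quantify how similar the attention
# distributions of different query tokens are. Values near 1 mean all queries
# attend to nearly the same keys, i.e. the "collapse into homogeneity"
# described in finding (1).
import torch
import torch.nn.functional as F


def attention_homogeneity(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N, N), rows are softmax distributions over keys.
    Returns the mean pairwise cosine similarity between query rows,
    averaged over heads and batch."""
    B, H, N, _ = attn.shape
    rows = F.normalize(attn, dim=-1)          # unit-norm each query's distribution
    sim = rows @ rows.transpose(-1, -2)       # (B, H, N, N) pairwise cosine similarities
    off_diag = sim.sum(dim=(-1, -2)) - sim.diagonal(dim1=-2, dim2=-1).sum(-1)
    return (off_diag / (N * (N - 1))).mean()  # average over query pairs, heads, batch


# Toy usage with random attention maps (stand-ins for real CL- / MIM-trained ViTs):
if __name__ == "__main__":
    B, H, N = 2, 12, 197
    logits = torch.randn(B, H, N, N)
    diverse = F.softmax(logits, dim=-1)       # query-dependent attention
    collapsed = F.softmax(logits.mean(2, keepdim=True).expand(-1, -1, N, -1), dim=-1)
    print("diverse:  ", attention_homogeneity(diverse).item())
    print("collapsed:", attention_homogeneity(collapsed).item())  # = 1, fully homogeneous
```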
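The "simplest harmonization" mentioned at the end of the abstract can be pictured as a weighted sum of a contrastive objective and a masked-reconstruction objective computed on the same backbone. The sketch below is a minimal, hypothetical illustration of that idea, not the implementation in the linked repository; the `info_nce`, `masked_recon_loss`, and `harmonized_loss` helpers and the weight `lam` are assumptions made for illustration.

```python
# Hedged sketch: combine a CL loss and a MIM loss with a tunable weight `lam`.
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Standard InfoNCE between two batches of projected views, each (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)


def masked_recon_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """MIM-style pixel regression on masked patches only.
    pred, target: (B, N, P) patch pixels; mask: (B, N) with 1 = masked."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


def harmonized_loss(cl_views, mim_pred, mim_target, mim_mask, lam: float = 0.5) -> torch.Tensor:
    """lam * CL + (1 - lam) * MIM; `lam` is a hypothetical trade-off, not a value from the paper."""
    z1, z2 = cl_views
    return lam * info_nce(z1, z2) + (1.0 - lam) * masked_recon_loss(mim_pred, mim_target, mim_mask)


# Toy usage with random tensors in place of encoder/decoder outputs:
if __name__ == "__main__":
    B, D, N, P = 8, 128, 196, 768
    loss = harmonized_loss(
        (torch.randn(B, D), torch.randn(B, D)),
        torch.randn(B, N, P), torch.randn(B, N, P),
        (torch.rand(B, N) < 0.75).float(),   # e.g. 75% of patches masked
    )
    print(loss.item())
```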