State-of-the-art frameworks in self-supervised learning have recently shown that fully utilizing transformer-based models can lead to a performance boost over conventional CNN models. Striving to maximize the mutual information between two views of an image, existing works apply a contrastive loss to the final representations. Motivated by self-distillation in the supervised regime, we further exploit this by allowing the intermediate representations to learn from the final layer via the contrastive loss. Through self-distillation, the intermediate layers become better suited for instance discrimination, so that the performance of an early-exited sub-network degrades only slightly from that of the full network. This also renders the pretext task easier for the final layer, leading to better representations. Our method, Self-Distilled Self-Supervised Learning (SDSSL), outperforms competitive baselines (SimCLR, BYOL, and MoCo v3) using ViT on various tasks and datasets. Under the linear evaluation and k-NN protocols, SDSSL yields superior performance not only in the final layer but also in most of the lower layers. Furthermore, we use positive and negative alignment to explain how the representations are formed more effectively. Code will be available.
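The following is a minimal sketch, not the authors' implementation, of the self-distillation idea described above: each intermediate representation of one view is trained with the same contrastive (InfoNCE) objective against the final-layer representation of the other view, on top of the usual final-to-final contrastive term. The helper names, the temperature value, the loss weight `lam`, and the choice to stop gradients through the final-layer target are assumptions for illustration.

```python
# Hypothetical sketch of a self-distilled contrastive objective (assumed details,
# not the paper's released code).
import torch
import torch.nn.functional as F


def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Standard InfoNCE loss: matching batch indices are positives."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)


def sdssl_loss(inter_feats_v1, final_feat_v1, final_feat_v2, lam: float = 1.0) -> torch.Tensor:
    """Final-layer contrastive loss plus self-distillation terms that pull each
    intermediate representation of view 1 toward the final representation of view 2."""
    loss = info_nce(final_feat_v1, final_feat_v2)        # usual final-layer term
    for z in inter_feats_v1:                             # intermediate-layer terms
        loss = loss + lam * info_nce(z, final_feat_v2.detach())
    return loss
```

In practice the intermediate features would come from selected ViT blocks (e.g., the [CLS] token after a projection head), but the block-selection and projection details are not specified here and should be taken from the paper.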