Self-supervised learning has been widely applied to train high-quality vision transformers. Unleashing their excellent performance on memory- and compute-constrained devices is therefore an important research topic. However, how to distill knowledge from one self-supervised ViT to another has not yet been explored. Moreover, the existing self-supervised knowledge distillation (SSKD) methods, which focus on ConvNet-based architectures, are suboptimal for ViT knowledge distillation. In this paper, we study knowledge distillation of self-supervised vision transformers (ViT-SSKD). We show that directly distilling information from the crucial attention mechanism of the teacher to the student can significantly narrow the performance gap between the two. In experiments on ImageNet-Subset and ImageNet-1K, we show that our method AttnDistill outperforms existing SSKD methods and achieves state-of-the-art k-NN accuracy compared with self-supervised learning (SSL) methods trained from scratch (with the ViT-S model). We are also the first to apply the tiny ViT-T model to self-supervised learning. Moreover, AttnDistill is independent of the self-supervised learning algorithm, so it can be adapted to ViT-based SSL methods to improve their performance in future research. The code is available at: https://github.com/wangkai930418/attndistill
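
Below is a minimal sketch of the kind of attention-map distillation the abstract refers to: the student's self-attention distributions are pulled toward the teacher's. The tensor shapes, the head averaging, and the KL-based objective are illustrative assumptions, not the exact AttnDistill loss or code from the repository.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """Illustrative attention distillation objective (not the paper's exact loss).

    student_attn, teacher_attn: (batch, heads, tokens, tokens) attention
    probabilities from a transformer block. Heads are averaged so the shapes
    match even when the student (e.g. ViT-T) has fewer heads than the
    teacher (e.g. ViT-S).
    """
    s = student_attn.mean(dim=1)  # (batch, tokens, tokens)
    t = teacher_attn.mean(dim=1)
    # KL(teacher || student) over the key dimension for each query token
    return F.kl_div((s + eps).log(), t, reduction="batchmean")
```

In practice such a term would be added to the student's training loss, with the teacher's attention maps computed under `torch.no_grad()`.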