In the past few years, transformers have achieved promising performance on various computer vision tasks. Unfortunately, the immense inference overhead of most existing vision transformers prevents them from being deployed on edge devices such as cell phones and smart watches. Knowledge distillation is a widely used paradigm for compressing cumbersome architectures by transferring information from a large teacher to a compact student. However, most existing distillation methods are designed for convolutional neural networks (CNNs) and do not fully exploit the characteristics of vision transformers (ViTs). In this paper, we utilize patch-level information and propose a fine-grained manifold distillation method. Specifically, we train a tiny student model to match a pre-trained teacher model in the patch-level manifold space. We then decouple the manifold matching loss into three carefully designed terms to reduce the computational cost of computing patch relationships. Equipped with the proposed method, a DeiT-Tiny model containing 5M parameters achieves 76.5% top-1 accuracy on ImageNet-1k, which is 2.0% higher than previous distillation approaches. Transfer learning results on other classification benchmarks and downstream vision tasks also demonstrate the superiority of our method over state-of-the-art algorithms.
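As a rough illustration of matching a student to a teacher in a patch-level manifold space, the following minimal sketch compares pairwise cosine-similarity maps of patch embeddings. The abstract does not name the three decoupled terms, so the split used here (relations within an image, relations across images at the same patch position, and relations among a random subset of patches) is an assumption made for illustration, as are all function names and the `num_random` and `seed` parameters. Plain Python lists stand in for tensors.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def relation_map(patches):
    """Pairwise cosine similarities among patch embeddings."""
    unit = [l2_normalize(p) for p in patches]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def manifold_loss(student_patches, teacher_patches):
    """Squared Frobenius distance between student and teacher relation maps.

    The maps are N x N, so the two embedding widths may differ.
    """
    ms = relation_map(student_patches)
    mt = relation_map(teacher_patches)
    return sum((s - t) ** 2 for rs, rt in zip(ms, mt) for s, t in zip(rs, rt))


def decoupled_manifold_loss(student, teacher, num_random=4, seed=0):
    """student / teacher: nested lists shaped [batch][num_patches][dim]."""
    batch, num_patches = len(student), len(student[0])

    # Term 1 (assumed): relations among patches inside each image.
    intra = sum(manifold_loss(student[b], teacher[b]) for b in range(batch)) / batch

    # Term 2 (assumed): relations among same-position patches across images.
    inter = sum(
        manifold_loss(
            [student[b][p] for b in range(batch)],
            [teacher[b][p] for b in range(batch)],
        )
        for p in range(num_patches)
    ) / num_patches

    # Term 3 (assumed): relations among a random subset of all patches.
    rng = random.Random(seed)
    all_idx = [(b, p) for b in range(batch) for p in range(num_patches)]
    picks = rng.sample(all_idx, min(num_random, len(all_idx)))
    rand = manifold_loss(
        [student[b][p] for b, p in picks],
        [teacher[b][p] for b, p in picks],
    )
    return intra + inter + rand
```

In this sketch, a single relation map over all B x N patches in a batch would have (B·N)² entries, whereas the three smaller maps have roughly B·N² + N·B² + K² entries for K sampled patches, which is one way a decoupled loss can reduce the cost of computing patch relationships.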