This paper studies the model compression problem of vision transformers. Benefiting from the self-attention module, transformer architectures have shown extraordinary performance on many computer vision tasks. Although network performance is boosted, transformers often require more computational resources, including memory usage and inference complexity. In contrast to existing knowledge distillation approaches, we propose to excavate useful information from the teacher transformer through the relationships between images and their divided patches. We then explore an efficient fine-grained manifold distillation approach that simultaneously calculates cross-image, cross-patch, and randomly-selected manifolds in the teacher and student models. Experimental results on several benchmarks demonstrate the superiority of the proposed algorithm for distilling portable transformer models with higher performance. For example, our approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset when training a DeiT-Tiny model, outperforming other ViT distillation methods.
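To make the idea of matching cross-image, cross-patch, and randomly-selected manifolds concrete, the following is a minimal PyTorch sketch, not the paper's exact implementation. It assumes patch-level features `student_feat` and `teacher_feat` of shape (batch, patches, dim) are already extracted from corresponding layers; the function names and the sampling size `num_random` are illustrative choices.

```python
import torch
import torch.nn.functional as F

def manifold_relation(feat):
    """Similarity map over the last two axes of a (..., items, dim) tensor."""
    feat = F.normalize(feat, dim=-1)            # unit-length features
    return feat @ feat.transpose(-2, -1)        # (..., items, items) relation map

def manifold_distillation_loss(student_feat, teacher_feat, num_random=192):
    """Hedged sketch of a fine-grained manifold matching loss.

    student_feat: (B, N, D_s) patch features from the student layer.
    teacher_feat: (B, N, D_t) patch features from the teacher layer.
    Relation maps are dimension-free, so D_s and D_t may differ.
    """
    B, N, _ = student_feat.shape

    # Cross-patch (intra-image) relations: patch-to-patch similarity per image.
    intra = F.mse_loss(manifold_relation(student_feat),
                       manifold_relation(teacher_feat))

    # Cross-image relations: the same patch position compared across the batch.
    inter = F.mse_loss(manifold_relation(student_feat.transpose(0, 1)),
                       manifold_relation(teacher_feat.transpose(0, 1)))

    # Randomly-selected relations over all B*N patches to keep the cost manageable.
    idx = torch.randint(0, B * N, (num_random,), device=student_feat.device)
    s_rand = F.normalize(student_feat.reshape(B * N, -1)[idx], dim=-1)
    t_rand = F.normalize(teacher_feat.reshape(B * N, -1)[idx], dim=-1)
    rand = F.mse_loss(s_rand @ s_rand.T, t_rand @ t_rand.T)

    return intra + inter + rand
```

In this sketch the three terms are simply summed; in practice they would typically be weighted and added to the usual task and logit-distillation losses when training the student.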