Knowledge distillation is the process of transferring knowledge from a large model to a small model. In this process, the small model learns the generalization ability of the large model and achieves performance close to that of the large model. Knowledge distillation thus provides a training technique for transferring model knowledge, easing model deployment and speeding up inference. However, previous distillation methods require a pre-trained teacher model, which still incurs computational and storage overhead. In this paper, we propose a novel general training framework called Self Distillation (SD). We demonstrate the effectiveness of our method through performance improvements across diverse tasks and benchmark datasets.
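To make the teacher-to-student transfer concrete, the sketch below shows a standard knowledge-distillation objective: a cross-entropy term on the hard labels combined with a KL-divergence term between temperature-softened teacher and student outputs. This is a minimal illustration of conventional distillation, not the paper's SD framework; the names `distillation_loss`, `T`, `alpha`, and the EMA-teacher usage note are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-label KL divergence
    between temperature-softened teacher and student distributions."""
    # Hard-label term: ordinary cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```

In a self-distillation setting, where no pre-trained teacher is available, the teacher signal could plausibly come from the model itself, for example its own detached predictions from an earlier epoch or an exponential-moving-average copy of its weights; the exact mechanism used by SD is described in the paper itself, not here.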