Knowledge distillation has been shown to be a powerful model compression approach for facilitating the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation, which produces a compact pre-trained model that can be easily fine-tuned on various downstream tasks with a small computational cost and memory footprint. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model and iteratively prune the student's neurons until the target width is reached. This approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines.
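To make the iterative prune-and-distill idea concrete, the following is a minimal sketch, not the paper's released implementation. It assumes a toy two-layer MLP teacher, a student initialized as an exact copy of the teacher, a first-order weight-times-gradient saliency proxy for neuron importance, and a fixed number of hidden neurons pruned per step until the target width is reached; all names and hyperparameters here are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, hidden, target_hidden = 32, 8, 64, 16
steps, prune_per_step = 30, 2

# Teacher: a fixed two-layer MLP (stand-in for the pre-trained teacher).
t_w1 = torch.randn(hidden, d_in) / d_in ** 0.5
t_w2 = torch.randn(d_out, hidden) / hidden ** 0.5

# Student: initialized as an exact copy of the teacher, then pruned iteratively.
s_w1 = t_w1.clone().requires_grad_(True)
s_w2 = t_w2.clone().requires_grad_(True)
mask = torch.ones(hidden)                      # 1 = hidden neuron kept, 0 = pruned
opt = torch.optim.Adam([s_w1, s_w2], lr=1e-2)

def forward(x, w1, w2, m=None):
    h = torch.relu(x @ w1.t())
    if m is not None:
        h = h * m                              # zero out pruned hidden neurons
    return h @ w2.t()

for step in range(steps):
    x = torch.randn(256, d_in)                 # a batch standing in for open-domain data
    with torch.no_grad():
        t_out = forward(x, t_w1, t_w2)         # teacher predictions
    s_out = forward(x, s_w1, s_w2, mask)       # student predictions
    loss = F.mse_loss(s_out, t_out)            # distillation (prediction-matching) loss

    opt.zero_grad()
    loss.backward()
    opt.step()

    # Prune a few of the least important surviving neurons per step, so the
    # student's predictions never drift far from the teacher's at any point.
    kept = int(mask.sum())
    if kept > target_hidden:
        with torch.no_grad():
            importance = (s_w1 * s_w1.grad).abs().sum(dim=1)   # first-order saliency proxy
            importance[mask == 0] = float("inf")               # never re-rank pruned neurons
            n = min(prune_per_step, kept - target_hidden)
            drop = torch.topk(importance, n, largest=False).indices
            mask[drop] = 0.0

print(f"final hidden width: {int(mask.sum())}, distillation loss: {loss.item():.4f}")
```

Because the student starts as a copy of the teacher and only a small number of neurons are removed between optimizer updates, the distillation loss stays small at every step, which is the property the abstract attributes to the homotopic schedule.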