Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult: since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for learning the target task. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student better fit the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios.
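To make the filter-based alignment concrete, the following is a minimal PyTorch-style sketch of how a per-layer task-aware filter and the resulting layer-wise distillation loss could be structured. The names (`TaskAwareFilter`, `layerwise_distillation_loss`), the single-linear-layer filter design, and the MSE alignment loss are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskAwareFilter(nn.Module):
    """Illustrative per-layer filter: projects a hidden representation into a
    shared space intended to retain task-relevant information (assumed design)."""

    def __init__(self, hidden_dim: int, filter_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, filter_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return F.gelu(self.proj(hidden_states))


def layerwise_distillation_loss(student_hiddens, teacher_hiddens,
                                student_filters, teacher_filters):
    """Align filtered student and teacher hidden states layer by layer with an
    MSE loss, averaged over the aligned layers (a sketch, not TED's exact loss)."""
    loss = torch.tensor(0.0)
    for h_s, h_t, f_s, f_t in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        loss = loss + F.mse_loss(f_s(h_s), f_t(h_t))
    return loss / len(student_filters)
```

In this sketch, the student and teacher each have their own filter per aligned layer, so models with different hidden sizes can still be compared in the shared `filter_dim` space; the distillation loss would typically be added to the student's task loss during training.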