Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Layer-wise distillation is a powerful tool for compressing large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the teacher's hidden representations at every intermediate layer. However, layer-wise distillation is difficult. Because the student has a smaller model capacity than the teacher, it often underfits. Furthermore, the teacher's hidden representations contain redundant information that the student does not necessarily need to learn the target task. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student fit the target task better. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios.
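The idea of aligning filtered hidden representations layer by layer can be illustrated with a minimal sketch. The abstract does not specify the form of the filters or of the alignment loss, so everything here is an assumption made for illustration: each filter is a plain linear map, the per-layer discrepancy is a mean squared error, and the filter weights are fixed by hand rather than trained on a target task. The function names (`apply_filter`, `mse`, `layerwise_filtered_loss`) are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_filter(weights: Matrix, hidden: Vector) -> Vector:
    """Apply a linear filter: one row of weights per output dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_filtered_loss(
    student_hidden: List[Vector],
    teacher_hidden: List[Vector],
    student_filters: List[Matrix],
    teacher_filters: List[Matrix],
) -> float:
    """Average, over layers, the MSE between filtered student and teacher states.

    Assumption: one hidden vector and one filter per layer for each model,
    and both filters at a given layer map into the same output dimension.
    """
    n_layers = len(student_hidden)
    if not (n_layers == len(teacher_hidden) == len(student_filters) == len(teacher_filters)):
        raise ValueError("all inputs must cover the same number of layers")
    if n_layers == 0:
        raise ValueError("at least one layer is required")
    per_layer = [
        mse(apply_filter(f_s, h_s), apply_filter(f_t, h_t))
        for h_s, h_t, f_s, f_t in zip(
            student_hidden, teacher_hidden, student_filters, teacher_filters
        )
    ]
    return sum(per_layer) / n_layers


# Toy example: a single layer, teacher width 3, student width 2.
teacher_hidden = [[1.0, 2.0, 3.0]]
student_hidden = [[1.0, 0.0]]
# The teacher filter keeps the first two dimensions and discards the third,
# standing in for "selecting the useful knowledge" from a wider representation.
teacher_filters = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
student_filters = [[[1.0, 0.0], [0.0, 1.0]]]

loss = layerwise_filtered_loss(
    student_hidden, teacher_hidden, student_filters, teacher_filters
)
print(loss)
```

In this toy setup the third teacher dimension never enters the loss, so the student is not penalized for failing to reproduce it. In an actual task-aware setting the filters would presumably be learned from the target task rather than written by hand.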

