Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model: a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
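To make the three objectives concrete, the following is a minimal PyTorch sketch, not the authors' implementation. The toy `ToyEncoder` modules, the layer/position alignment (`layer_t`, `layer_s`, `positions`), and the equal loss weights are illustrative assumptions; the point is only to show how an interchange intervention (swapping hidden states computed from a source input into a base run at aligned sites in teacher and student) yields a counterfactual prediction that the student is trained to match against the teacher's.

```python
# Minimal, self-contained sketch of the three distillation objectives.
# The toy encoders, layer/position alignment, and equal loss weights are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64

class ToyEncoder(nn.Module):
    """Stack of feed-forward blocks standing in for a transformer encoder."""
    def __init__(self, n_layers):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(n_layers))
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, ids, swap=None):
        # swap = (layer, positions, source_hidden): after `layer`, overwrite the
        # hidden states at `positions` with activations from the source input.
        h = self.embed(ids)
        states = []
        for i, layer in enumerate(self.layers):
            h = torch.relu(layer(h))
            if swap is not None and i == swap[0]:
                _, positions, source_hidden = swap
                h = h.clone()
                h[:, positions] = source_hidden[:, positions]
            states.append(h)
        return self.head(h), states

def causal_distillation_loss(teacher, student, base, source, labels,
                             layer_t=8, layer_s=2, positions=(1, 2)):
    positions = list(positions)

    # (1) Task objective: masked language modeling on the base input.
    student_logits, _ = student(base)
    mlm = F.cross_entropy(student_logits.transpose(1, 2), labels, ignore_index=-100)

    # (2) Imitation objective: match the teacher's output distribution.
    with torch.no_grad():
        teacher_logits, _ = teacher(base)
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")

    # (3) IIT objective: apply the same interchange intervention to both models
    # (source activations swapped in at aligned layers/positions) and train the
    # student's counterfactual prediction to match the teacher's.
    with torch.no_grad():
        _, t_src = teacher(source)
        t_cf, _ = teacher(base, swap=(layer_t, positions, t_src[layer_t]))
    _, s_src = student(source)
    s_cf, _ = student(base, swap=(layer_s, positions, s_src[layer_s]))
    iit = F.kl_div(F.log_softmax(s_cf, dim=-1),
                   F.softmax(t_cf, dim=-1), reduction="batchmean")

    return mlm + kd + iit

# Toy usage: a 12-layer teacher distilled into a 4-layer student.
teacher, student = ToyEncoder(12), ToyEncoder(4)
base = torch.randint(0, VOCAB, (8, 16))    # batch of "base" token ids
source = torch.randint(0, VOCAB, (8, 16))  # batch of "source" token ids
labels = torch.randint(0, VOCAB, (8, 16))  # toy MLM targets
loss = causal_distillation_loss(teacher, student, base, source, labels)
loss.backward()
```

Because the counterfactual runs are ordinary forward passes with an activation swap, the IIT term is fully differentiable and can simply be added to the task and imitation losses, as the abstract notes.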