A computationally expensive and memory-intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such large language models in resource-scarce environments, transfers knowledge learned on individual word representations without restrictions. In this paper, inspired by recent observations that language representations are relatively positioned and carry more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for language models, our contextual distillation imposes no restrictions on architectural differences between teacher and student. We validate the effectiveness of our method on challenging benchmarks for language understanding tasks, not only with architectures of various sizes but also in combination with DynaBERT, a recently proposed adaptive size pruning method.
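To illustrate the word-relation idea, a minimal sketch (using our own notation and a simple squared-error loss form, not necessarily the exact objective of the paper) matches pairwise cosine similarities of word representations between teacher and student:
\[
\mathcal{L}_{\mathrm{WR}} \;=\; \sum_{i,j} \left( \frac{\langle h_i^{T}, h_j^{T}\rangle}{\lVert h_i^{T}\rVert\,\lVert h_j^{T}\rVert} \;-\; \frac{\langle h_i^{S}, h_j^{S}\rangle}{\lVert h_i^{S}\rVert\,\lVert h_j^{S}\rVert} \right)^{2},
\]
where $h_i^{T}$ and $h_i^{S}$ denote the teacher's and student's representations of the $i$-th word. Because only relations between representations are matched, rather than the representations themselves, the teacher and student need not share hidden dimensions or layer counts, which is what removes the architectural restrictions mentioned above.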