Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, their large model sizes, together with long inference times, limit the deployment of such models in real-time applications. One line of model compression approaches uses knowledge distillation to distill large teacher models into small student models. Most of these studies focus on a single domain only, ignoring the transferable knowledge available in other domains. We observe that training a teacher that digests transferable knowledge across domains achieves better generalization and therefore provides stronger guidance for knowledge distillation. Hence, we propose a Meta-Knowledge Distillation (Meta-KD) framework that builds a meta-teacher model to capture transferable knowledge across domains and pass such knowledge to students. Specifically, we explicitly force the meta-teacher to capture transferable knowledge at both the instance level and the feature level from multiple domains, and then propose a meta-distillation algorithm to learn single-domain student models under the guidance of the meta-teacher. Experiments on public multi-domain NLP tasks show the effectiveness and superiority of the proposed Meta-KD framework. Furthermore, we demonstrate the capability of Meta-KD in settings where the training data is scarce.
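Below is a minimal, illustrative sketch of a distillation objective in the spirit described above: the single-domain student matches the hard labels, the meta-teacher's temperature-scaled soft logits, and the meta-teacher's intermediate features. The function name `meta_distill_loss`, the weights `alpha` and `beta`, and the simple MSE feature alignment are assumptions for illustration; the actual meta-distillation losses and instance weighting used in Meta-KD differ in detail.

```python
# Illustrative sketch only: combines a task loss, a logit-distillation loss,
# and a feature-alignment loss, as a generic stand-in for meta-distillation.
import torch
import torch.nn.functional as F


def meta_distill_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,
                      labels, temperature=2.0, alpha=0.5, beta=0.1):
    """Combine task, soft-logit distillation, and feature-alignment losses."""
    # Supervised task loss on the single-domain hard labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label distillation from the meta-teacher (temperature-scaled KL).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Feature-level transfer: align student and meta-teacher hidden states
    # (assumes the two feature tensors have been projected to the same size).
    feat_loss = F.mse_loss(student_feats, teacher_feats)

    return (1 - alpha) * task_loss + alpha * kd_loss + beta * feat_loss


if __name__ == "__main__":
    # Toy usage with random tensors standing in for model outputs.
    student_logits = torch.randn(8, 3)
    teacher_logits = torch.randn(8, 3)
    student_feats = torch.randn(8, 128)
    teacher_feats = torch.randn(8, 128)
    labels = torch.randint(0, 3, (8,))
    print(meta_distill_loss(student_logits, teacher_logits,
                            student_feats, teacher_feats, labels).item())
```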