Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, large model sizes and long inference times limit the deployment of such models in real-time applications. Typical approaches apply knowledge distillation to compress large teacher models into small student models. However, most of these studies focus on a single domain only, ignoring transferable knowledge from other domains. We argue that a teacher trained with transferable knowledge digested across domains generalizes better and thus better supports knowledge distillation. To this end, we propose a Meta-Knowledge Distillation (Meta-KD) framework that, inspired by meta-learning, builds a meta-teacher model capturing transferable knowledge across domains and uses it to pass this knowledge to students. Specifically, we first leverage a cross-domain learning process to train the meta-teacher on multiple domains, and then propose a meta-distillation algorithm to learn single-domain student models under the guidance of the meta-teacher. Experiments on two public multi-domain NLP tasks show the effectiveness and superiority of the proposed Meta-KD framework. We also demonstrate the capability of Meta-KD in both few-shot and zero-shot learning settings.
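The following is a minimal sketch of the two-stage pipeline described above (cross-domain meta-teacher training, then per-domain distillation), written in PyTorch. The placeholder encoder, the hidden size, the toy domain data, and the temperature-scaled soft-label KD loss are illustrative assumptions; the paper's actual meta-distillation algorithm involves additional components (e.g., domain-expertise weighting) that are not shown here.

```python
# Hypothetical sketch of the Meta-KD two-stage training flow.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Placeholder for a pre-trained language model encoder + classifier head."""
    def __init__(self, hidden=128, num_labels=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_labels))

    def forward(self, x):
        return self.net(x)

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hard-label cross-entropy mixed with temperature-scaled soft-label distillation."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy multi-domain data: (features, labels) per domain.
domains = {d: (torch.randn(64, 128), torch.randint(0, 2, (64,)))
           for d in ["books", "dvd"]}

# Stage 1: cross-domain learning -- train one meta-teacher on all domains
# (simplified here to pooled mini-batches over the domain union).
meta_teacher = Encoder()
teacher_opt = torch.optim.Adam(meta_teacher.parameters(), lr=1e-3)
for _ in range(5):
    for x, y in domains.values():
        teacher_opt.zero_grad()
        F.cross_entropy(meta_teacher(x), y).backward()
        teacher_opt.step()

# Stage 2: meta-distillation (simplified) -- learn a single-domain student
# per domain under the guidance of the meta-teacher's soft predictions.
students = {}
for name, (x, y) in domains.items():
    student = Encoder()
    student_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(5):
        student_opt.zero_grad()
        with torch.no_grad():
            t_logits = meta_teacher(x)
        kd_loss(student(x), t_logits, y).backward()
        student_opt.step()
    students[name] = student
```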