While deep and large pre-trained models are the state of the art for various natural language processing tasks, their huge size poses significant challenges for practical use in resource-constrained settings. Recent work on knowledge distillation proposes task-agnostic as well as task-specific methods to compress these models, with task-specific methods often yielding higher compression rates. In this work, we develop a new task-agnostic distillation framework, XtremeDistilTransformers, that leverages the advantages of task-specific methods to learn a small universal model that can be applied to arbitrary tasks and languages. To this end, we study the transferability of several source tasks, augmentation resources, and model architectures for distillation. We evaluate our model on multiple tasks, including the General Language Understanding Evaluation (GLUE) benchmark, the SQuAD question answering dataset, and a massive multilingual NER dataset covering 41 languages.
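To make the underlying idea concrete, the sketch below shows a generic soft-label distillation objective of the kind such frameworks build on: the student is trained to match the teacher's temperature-softened output distribution. This is a minimal, hypothetical illustration, not the exact XtremeDistilTransformers procedure; the function name, tensor shapes, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 2.0) -> torch.Tensor:
    """Generic knowledge-distillation loss: KL divergence between the
    temperature-softened teacher and student output distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a large "teacher" and a small "student" score the same batch.
teacher_logits = torch.randn(8, 30522)                       # teacher outputs
student_logits = torch.randn(8, 30522, requires_grad=True)   # student outputs
loss = soft_label_distillation_loss(student_logits, teacher_logits)
loss.backward()
```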