Pre-trained Language Models (PLMs) have been successful across a wide range of natural language processing (NLP) tasks. State-of-the-art PLMs, however, are too large to be deployed on edge devices. As a result, model compression has attracted increasing attention in the NLP community. Most existing work focuses on compressing encoder-based models (TinyBERT, DistilBERT, DistilRoBERTa, etc.); to the best of our knowledge, however, the compression of decoder-based models (such as GPT-2) has received comparatively little attention. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current state-of-the-art knowledge distillation techniques to improve the fine-tuning of DistilGPT2; 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against the distillation-based DistilGPT2. Our compressed model requires significantly less training time than DistilGPT2, yet achieves better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.
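The layer-truncation idea can be illustrated with a short sketch: a smaller GPT-2 student is initialized from a subset of the pretrained GPT-2's transformer blocks and then pre-trained further before fine-tuning. The snippet below is a minimal sketch assuming the Hugging Face `transformers` GPT2LMHeadModel API; the 6-layer size and the choice of keeping every other block are illustrative assumptions, not necessarily the paper's exact recipe.

```python
# Minimal layer-truncation sketch (assumed configuration, not the paper's exact setup).
from transformers import GPT2Config, GPT2LMHeadModel

# Load the full 12-layer GPT-2 as the source model.
teacher = GPT2LMHeadModel.from_pretrained("gpt2")

# Build a 6-layer student with the same hidden size and vocabulary.
student_config = GPT2Config.from_pretrained("gpt2", n_layer=6)
student = GPT2LMHeadModel(student_config)

# Copy token/position embeddings and the final layer norm.
student.transformer.wte.load_state_dict(teacher.transformer.wte.state_dict())
student.transformer.wpe.load_state_dict(teacher.transformer.wpe.state_dict())
student.transformer.ln_f.load_state_dict(teacher.transformer.ln_f.state_dict())

# Keep every other transformer block (one possible truncation scheme).
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.transformer.h[student_idx].load_state_dict(
        teacher.transformer.h[teacher_idx].state_dict()
    )

# The truncated student can then be further pre-trained and fine-tuned downstream.
student.save_pretrained("gpt2-truncated-6layer")
```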