GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amounts of data and its large number of parameters (from ~100M to billions). Despite the superior performance of GPT, especially in few-shot or zero-shot setups, its overparameterized nature can be prohibitive for deploying the model on devices with limited computational power or memory. This problem can be mitigated with model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of GPT-2 and then undergoes very light pre-training on only a small portion of the training data with intermediate-layer knowledge distillation (ILKD). Finally, KnGPT2 is fine-tuned on downstream tasks, also using ILKD. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark tasks and show that, with more efficient pre-training and a similar number of parameters, our KnGPT2 significantly outperforms the existing DistilGPT2 model.
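Kronecker decomposition replaces a full weight matrix W with the Kronecker product of two much smaller factors, A and B, so that a layer of size m×n is parameterized by roughly m1·n1 + m2·n2 values instead of m·n. The following is a minimal PyTorch sketch of such a layer, not the paper's implementation: the class name `KroneckerLinear`, the random factor initialization, and the explicit `torch.kron` materialization in the forward pass are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KroneckerLinear(nn.Module):
    """Linear layer whose weight is the Kronecker product of two small factors.

    The full weight W of shape (out1*out2, in1*in2) is represented as
    A (out1 x in1) kron B (out2 x in2), cutting the parameter count from
    out1*out2*in1*in2 down to out1*in1 + out2*in2.
    """

    def __init__(self, in1, in2, out1, out2, bias=True):
        super().__init__()
        # Illustrative random init; the paper initializes the factors from a
        # Kronecker decomposition of the pre-trained GPT-2 weights instead.
        self.A = nn.Parameter(torch.randn(out1, in1) / in1 ** 0.5)
        self.B = nn.Parameter(torch.randn(out2, in2) / in2 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out1 * out2)) if bias else None

    def forward(self, x):
        # Materialize W = A kron B for clarity; a more efficient variant would
        # reshape x and apply A and B separately without forming W explicitly.
        W = torch.kron(self.A, self.B)  # shape: (out1*out2, in1*in2)
        return F.linear(x, W, self.bias)


# Example: a 768 -> 768 mapping (hypothetical factor shapes) with
# 32*32 + 24*24 = 1600 weight parameters instead of 768*768 = 589824.
layer = KroneckerLinear(in1=32, in2=24, out1=32, out2=24)
y = layer(torch.randn(4, 768))  # y has shape (4, 768)
```

The factor shapes control the compression ratio and are a design choice; the short ILKD pre-training and fine-tuning described above are then used to recover the accuracy lost by the decomposition.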