The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints on edge devices, there has been rising interest in compressing these models to improve their inference time and memory footprint. This paper presents a novel loss objective for compressing token embeddings in Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of the compressed embeddings with respect to the original uncompressed embeddings. The proposed method is task-agnostic and does not require further language-modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model perplexity. Moreover, we evaluate our proposed approach on the SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. Our code is publicly available.
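As a rough illustration of the idea summarized above, the sketch below pairs a plain MSE reconstruction loss with a cosine-similarity term that penalizes directional drift between the original and reconstructed token embeddings. The linear encoder/decoder shape, the `alpha` weighting, and the training loop are assumptions for illustration only, not the paper's exact objective or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAutoEncoder(nn.Module):
    """Linear encoder/decoder that maps d-dimensional token embeddings
    to a smaller bottleneck and back (hypothetical architecture)."""
    def __init__(self, dim: int, bottleneck: int):
        super().__init__()
        self.encoder = nn.Linear(dim, bottleneck, bias=False)
        self.decoder = nn.Linear(bottleneck, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def direction_aware_loss(original: torch.Tensor,
                         reconstructed: torch.Tensor,
                         alpha: float = 1.0) -> torch.Tensor:
    """Reconstruction MSE plus a cosine-distance term that penalizes
    angular deviation of the reconstruction from the original embedding.
    The alpha weight is an assumption, not a value from the paper."""
    mse = F.mse_loss(reconstructed, original)
    cos = 1.0 - F.cosine_similarity(reconstructed, original, dim=-1).mean()
    return mse + alpha * cos

# Usage sketch: compress a (vocab_size x dim) embedding matrix.
vocab_size, dim, bottleneck = 30522, 768, 256
embeddings = torch.randn(vocab_size, dim)   # stand-in for a pretrained embedding matrix
model = EmbeddingAutoEncoder(dim, bottleneck)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                        # small illustrative training loop
    optimizer.zero_grad()
    loss = direction_aware_loss(embeddings, model(embeddings))
    loss.backward()
    optimizer.step()
```

After training, only the encoder output (the compressed embedding table) and the decoder weights need to be kept, which is what yields the memory savings relative to the full embedding matrix.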