Recent transformer language models achieve outstanding results in many natural language processing (NLP) tasks. However, their enormous size often makes them impractical to deploy on memory-constrained devices, requiring practitioners to compress them into smaller networks. In this paper, we explore offline compression methods, meaning computationally cheap approaches that do not require further fine-tuning of the compressed model. We challenge classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers, by compressing certain modules together, positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.
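For readers unfamiliar with the factorization baseline mentioned above, the following is a minimal sketch of offline low-rank compression via truncated SVD. It is an illustration under stated assumptions, not the paper's method or its experimental setup: the function name `low_rank_factorize`, the matrix shape, and the rank are all hypothetical, and a random matrix stands in for a real transformer weight.

```python
import numpy as np


def low_rank_factorize(weight, rank):
    """Approximate weight (m x n) as a @ b, with a (m x rank) and b (rank x n).

    Uses a truncated SVD; no fine-tuning is involved, which is what makes
    this kind of compression "offline".
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # fold the kept singular values into the left factor
    b = vt[:rank, :]
    return a, b


# Hypothetical stand-in for a single weight matrix of a transformer module.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 256))

a, b = low_rank_factorize(weight, rank=16)

params_before = weight.size       # parameters in the dense matrix
params_after = a.size + b.size    # parameters in the two factors
rel_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)

print(f"parameters: {params_before} -> {params_after}")
print(f"relative reconstruction error: {rel_error:.3f}")
```

The two factors replace the dense matrix, so storage drops whenever `rank * (m + n) < m * n`. A random matrix has no low-rank structure, so the error printed here says nothing about how well real model weights compress.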