Recent work has explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs), in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has therefore attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated technique. We further propose two decomposition and reconstruction protocols to improve the effectiveness and efficiency of compression. Our compressed BERT with $1/7$ of the parameters in Transformer layers performs on par with, and sometimes slightly better than, the original BERT on the GLUE benchmark. A tiny version achieves $96.7\%$ of the performance of BERT-base with $1/48$ of the encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and is $2.7\times$ faster at inference. To show that the proposed method is orthogonal to existing compression methods such as knowledge distillation, we also explore its benefit on a distilled BERT.
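To give a rough sense of where such ratios can come from, the following is a minimal sketch using a plain low-rank matrix factorization, the simplest special case of tensor decomposition; the symbols $W$, $U$, $V$, $d$, and $r$ are illustrative only, and the two protocols proposed in this work are not restricted to this form. Approximating a weight matrix $W \in \mathbb{R}^{d \times d}$ by two rank-$r$ factors reduces its storage from $d^{2}$ to $2dr$ parameters:
\[
W \approx U V^{\top}, \qquad U, V \in \mathbb{R}^{d \times r}, \qquad
\frac{\text{\#params after}}{\text{\#params before}} = \frac{2dr}{d^{2}} = \frac{2r}{d}.
\]
For example, with $d = 768$ (the BERT-base hidden size), $r \approx 55$ already yields roughly a $1/7$ ratio for a single matrix; more aggressive ratios such as $1/48$ generally also rely on higher-order decompositions and sharing factors across layers.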