Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed of hardware platforms have impeded the adoption of pre-trained models, especially in the era of edge computing. In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block-structured pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization. In addition to significantly reducing weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to 5.0x compression with zero or minor accuracy degradation on certain tasks. Our proposed method is also orthogonal to existing compact pre-trained language models such as DistilBERT, which uses knowledge distillation, since a further 1.79x average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. The final compressed model is thus suitable for deployment on resource-constrained edge devices.
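To make the regularization concrete, the sketch below illustrates a reweighted group Lasso penalty applied over rectangular blocks of a weight matrix, which is the general idea behind block-structured pruning. It is a minimal PyTorch illustration under stated assumptions: the block sizes, regularization strength, function name, and layer dimensions are all illustrative choices, not the paper's exact implementation.

```python
import torch

def block_group_lasso_penalty(weight, block_rows, block_cols, alpha, eps=1e-6):
    """Reweighted group Lasso penalty over rectangular blocks of a 2-D weight.

    weight : (out_features, in_features) parameter tensor; both dimensions
             must be divisible by the corresponding block size
    alpha  : per-block reweighting coefficients of shape
             (out_features // block_rows, in_features // block_cols);
             initialize with torch.ones(...) before the first epoch
    Returns the scalar penalty and the updated alpha for the next
    reweighting step (alpha_b = 1 / (||W_b||_F + eps)).
    """
    out_f, in_f = weight.shape
    # Partition the matrix into a grid of (block_rows x block_cols) blocks.
    blocks = weight.reshape(out_f // block_rows, block_rows,
                            in_f // block_cols, block_cols)
    # Frobenius norm of every block.
    norms = blocks.pow(2).sum(dim=(1, 3)).sqrt()
    penalty = (alpha * norms).sum()
    new_alpha = 1.0 / (norms.detach() + eps)
    return penalty, new_alpha

# Hypothetical usage: regularize one linear layer during fine-tuning; after
# training, blocks whose norm falls below a threshold would be zeroed out.
layer = torch.nn.Linear(768, 768)
alpha = torch.ones(768 // 16, 768 // 16)
penalty, alpha = block_group_lasso_penalty(layer.weight, 16, 16, alpha)
loss = 1e-4 * penalty  # in practice: task_loss + lambda_reg * penalty
loss.backward()
```

Because the reweighting coefficients grow for blocks whose norms shrink, the penalty progressively drives whole blocks toward zero, yielding the hardware-friendly block sparsity the abstract describes; the exact block shapes and pruning thresholds used in the experiments are specified in the paper itself.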