Transformer-based language models are used in a wide range of natural language processing applications. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning across a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three known architectures, creating sparse pre-trained versions of BERT-Base, BERT-Large, and DistilBERT. We show that these compressed sparse pre-trained models transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a $40\times$ compression ratio for the encoder with less than $1\%$ accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
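To make the sparsity-preserving transfer step concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a weight matrix is magnitude-pruned once, and the resulting mask is re-applied after every optimizer step during downstream fine-tuning so the sparsity pattern is maintained. The layer size, sparsity level, and toy training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for one Transformer weight matrix; the paper's method
# operates on the full BERT/DistilBERT encoder weights.
layer = nn.Linear(768, 768)

# 1) Magnitude pruning: zero out the smallest-magnitude weights and keep
#    the resulting binary mask (90% sparsity here is illustrative).
sparsity = 0.90
with torch.no_grad():
    w = layer.weight
    threshold = torch.quantile(w.abs().flatten(), sparsity)
    mask = (w.abs() > threshold).float()
    w.mul_(mask)

# 2) Downstream fine-tuning with the sparsity pattern held fixed: after
#    every optimizer step, re-apply the mask so pruned weights stay zero.
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)
for _ in range(3):  # toy loop on random data in place of a real task
    x = torch.randn(8, 768)
    loss = layer(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        layer.weight.mul_(mask)  # sparsity pattern preserved

print(f"weight sparsity: {(layer.weight == 0).float().mean().item():.2%}")
```

In practice the same fixed-mask idea applies to every pruned encoder matrix, and quantization-aware training can then be layered on top to reach 8-bit weights.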