In this paper, we introduce the oBERTa family of language models, an easy-to-use set of models that allows Natural Language Processing (NLP) practitioners to obtain models between 3.8 and 24.3 times faster without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization, and leverages frozen embeddings, improved distillation, and better model initialization to deliver higher accuracy on a broad range of transfer tasks. In creating oBERTa, we explore how the highly optimized RoBERTa differs from BERT when pruned during pre-training and fine-tuning, and find it less amenable to compression during fine-tuning. We evaluate oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERT-base and exceed the performance of Prune OFA Large on the SQuAD v1.1 question answering dataset, despite being 8x and 2x faster in inference, respectively. We release our code, training regimes, and associated models to encourage broad usage and experimentation.