oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes

In this paper, we introduce oBERTa, an easy-to-use set of language models that allows Natural Language Processing (NLP) practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization, leveraging frozen embeddings to improve knowledge distillation and improved model initialization to deliver higher accuracy on a broad range of transfer tasks. In generating oBERTa, we explore how the highly optimized RoBERTa differs from BERT with respect to pruning during pre-training and fine-tuning, and find that it is less amenable to compression during fine-tuning. We evaluate oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERT-base and exceed the performance of Prune OFA Large on the SQuAD v1.1 question answering dataset, despite being 8x and 2x faster in inference, respectively. We release our code, training regimes, and associated models to encourage broad usage and experimentation.
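To make two of the ingredients named above concrete, the following is a minimal, dependency-free sketch of unstructured pruning and a knowledge distillation loss. It is an illustration under stated assumptions, not the oBERTa training code: the abstract does not specify the pruning criterion, so plain magnitude pruning is swapped in as the simplest stand-in, and the function names `magnitude_prune`, `softmax`, and `distillation_loss`, as well as the temperature value, are hypothetical choices made for this example.

```python
import math


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed.

    Assumption: magnitude is used as the pruning criterion purely for
    illustration; the source does not state which criterion oBERTa uses.
    """
    n = len(weights)
    k = int(n * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(n), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures (hypothetical default T=2).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    # Toy weight vector: half of the entries are removed.
    print(magnitude_prune([0.5, -0.1, 0.03, -0.9], sparsity=0.5))
    # Student that disagrees with the teacher incurs a positive loss.
    print(distillation_loss([1.0, 0.0, -1.0], [2.0, 0.5, -1.5]))
```

In a sparse transfer setting, a loss of this kind would typically be added to the task loss while the pruned weights stay fixed at zero; freezing the embedding layer, as the abstract describes, would mean excluding those parameters from the update step, which this sketch does not model.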
