The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to decrease model size and increase inference speed with low accuracy loss. In this context, this paper's contributions are two-fold. First, we perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push the boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed, and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.
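
To make "approximate second-order information" concrete, here is a minimal sketch of the classic Optimal Brain Surgeon rule that second-order pruning methods build on. It is not the oBERT implementation. The sketch assumes the Hessian is approximated by a dampened empirical Fisher matrix built from per-sample gradients and that the model is small enough to invert that matrix directly. The function names, the dampening value, and the random data are all hypothetical choices for illustration.

```python
import numpy as np

def empirical_fisher(grads, damp=1e-4):
    """Dampened empirical Fisher: damp * I + (1/m) * sum_k g_k g_k^T.

    grads has shape (m, d): m per-sample gradients over d weights.
    The dampening term (an assumed value) keeps the matrix invertible.
    """
    m, d = grads.shape
    return damp * np.eye(d) + grads.T @ grads / m

def obs_prune_one(w, h_inv):
    """Remove the single weight with the lowest OBS saliency.

    saliency_i = w_i^2 / (2 * [H^-1]_ii)
    update     = -(w_i / [H^-1]_ii) * H^-1[:, i]
    The update adjusts the remaining weights to compensate for the removal.
    """
    diag = np.diag(h_inv)
    saliency = w ** 2 / (2.0 * diag)
    i = int(np.argmin(saliency))
    w_new = w - (w[i] / diag[i]) * h_inv[:, i]
    w_new[i] = 0.0  # force an exact zero, guarding against round-off
    return w_new, i, saliency

# Hypothetical toy data: 64 per-sample gradients for a 6-weight model.
rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 6))
w = rng.normal(size=6)

h_inv = np.linalg.inv(empirical_fisher(grads))
w_pruned, pruned_idx, saliency = obs_prune_one(w, h_inv)
```

When the inverse Hessian is the identity, the saliency reduces to `w_i^2 / 2` and the rule coincides with magnitude pruning. The second-order term matters because it lets correlated weights absorb part of the removed weight's contribution. Explicit inversion is feasible only at toy scale, which is why the abstract stresses applicability at the BERT scale.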
