Large Language Models (LLMs) have achieved great success in solving difficult tasks across many domains, but such success comes with high computation cost and inference latency. As developers and third parties customize these models, the need for efficient inference has increased. Many efforts have attempted to reduce inference cost through model compression techniques such as pruning and distillation. However, these techniques either require labeled data or are time-consuming, as they require the compressed model to be retrained to regain accuracy. In this paper, we propose a gradient-free structured pruning framework that uses only unlabeled data. An evaluation on the GLUE and SQuAD benchmarks using BERT$_{BASE}$ and DistilBERT illustrates the effectiveness of the proposed approach. Using only the weights of the pre-trained model and unlabeled data, and in a matter of a few minutes on a single GPU, up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.