Despite the state-of-the-art performance that Large Language Models (LLMs) achieve on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. However, despite the rapid development of PEFT methods, current evaluations remain limited in the models and datasets they cover and are difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its use across 27 NLP datasets and 6 PEFT methods. To account for PEFT training and inference factors beyond task performance, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.
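To make the idea behind PSCP concrete, the sketch below shows one way a task score could be softly penalized by trainable-parameter count, inference speed, and training memory. The functional form, the reference values (`ref_params`, `ref_tokens_per_sec`, `ref_memory_gb`), and the weights (`alpha`, `beta`, `gamma`) are all illustrative assumptions, not the metric's actual definition from the paper.

```python
import math

def pscp(task_score: float,
         trainable_params: int,
         inference_tokens_per_sec: float,
         train_memory_gb: float,
         ref_params: int = 7_000_000_000,
         ref_tokens_per_sec: float = 50.0,
         ref_memory_gb: float = 80.0,
         alpha: float = 0.1,
         beta: float = 0.1,
         gamma: float = 0.1) -> float:
    """Down-weight a task score with soft penalties for trainable-parameter
    count, inference speed, and training memory, each measured relative to a
    reference configuration. Hypothetical weighting, not the paper's formula."""
    # Each penalty lies in (0, 1]; cheaper/faster methods are penalized less.
    param_penalty = math.exp(-alpha * trainable_params / ref_params)
    speed_penalty = math.exp(-beta * ref_tokens_per_sec /
                             max(inference_tokens_per_sec, 1e-9))
    memory_penalty = math.exp(-gamma * train_memory_gb / ref_memory_gb)
    return task_score * param_penalty * speed_penalty * memory_penalty

# Example: a LoRA-style run with ~4M trainable parameters.
print(pscp(task_score=0.82, trainable_params=4_200_000,
           inference_tokens_per_sec=45.0, train_memory_gb=24.0))
```

The multiplicative form is just one design choice: it keeps the penalized score on the same scale as the raw task score while smoothly discounting methods that are more expensive to train or serve.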