Train-before-Test Harmonizes Language Model Rankings

Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
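The two quantities the abstract leans on, ranking agreement between benchmarks and how close the model-score matrix is to rank one, can be illustrated with a minimal sketch. This is not the authors' code: Kendall's tau is used here as one common rank-agreement measure (the abstract does not name a metric), the function names `kendall_tau` and `top_singular_share` are hypothetical, and the toy matrix is made up purely to exercise the functions.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists covering the same models.

    +1.0 means both benchmarks order the models identically,
    -1.0 means the orderings are exactly reversed. Tied pairs count as zero.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same models")
    if len(scores_a) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_singular_share(matrix, iterations=200):
    """Share of the squared Frobenius norm captured by the top singular value.

    Uses power iteration on M^T M, starting from the all-ones vector.
    Assumes a non-zero matrix with non-negative entries (e.g. accuracies),
    so the start vector is not orthogonal to the top singular direction.
    A value near 1.0 means the matrix is close to rank one.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = M v, then w = M^T u, then renormalize.
        u = [sum(row[k] * v[k] for k in range(n_cols)) for row in matrix]
        w = [sum(matrix[r][k] * u[r] for r in range(n_rows)) for k in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    u = [sum(row[k] * v[k] for k in range(n_cols)) for row in matrix]
    top_sq = sum(x * x for x in u)  # squared top singular value
    total_sq = sum(x * x for row in matrix for x in row)
    return top_sq / total_sq


if __name__ == "__main__":
    # Toy model-score matrix: rows are models, columns are benchmarks.
    # These numbers are invented for illustration, not taken from the paper.
    toy_scores = [
        [0.30, 0.45],
        [0.40, 0.60],
        [0.50, 0.75],
    ]
    bench_0 = [row[0] for row in toy_scores]
    bench_1 = [row[1] for row in toy_scores]
    print("rank agreement:", kendall_tau(bench_0, bench_1))
    print("top singular share:", top_singular_share(toy_scores))
```

Under these assumptions, a tau close to 1 between every pair of benchmark columns corresponds to the cross-benchmark consistency the abstract describes, and a top singular share close to 1 corresponds to a score matrix that is "essentially rank one".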

