Relative-Based Scaling Law for Neural Language Models

Scaling laws aim to predict model performance accurately across different scales. Existing scaling-law studies rely almost exclusively on cross-entropy as the evaluation metric. However, cross-entropy gives only a partial view of performance: it measures the absolute probability assigned to the correct token but ignores the relative ordering between correct and incorrect tokens. Relative ordering is nevertheless crucial for language models, for example in greedy sampling, where only the top-ranked token is selected. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad applicability of the law with two examples: it provides a deeper explanation of emergence phenomena, and it helps in the search for fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling in large language models, offering valuable insights for both practical development and theoretical exploration.
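To make the contrast between the two views concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that RBP at a cutoff `k` can be estimated as the fraction of positions where the correct token's rank is at most `k`; the abstract does not give the exact definition, so the function names, the `k` parameter, and the tie-breaking rule are all hypothetical. Cross-entropy is computed alongside for comparison.

```python
import numpy as np


def relative_based_probability(logits, targets, k=1):
    """Empirical estimate (an assumption, not the paper's formula) of the
    probability that the correct token is ranked within the top k."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Score the model gives to the correct token at each position.
    correct = logits[np.arange(len(targets)), targets]
    # Rank = 1 + number of tokens scored strictly higher than the correct one.
    ranks = 1 + (logits > correct[:, None]).sum(axis=1)
    return float((ranks <= k).mean())


def cross_entropy(logits, targets):
    """Mean negative log-probability of the correct token (in nats)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


# Toy data: 3 positions, vocabulary of 4 tokens (values are illustrative only).
example_logits = [
    [2.0, 1.0, 0.1, 0.0],  # correct token 0 is ranked first
    [0.5, 2.5, 0.3, 0.1],  # correct token 0 is ranked second
    [1.0, 0.9, 3.0, 0.2],  # correct token 2 is ranked first
]
example_targets = [0, 0, 2]

rbp_top1 = relative_based_probability(example_logits, example_targets, k=1)
rbp_top2 = relative_based_probability(example_logits, example_targets, k=2)
loss = cross_entropy(example_logits, example_targets)
```

In this toy case the correct token is top-ranked at two of three positions, so the top-1 estimate is 2/3 while the top-2 estimate is 1. The rank-based quantity depends only on the ordering of the scores, whereas cross-entropy also changes when the scores are rescaled without changing their order.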