As language models scale up, it becomes increasingly expensive to verify research ideas, because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that directly predicts some metrics for large models based solely on the results and hyperparameters of small models. Existing methods based on scaling laws require hyperparameter search on the largest models, which is impractical with limited resources. We address this issue by presenting our discovery that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws for hyperparameters close to common loss basins, without any search on large models. Thus, different models can be directly compared at large scale via loss prediction even before training starts. We propose this new paradigm as a first step towards reliable academic research at any model scale without heavy computation. Code will be publicly available shortly.
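To make the paradigm concrete, the following is a minimal sketch, not the authors' released code, of the loss-prediction step: fit a saturating power law to final losses measured on small muP-parametrized models that share the same transferred hyperparameters, then extrapolate to a larger model before training it. The functional form `L(N) = a * N^(-b) + c`, the model sizes, and the loss values are all illustrative assumptions.

```python
# Minimal sketch of scaling-law fitting for loss prediction.
# All (size, loss) pairs below are hypothetical placeholder values,
# not measurements from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, b, c):
    """Saturating power law commonly used for loss-vs-size fits."""
    return a * n_params ** (-b) + c

# Hypothetical results from small muP runs that all share the same
# transferred hyperparameters (learning rate, initialization, etc.).
sizes = np.array([10e6, 25e6, 50e6, 100e6, 200e6])   # parameter counts
losses = np.array([3.90, 3.55, 3.32, 3.12, 2.96])    # final training losses

(a, b, c), _ = curve_fit(scaling_law, sizes, losses,
                         p0=(10.0, 0.1, 2.0), maxfev=10000)

# Predict the loss of a much larger model without ever training it.
target = 7e9
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")
print(f"predicted loss at {target:.0e} params: "
      f"{scaling_law(target, a, b, c):.3f}")
```

Because muP keeps the optimal hyperparameters stable across widths, the fit can, under this assumption, be performed once on small models and reused to compare candidate designs at the target scale.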