Neural scaling laws describe a predictable, power-law relationship between a model's parameter count and its performance after training. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation, training models with as few as 10K parameters and evaluating downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict the performance of larger models, which enables effective model selection. However, revealing scaling laws requires careful hyperparameter tuning and multiple runs for uncertainty estimation, which incurs additional overhead that partially offsets the computational benefits.
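As a minimal illustration of the idea (not the paper's code), the sketch below assumes a power-law form error(N) = a·N^(-b) + c, fits it to downstream errors of small models with synthetic, made-up numbers, and extrapolates to a larger candidate model before training it; the parameter counts, error values, and initial guesses are all hypothetical.

```python
# Minimal sketch: fit an assumed power-law scaling curve to small-model results
# and extrapolate to predict a larger model's downstream error.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Assumed form: error = a * n^(-b) + c, with n the parameter count
    # and c an irreducible error floor.
    return a * np.power(n, -b) + c

# Illustrative (made-up) downstream errors for models from 10K to 10M parameters.
params = np.array([1e4, 1e5, 1e6, 1e7])
errors = np.array([0.62, 0.48, 0.39, 0.33])

# Fit the three coefficients; the initial guess keeps the optimizer stable.
(a, b, c), _ = curve_fit(power_law, params, errors, p0=[1.0, 0.1, 0.2], maxfev=10000)

# Extrapolate to a hypothetical 100M-parameter candidate before committing to train it.
predicted_error = power_law(1e8, a, b, c)
print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}; predicted error at 1e8 params: {predicted_error:.3f}")
```

In practice, as the abstract notes, each point on such a curve would come from multiple tuned runs so that the fit (and its extrapolation) carries an uncertainty estimate rather than a single number.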