The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods across a wide range of architecture families and several domains, including image classification, neural machine translation (NMT), and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.
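As a rough illustration of the extrapolation-based evaluation described above, the sketch below fits a scaling-law curve to the small-scale portion of a learning curve and scores it on held-out larger scales. This is a minimal sketch, not the paper's estimator: the saturating power-law form `eps_inf + a * x**(-alpha)`, the synthetic data, and the cutoff point are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve data: scale x (e.g., training set size) vs. validation loss y.
rng = np.random.default_rng(0)
x = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4, 6.4e4, 1.28e5])
y = 0.05 + 3.0 * x ** (-0.3) + rng.normal(0.0, 1e-3, x.size)

def scaling_law(x, eps_inf, a, alpha):
    # Assumed saturating power law: loss approaches eps_inf as scale grows.
    return eps_inf + a * x ** (-alpha)

# Fit parameters using only the smaller scales ...
cutoff = 5
popt, _ = curve_fit(scaling_law, x[:cutoff], y[:cutoff],
                    p0=(0.0, 1.0, 0.5), maxfev=10000)

# ... and evaluate by extrapolation loss on the held-out larger scales,
# rather than by goodness of fit on the points used for fitting.
pred = scaling_law(x[cutoff:], *popt)
extrapolation_loss = np.mean((pred - y[cutoff:]) ** 2)
print("fitted (eps_inf, a, alpha):", popt)
print("extrapolation loss:", extrapolation_loss)
```

The key design choice this illustrates is scoring candidate scaling-law fits by how well they predict unseen, larger scales, rather than by how closely they interpolate the scales already observed.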