We present a smoothly broken power law functional form, which we call a Broken Neural Scaling Law (BNSL), that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, model input size, number of training steps, or upstream performance varies). It does so for various architectures and for each task in a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, molecules, computer programming/coding, math word problems, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). Compared to other functional forms for neural scaling behavior, this functional form yields considerably more accurate extrapolations of scaling behavior on this set. Moreover, it accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points (often called "emergent phase transitions") present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
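To make the phrase "smoothly broken power law" concrete, the following is a minimal sketch of one common way to write such a function: a base power law multiplied by one smooth transition factor per break. The abstract does not state the exact expression, so the parameterization and every name here (`bnsl`, `a`, `b`, `c0`, and the per-break tuples `(c_i, d_i, f_i)`) are assumptions made for illustration and are not taken from the linked repository.

```python
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """Illustrative smoothly broken power law (assumed parameterization).

    y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)

    a      : asymptotic offset of the metric
    b, c0  : scale and log-log slope of the first power-law segment
    breaks : iterable of (c_i, d_i, f_i), where d_i is the location of
             break i, c_i the change in slope after it, and f_i its
             sharpness (smaller f_i gives a sharper transition)
    """
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in breaks:
        # Each factor is ~1 for x << d_i and ~(x / d_i)**(-c_i) for x >> d_i.
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One break at x = 1e4 that steepens the log-log slope from -1 to -2.
xs = np.array([1.0, 10.0, 1e6, 1e7])
ys = bnsl(xs, a=0.0, b=1.0, c0=1.0, breaks=[(1.0, 1e4, 0.1)])
slope_before = np.log10(ys[1] / ys[0])  # close to -1
slope_after = np.log10(ys[3] / ys[2])   # close to -2
```

In this sketch, a negative `c_i` larger in magnitude than the preceding slope makes the curve turn upward after the break, which is one way a form of this kind can express non-monotonic behavior. Fitting such a function to measured points could be done with a generic least-squares routine such as `scipy.optimize.curve_fit`; that choice is likewise an assumption, not something stated in the abstract.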