The test loss of well-trained neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents: super-classing image tasks does not change exponents, while changing input distribution (via changing datasets or adding noise) has a strong effect. We further explore the effect of architecture aspect ratio on scaling exponents.
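As a schematic illustration of the scaling forms referred to above (the notation here, including $C_D$, $C_N$, $\alpha_D$, $\alpha_N$, and $L_\infty$, is ours and not drawn verbatim from the abstract): in the resolution-limited regimes the test loss is expected to fall as a power of the dataset size $D$ or the parameter count $N$,
$$ L(D) \approx C_D\, D^{-\alpha_D}, \qquad L(N) \approx C_N\, N^{-\alpha_N}, $$
while in the variance-limited regimes, where a well-behaved infinite-data or infinite-width limit exists, one would expect the deviation from the limiting loss to follow a leading-order expansion in the small parameter, e.g. $L(D) - L_\infty \propto D^{-1}$, and analogously in the inverse width at large width.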