Large language models with a huge number of parameters, when trained on a near-internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down, and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and how the finite extent of the data's spectral power law causes the model's performance to plateau.