We propose the $\textit{Quantization Model}$ of neural scaling laws, which explains both the observed power-law dropoff of loss with model and data size and the sudden emergence of new capabilities with scale. We derive this model from what we call the $\textit{Quantization Hypothesis}$: learned network capabilities are quantized into discrete chunks ($\textit{quanta}$). We show that when quanta are learned in order of decreasing use frequency, a power law in use frequencies explains the observed power-law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model internals, we auto-discover diverse model capabilities (quanta) and find tentative evidence that the distribution over the corresponding subproblems in natural-text prediction is compatible with the power law whose exponent our theory predicts from the measured neural scaling exponent.
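The ordering argument can be sketched numerically. Suppose quanta $k = 1, 2, \ldots$ are used with Zipfian frequencies $p_k \propto k^{-(\alpha+1)}$ and a model that has learned the first $n$ quanta incurs loss only on the unlearned ones, so its excess loss is $\sum_{k>n} p_k \propto n^{-\alpha}$. A minimal sketch, assuming an illustrative $\alpha = 0.5$, a finite cutoff on the number of quanta, and unit per-quantum loss (none of these specifics come from the abstract):

```python
import numpy as np

# Quanta k = 1..K used with Zipfian frequency p_k ∝ k^(-(alpha+1)).
alpha = 0.5          # illustrative scaling exponent
K = 1_000_000        # illustrative cutoff on the number of quanta
k = np.arange(1, K + 1)
p = k ** -(alpha + 1.0)
p /= p.sum()

# A model that has learned the first n quanta (in order of decreasing
# use frequency) pays loss only on the tail: L(n) = sum_{k>n} p_k.
ns = np.array([100, 1_000, 10_000])
losses = np.array([p[n:].sum() for n in ns])

# The log-log slope of L(n) recovers an exponent close to -alpha.
slope = np.polyfit(np.log(ns), np.log(losses), 1)[0]
print(f"fitted scaling exponent: {slope:.3f}")
```

The fitted slope lands near $-\alpha$ (here, roughly $-0.5$, with a small bias from the finite cutoff), illustrating how a power law in quanta use frequencies translates into a power law in loss as more quanta are learned.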