The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system and then approximates these dynamics by kernel behavior. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training of multi-layer transformers on sequence-to-sequence data with an arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes how the generalization error converges to the irreducible risk as computational resources scale with data, with particular attention to the optimization process. We establish a theoretical upper bound on the excess risk that exhibits a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially in the computational cost $\mathsf{C}$. Once a specific resource-allocation threshold is crossed, however, the system enters a statistical phase in which the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bound on generalization error.
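For intuition, the phase transition described above can be sketched as a piecewise bound of the following form; here $\mathcal{E}(\mathsf{C})$ denotes the excess risk (generalization error minus the irreducible risk) at compute budget $\mathsf{C}$, and the rate constant $c_1 > 0$ and threshold $\mathsf{C}_0$ are illustrative placeholders rather than the precise quantities derived in the paper:
$$
\mathcal{E}(\mathsf{C}) \;=\;
\begin{cases}
O\!\left(e^{-c_1 \mathsf{C}}\right), & \mathsf{C} \le \mathsf{C}_0 \quad \text{(optimization phase)},\\[4pt]
\Theta\!\left(\mathsf{C}^{-1/6}\right), & \mathsf{C} > \mathsf{C}_0 \quad \text{(statistical phase)}.
\end{cases}
$$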