While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes, for complex structured inputs, what kind of features emerge, how they emerge, and under which training conditions. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. In the lazy learning stage, the top layer overfits to the random hidden representation and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer then carries information about the target label, with a specific structure that enables each hidden node to learn its representation \emph{independently}. Interestingly, the independent dynamics follows exactly the \emph{gradient ascent} of an energy function $E$, and its local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, what their representation power is, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on the missing features that remain to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate and sample size in grokking, leads to provable scaling laws of memorization and generalization, and reveals, from the first principles of gradient dynamics, the underlying reason why recent optimizers such as Muon are effective. Our analysis can be extended to multi-layer architectures.
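To make the setting concrete, below is a minimal sketch of the kind of experiment the abstract refers to: a 2-layer nonlinear network trained with weight decay on a group arithmetic task (here, addition modulo $p$), where delayed generalization is typically observed. All specific choices (group size \texttt{p}, hidden width, optimizer, weight-decay value, training fraction) are illustrative assumptions, not the paper's exact configuration.

\begin{verbatim}
# Illustrative sketch only: grokking-style setup on modular addition.
# Hyperparameters (p, d_hidden, frac_train, lr, weight_decay) are assumptions.
import torch
import torch.nn as nn

p = 23            # group size for (a + b) mod p; hypothetical choice
d_hidden = 256
frac_train = 0.5

# Full dataset of all pairs (a, b) with label (a + b) mod p, one-hot inputs.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
inputs = torch.cat([nn.functional.one_hot(pairs[:, 0], p),
                    nn.functional.one_hot(pairs[:, 1], p)], dim=1).float()

perm = torch.randperm(len(pairs))
n_train = int(frac_train * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# 2-layer nonlinear network: hidden feature layer followed by a top linear layer.
model = nn.Sequential(nn.Linear(2 * p, d_hidden), nn.ReLU(), nn.Linear(d_hidden, p))
# Weight decay is the key regularizer highlighted in the abstract.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(inputs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(inputs[test_idx]).argmax(-1)
                        == labels[test_idx]).float().mean().item()
        print(f"step {step}: train loss {loss.item():.3f}, test acc {test_acc:.3f}")
\end{verbatim}

In such runs, training loss typically drops early (the lazy/memorization stage) while test accuracy improves only much later, which is the delayed-generalization behavior that the $\mathbf{Li_2}$ framework aims to explain.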