While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there exists a mathematical framework that characterizes which features emerge, how and under which conditions they do so, and how this connects to the gradient dynamics of training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li_2}$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) \underline{\textbf{L}}azy learning, (II) \underline{\textbf{i}}ndependent feature learning and (III) \underline{\textbf{i}}nteractive feature learning. In the lazy learning stage, the top layer overfits to the random hidden representations and the model appears to memorize. Thanks to lazy learning and weight decay, the \emph{backpropagated gradient} $G_F$ from the top layer then carries information about the target label, with a specific structure that enables each hidden node to learn its representation \emph{independently}. Interestingly, the independent dynamics exactly follows \emph{gradient ascent} on an energy function $E$, whose local maxima are precisely the emerging features. We study whether these local-optima-induced features generalize, what representation power they have, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we prove how $G_F$ changes to focus on the missing features that still need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of feature emergence, memorization, and generalization, and reveals, from the first principles of gradient dynamics, why recent optimizers such as Muon are effective. Our analysis can be extended to multi-layer architectures.
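As a compact restatement of the independent-stage claim above (our own notation, made precise in the main text; $w_k$ is assumed to denote the incoming weights of hidden node $k$), the dynamics of stage (II) can be summarized as
\[
\dot{w}_k \;\propto\; \nabla_{w_k} E(w_k),
\]
so that each hidden node independently performs gradient ascent on the energy $E$, and the local maxima $w_k^\star$ of $E$ are exactly the features that emerge during grokking.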