We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second-moment estimator applied to the orthogonalized update directions, and (2) a sign-stabilized orthogonal update, in which the momentum is sign-transformed before orthogonalization. These two components jointly enable variance-adaptive scaling while maintaining a stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy that matches the root-mean-square update magnitude to Adam's, allowing direct reuse of existing learning-rate schedules without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can surpass Adam by more than 40% in training efficiency in large-scale scenarios.
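To make the two mechanisms concrete, below is a minimal PyTorch sketch of one AdaMuon-style update on a single 2-D weight matrix, reconstructed from this description alone. The Newton-Schulz coefficients follow the public Muon reference implementation; the hyperparameters (`beta1`, `beta2`, `eps`), the Adam-style bias correction, and the `target_rms` value of 0.2 are illustrative assumptions, not the paper's actual settings.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix with the quintic
    Newton-Schulz iteration; coefficients follow the public Muon
    reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)          # normalize so the iteration converges
    tall = X.shape[0] > X.shape[1]
    if tall:                          # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if tall else X

def adamuon_step(param, grad, m, v, step,
                 lr=1e-3, beta1=0.95, beta2=0.95,
                 eps=1e-8, target_rms=0.2):
    """One sketched AdaMuon update for a 2-D weight matrix.
    m and v are the first-moment and element-wise second-moment
    buffers (same shape as param); all tensors are mutated in place."""
    # Momentum (first-moment) accumulation, as in Muon.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Mechanism (2): sign-stabilized orthogonal update, i.e. the
    # momentum is sign-transformed before orthogonalization.
    O = newton_schulz_orthogonalize(torch.sign(m))
    # Mechanism (1): element-wise second-moment estimate on the
    # orthogonalized direction, with Adam-style bias correction (assumed).
    v.mul_(beta2).addcmul_(O, O, value=1 - beta2)
    v_hat = v / (1 - beta2 ** step)
    u = O / (v_hat.sqrt() + eps)
    # RMS-aligned rescaling: match the update's root-mean-square
    # magnitude to an Adam-like target (0.2 is an assumed value).
    u.mul_(target_rms / (u.pow(2).mean().sqrt() + eps))
    param.add_(u, alpha=-lr)

# Toy usage: a single update on a random 64x32 weight.
W = torch.randn(64, 32)
g = torch.randn_like(W)
m, v = torch.zeros_like(W), torch.zeros_like(W)
adamuon_step(W, g, m, v, step=1)
```

In this sketch, the sign transform fixes every entry of the pre-orthogonalization matrix to unit magnitude, so variance adaptivity comes entirely from the second-moment scaling, matching the division of labor between the two mechanisms described above.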


