Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
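To make the central quantity concrete, below is a minimal sketch, not the paper's published code, of how the maximum eigenvalue of the preconditioned Hessian could be estimated for Adam via power iteration on Hessian-vector products in PyTorch. The names `model`, `loss_fn`, and `batch` are hypothetical placeholders, and `optimizer` is assumed to be a `torch.optim.Adam` instance that has already taken at least one step (so its second-moment state exists).

```python
# Minimal sketch (assumptions noted above): estimate lambda_max of the
# preconditioned Hessian P^{-1/2} H P^{-1/2}, where H is the loss Hessian and
# P = diag(sqrt(v_hat) + eps) is Adam's diagonal preconditioner.
import torch


def preconditioned_sharpness(model, loss_fn, batch, optimizer, n_iters=20):
    """Power iteration on the symmetric operator P^{-1/2} H P^{-1/2}."""
    params = [p for p in model.parameters() if p.requires_grad]
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    # Keep the graph so Hessian-vector products can be taken below.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Read P^{-1/2} per parameter from Adam's state (bias-corrected second moment).
    inv_sqrt_p = {}
    for group in optimizer.param_groups:
        eps, beta2 = group["eps"], group["betas"][1]
        for p in group["params"]:
            state = optimizer.state[p]
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
            inv_sqrt_p[p] = (v_hat.sqrt() + eps).rsqrt()
    scale = [inv_sqrt_p[p] for p in params]

    # Random unit start vector.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    lam = 0.0
    for _ in range(n_iters):
        u = [s * x for s, x in zip(scale, v)]                   # P^{-1/2} v
        hu = torch.autograd.grad(grads, params, grad_outputs=u,
                                 retain_graph=True)             # H P^{-1/2} v
        av = [s * h for s, h in zip(scale, hu)]                 # P^{-1/2} H P^{-1/2} v
        lam = sum((x * y).sum() for x, y in zip(v, av)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((x * x).sum() for x in av))
        v = [x / norm for x in av]
    return lam
```

Under the abstract's claim, tracking this quantity during full-batch Adam training with $\beta_1 = 0.9$ should show it equilibrating near $38/\eta$ for step size $\eta$; the sketch above is one way such a measurement could be instrumented, not the authors' own implementation.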