In modern optimization methods used in deep learning, each update depends on the history of past iterates, often referred to as memory, and this dependence decays rapidly as the iterates recede into the past. For example, gradient descent with momentum has exponentially decaying memory through its exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. The memoryless algorithm is obtained by replacing all past iterates in the update with the current one and then adding a correction term arising from the memory (itself a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of memory-induced implicit anti-regularization that AdamW does, providing a theory-based explanation for Lion's recently documented better generalization performance.
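To make the replacement step concrete, here is a minimal numerical sketch assuming heavy-ball momentum as the algorithm with memory. The toy quadratic loss, the hyperparameter values, and the function names are illustrative assumptions, and the memory-induced correction term described above is deliberately omitted rather than guessed at; the sketch only shows how collapsing the history onto the current iterate yields a memoryless, rescaled gradient step.

```python
import numpy as np

# Heavy-ball momentum written explicitly as a sum over all past gradients,
#   x_{k+1} = x_k - lr * sum_{i<=k} beta**(k-i) * grad(x_i),
# which makes the exponentially decaying memory visible. Substituting the
# current iterate x_k for every past iterate x_i collapses the sum into a
# single gradient step with effective step size lr / (1 - beta). The paper's
# construction additionally adds a memory-induced correction term (a function
# of x_k alone); that term is not reproduced in this sketch.

def grad(x):
    """Gradient of a toy quadratic loss f(x) = 0.5 * x^T A x (assumed example)."""
    A = np.diag([1.0, 10.0])
    return A @ x

def momentum_with_memory(x0, lr=0.01, beta=0.9, steps=200):
    """Update depends on the full history of iterates (memory), with exponential decay."""
    x, past_grads = x0.copy(), []
    for _ in range(steps):
        past_grads.append(grad(x))
        update = sum(beta ** (len(past_grads) - 1 - i) * g
                     for i, g in enumerate(past_grads))
        x = x - lr * update
    return x

def memoryless_approximation(x0, lr=0.01, beta=0.9, steps=200):
    """All past iterates replaced by the current one: a single rescaled gradient step."""
    x = x0.copy()
    for _ in range(steps):
        x = x - (lr / (1.0 - beta)) * grad(x)
    return x

x0 = np.array([1.0, 1.0])
print(momentum_with_memory(x0))      # iterate of the algorithm with memory
print(memoryless_approximation(x0))  # iterate of the memoryless surrogate
```

Writing the momentum update as an explicit sum over past gradients, rather than as a recursively updated buffer, is what lets the substitution "past iterate → current iterate" be applied term by term before the sum is collapsed.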