Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers have explained this phenomenon by investigating the implicit regularization effect of optimization algorithms. A remarkable advance is the work of Lyu & Li (2019), which proves that gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Besides GD, adaptive algorithms such as AdaGrad, RMSProp, and Adam are popular owing to their rapid training process. However, theoretical guarantees for the generalization of adaptive optimization algorithms are still lacking. In this paper, we study the implicit regularization of adaptive optimization algorithms when they optimize the logistic loss on homogeneous deep neural networks. We prove that adaptive algorithms that adopt an exponential moving average strategy in the conditioner (such as Adam and RMSProp) can maximize the margin of the neural network, whereas AdaGrad, which directly sums historical squared gradients in its conditioner, cannot. This indicates the superiority, in terms of generalization, of the exponential moving average strategy in the design of the conditioner. Technically, we provide a unified framework for analyzing the convergent direction of adaptive optimization algorithms by constructing a novel adaptive gradient flow and surrogate margin. Our experiments support the theoretical findings on the convergent direction of adaptive optimization algorithms.
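To make the distinction between the two conditioner strategies concrete, the following is a minimal sketch of the two update rules referred to above; the function names, hyperparameters, and the omission of momentum and bias correction are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# AdaGrad-style conditioner: directly sums all historical squared gradients,
# so the accumulator grows monotonically over training.
def adagrad_conditioner(v, grad):
    return v + grad ** 2

# RMSProp/Adam-style conditioner: exponential moving average (EMA) of squared
# gradients, so older gradients are discounted at rate beta.
def ema_conditioner(v, grad, beta=0.999):
    return beta * v + (1.0 - beta) * grad ** 2

# A generic adaptive step that rescales the gradient by the conditioner
# (eta and eps are placeholder hyperparameters for this sketch).
def adaptive_step(theta, grad, v, eta=1e-3, eps=1e-8):
    return theta - eta * grad / (np.sqrt(v) + eps)
```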