Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence of the need for a new explanation of the success of adaptive methods, one that differs from the conventional wisdom.
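To make the trajectory statistic concrete, below is a minimal sketch of how one might track a condition-number-like quantity along an optimizer's iterates. The exact definition of $R^{\text{OPT}}_{\text{med}}$ is not reproduced here; the particular choices in this sketch, namely the ratio of the largest to the $k$-th largest Hessian eigenvalue (estimated by power iteration with deflation on Hessian-vector products) and the median aggregation over logged iterates, are assumptions made purely for illustration.

```python
# Hypothetical sketch: a condition-number-like statistic along a trajectory.
# Assumes snapshots of (loss tensor, list of parameters with requires_grad=True)
# have been logged every few steps of Adam or SGD.
import torch


def hvp(loss, params, vec):
    """Hessian-vector product via double backpropagation."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(torch.dot(flat_grad, vec), params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])


def top_eigenvalues(loss, params, k=2, iters=50):
    """Estimate the k largest-magnitude Hessian eigenvalues by power iteration with deflation."""
    dim = sum(p.numel() for p in params)
    device = params[0].device
    eigvals, eigvecs = [], []
    for _ in range(k):
        v = torch.randn(dim, device=device)
        v /= v.norm()
        for _ in range(iters):
            hv = hvp(loss, params, v)
            # Deflate directions found so far: H'v = Hv - sum_i lam_i (u_i . v) u_i.
            for lam, u in zip(eigvals, eigvecs):
                hv = hv - lam * torch.dot(u, v) * u
            v = hv / (hv.norm() + 1e-12)
        hv = hvp(loss, params, v)
        for lam, u in zip(eigvals, eigvecs):
            hv = hv - lam * torch.dot(u, v) * u
        eigvals.append(torch.dot(v, hv).item())
        eigvecs.append(v)
    return eigvals


def r_med(trajectory, k=2):
    """Median over iterates of |lambda_1| / |lambda_k|, a stand-in for R^{OPT}_{med}."""
    ratios = []
    for loss, params in trajectory:
        lams = top_eigenvalues(loss, params, k=k)
        ratios.append(abs(lams[0]) / (abs(lams[-1]) + 1e-12))
    return float(torch.tensor(ratios).median())
```

In this spirit, one would log snapshots along an Adam run and an SGD run from the same initialization and compare the two resulting medians; the paper's claim is that the Adam trajectory concentrates where this quantity is small and the SGD trajectory where it is comparatively large.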