We conjecture that the difference in generalisation between adaptive and non-adaptive gradient methods stems from the failure of adaptive methods to account for the greater levels of noise associated with flatter directions in their estimates of local curvature. This conjecture, motivated by results in random matrix theory, has implications for optimisation in both simple convex settings and deep neural networks. We demonstrate that typical schedules used for adaptive methods (with low numerical stability or damping constants) serve to bias movement towards flat directions relative to sharp directions, effectively amplifying the noise-to-signal ratio and harming generalisation. We show that the numerical stability/damping constant used in these methods can be decomposed into a learning rate reduction and a linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements by increasing the shrinkage coefficient, closing the generalisation gap entirely in our neural network experiments. Finally, we show that other popular modifications to adaptive methods, such as decoupled weight decay and partial adaptivity, can be understood as calibrating parameter updates to make better use of sharper, more reliable directions.
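As a minimal sketch of the decomposition referred to above (the notation here, with a curvature estimate $B$, learning rate $\alpha$, damping constant $\delta$, and loss $L$, is introduced purely for illustration and is not fixed by this abstract), a damped adaptive update can be rewritten as
\[
-\,\alpha\,(B + \delta I)^{-1}\nabla L
\;=\;
-\,\underbrace{\frac{\alpha}{1+\delta}}_{\text{reduced learning rate}}
\Big(\underbrace{\tfrac{1}{1+\delta}\,B + \tfrac{\delta}{1+\delta}\,I}_{\text{linear shrinkage of }B}\Big)^{-1}\nabla L,
\]
i.e. the damping constant acts simultaneously as a learning rate reduction by the factor $1/(1+\delta)$ and as a linear shrinkage of the estimated curvature matrix towards the identity with coefficient $\delta/(1+\delta)$.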