We conjecture that the difference in generalisation between adaptive and non-adaptive gradient methods stems from the failure of adaptive methods to account for the greater levels of noise associated with flatter directions in their estimates of local curvature. This conjecture, motivated by results in random matrix theory, has implications for optimisation in both simple convex settings and deep neural networks. We demonstrate that the typical schedules used for adaptive methods (with low numerical stability or damping constants) bias relative movement towards flat directions over sharp ones, effectively amplifying the noise-to-signal ratio and harming generalisation. We show that the numerical stability/damping constant used in these methods can be decomposed into a learning rate reduction and a linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements by increasing the shrinkage coefficient, closing the generalisation gap entirely in our deep neural network experiments. Finally, we show that other popular modifications to adaptive methods, such as decoupled weight decay and partial adaptivity, also serve to calibrate parameter updates to make better use of sharper, more reliable directions.
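As a minimal sketch of the claimed decomposition (using generic notation not taken from the paper: $\mathbf{H}$ for the estimated curvature matrix, $\delta$ for the numerical stability/damping constant, and $\alpha$ for the learning rate), the damped preconditioned step can be rewritten as
\[
\alpha\,(\mathbf{H} + \delta\mathbf{I})^{-1}
\;=\;
\underbrace{\frac{\alpha}{1+\delta}}_{\text{reduced learning rate}}
\Big(\underbrace{\tfrac{1}{1+\delta}\,\mathbf{H} + \tfrac{\delta}{1+\delta}\,\mathbf{I}}_{\text{linear shrinkage of }\mathbf{H}}\Big)^{-1},
\]
i.e. a smaller effective learning rate applied to a curvature estimate shrunk towards the identity, with shrinkage coefficient $\delta/(1+\delta)$ that grows with the damping constant.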