Momentum is a popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, making it competitive with momentum SGD with learning rate decay even in settings where adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD improves over momentum SGD with learning rate decay in most cases. Notably, Demon momentum SGD is observed to be significantly less sensitive to parameter tuning than momentum SGD with a learning rate decay schedule, which is critical for training neural networks in practice. Results are demonstrated across a variety of settings and architectures, including image classification, generative models, and language models. Demon is easy to implement and tune, and incurs limited extra computational overhead compared to its vanilla counterparts. Code is readily available.
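To make the motivating idea concrete, the sketch below illustrates one way a decaying-momentum schedule can be derived from "decaying the total contribution of a gradient to all future updates": under a fixed momentum coefficient beta, a gradient's total weight in all future momentum buffers is proportional to beta / (1 - beta); decaying that quantity linearly over training and solving for beta at each step yields a schedule that starts at beta_init and reaches zero at the final step. This is a hedged, illustrative reconstruction, not the paper's definitive implementation; the function names (demon_beta, sgd_demon_step) and the linear decay target are assumptions for illustration.

```python
import numpy as np


def demon_beta(step, total_steps, beta_init=0.9):
    """Illustrative decaying-momentum schedule (assumption, not the paper's exact rule).

    With momentum beta, a gradient's total contribution to all future updates
    is proportional to beta / (1 - beta). Decaying that quantity linearly to
    zero over training and solving for beta gives the expression below.
    """
    frac = 1.0 - step / total_steps  # remaining fraction of training
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


def sgd_demon_step(params, grads, velocity, lr, step, total_steps, beta_init=0.9):
    """One momentum-SGD update using the decayed momentum coefficient."""
    beta_t = demon_beta(step, total_steps, beta_init)
    velocity = beta_t * velocity + grads      # momentum buffer with decayed beta
    params = params - lr * velocity
    return params, velocity


# Minimal usage example on a toy quadratic objective f(x) = 0.5 * ||x||^2.
params = np.array([1.0, -2.0])
velocity = np.zeros_like(params)
total_steps = 100
for step in range(total_steps):
    grads = params  # gradient of the toy objective
    params, velocity = sgd_demon_step(params, grads, velocity,
                                      lr=0.1, step=step, total_steps=total_steps)
```

At step 0 the schedule returns beta_init, and at the final step it returns 0, so momentum is gradually "turned off" over the course of training rather than the learning rate being decayed.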