Momentum is a simple and popular technique in deep learning for gradient-based optimizers. We propose a decaying momentum (Demon) rule, motivated by decaying the total contribution of a gradient to all future updates. Applying Demon to Adam leads to significantly improved training, making it competitive with momentum SGD with learning rate decay, even in settings where adaptive methods are typically non-competitive. Similarly, applying Demon to momentum SGD improves over momentum SGD with a learning rate decay schedule in most cases. Notably, Demon momentum SGD is observed to be significantly less sensitive to parameter tuning than momentum SGD with a learning rate decay schedule, which is critical to training deep neural networks in practice. Results are demonstrated across a variety of settings and architectures, including image classification, generative models, and language models. Demon is trivial to implement, easy to tune, and incurs limited extra computational overhead compared to the vanilla counterparts. Code is readily available.
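As a rough illustration of how a decaying momentum rule can be dropped into a standard momentum SGD step, the sketch below decays the momentum coefficient toward zero over the course of training so that a given gradient's total contribution to future updates shrinks. The specific closed-form schedule in `demon_beta` and the helper names are assumptions made for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np


def demon_beta(beta_init: float, t: int, T: int) -> float:
    """Illustrative decaying-momentum schedule (assumed form, not the paper's exact rule).

    Decays the momentum coefficient so that it reaches 0 at step T,
    shrinking the total contribution of each gradient to future updates.
    """
    frac = 1.0 - t / T
    return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)


def demon_sgdm_step(params, grads, velocity, lr, beta_init, t, T):
    """One momentum SGD update using the decayed momentum coefficient."""
    beta_t = demon_beta(beta_init, t, T)
    velocity = beta_t * velocity + grads   # momentum buffer with decayed beta
    params = params - lr * velocity        # gradient step
    return params, velocity


# Toy usage: minimize ||x||^2 with a decaying momentum coefficient.
params = np.array([1.0, -2.0])
velocity = np.zeros_like(params)
T = 100
for t in range(T):
    grads = 2.0 * params                   # gradient of ||x||^2
    params, velocity = demon_sgdm_step(params, grads, velocity,
                                       lr=0.05, beta_init=0.9, t=t, T=T)
```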