Momentum methods have been used extensively in optimizers for deep learning. Recent studies show that distributed training through K-step averaging has many desirable properties. We propose a momentum method for such model averaging approaches. At the individual learner level, traditional stochastic gradient descent is applied. At the meta level (global learner level), a momentum term is applied to the averaged model update, which we call block momentum. We analyze the convergence and scaling properties of such momentum methods. Our experimental results show that block momentum not only accelerates training, but also achieves better final results.
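To make the two-level structure concrete, below is a minimal Python sketch of one meta-level round under the assumptions of a synchronous setting and a shared gradient function; the names `block_momentum_round`, `beta`, and `block_lr` are illustrative placeholders, not notation taken from the paper.

```python
import numpy as np

def local_sgd(w, data_shard, grad_fn, lr=0.01, k_steps=4):
    """Run K steps of plain SGD on one learner's data shard, starting from w."""
    w = w.copy()
    for x, y in data_shard[:k_steps]:
        w -= lr * grad_fn(w, x, y)
    return w

def block_momentum_round(w_global, v, shards, grad_fn,
                         lr=0.01, k_steps=4, beta=0.9, block_lr=1.0):
    """One meta-level round: local SGD on each learner, K-step model
    averaging, then a momentum term applied to the block-level update.
    (beta and block_lr are assumed hyperparameters for this sketch.)"""
    # Each learner starts from the current global model and runs K local SGD steps.
    local_models = [local_sgd(w_global, s, grad_fn, lr, k_steps) for s in shards]
    # K-step averaging across learners.
    w_avg = np.mean(local_models, axis=0)
    # Block update: displacement of the averaged model from the global model.
    block_update = w_avg - w_global
    # Block momentum: accumulate block updates with momentum coefficient beta.
    v = beta * v + block_lr * block_update
    w_global = w_global + v
    return w_global, v
```

In this sketch the momentum state `v` lives only at the global level, so per-learner training remains ordinary SGD and the extra cost per round is a single buffer of model size.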