In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
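For concreteness, below is a minimal NumPy sketch of the factored second-moment estimate described above: only per-row and per-column sums of the exponential moving average of squared gradients are stored, and the per-parameter estimate is reconstructed as a rank-1 outer product. The helper name `factored_second_moment_update`, the accumulator names `R` and `C`, and the constants are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def factored_second_moment_update(R, C, grad, beta2=0.999, eps=1e-30):
    """One step of a factored second-moment estimate for a weight matrix.

    Instead of a full (n x m) accumulator of squared gradients, only the
    per-row sums R (shape n) and per-column sums C (shape m) of the
    exponential moving average are maintained. Illustrative sketch only.
    """
    sq = np.square(grad) + eps
    # Update the row-sum and column-sum accumulators of the moving average.
    R = beta2 * R + (1.0 - beta2) * sq.sum(axis=1)   # per-row sums
    C = beta2 * C + (1.0 - beta2) * sq.sum(axis=0)   # per-column sums
    # Rank-1 reconstruction of the per-parameter second moments:
    # V_hat[i, j] = R[i] * C[j] / sum(R), where sum(R) == sum(C).
    V_hat = np.outer(R, C) / R.sum()
    return R, C, V_hat

# Usage: scale the update by the inverse square root of the estimate.
n, m = 4, 3
R, C = np.zeros(n), np.zeros(m)
grad = np.random.randn(n, m)
R, C, V_hat = factored_second_moment_update(R, C, grad)
update = grad / np.sqrt(V_hat)
```

The design point illustrated here is the memory saving: the accumulators require O(n + m) storage per n-by-m weight matrix rather than the O(nm) needed for full per-parameter second moments.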