When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale's convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale's qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.
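To make the "continually adapting to the gradient's variance" idea concrete, below is a minimal, illustrative Python sketch of a variance-based gain computation in the spirit of AdaScale. It is not the authors' reference implementation: the function name `adascale_gain`, the moving-average-free estimators, and the placeholder gradients are assumptions made for illustration. The sketch forms a gain r in [1, S] from S per-worker gradients and uses it to scale the base learning rate and to advance a scale-invariant step counter.

```python
import numpy as np

def adascale_gain(per_worker_grads):
    """Illustrative estimate of a variance-based gain ratio r in [1, S].

    Assumed form (not taken verbatim from the source text):
        r = (sigma^2 + ||mu||^2) / (sigma^2 / S + ||mu||^2),
    where mu is the mean gradient across S workers and sigma^2 is the
    per-worker gradient variance. r -> 1 when workers' gradients agree
    (averaging adds little), and r -> S when noise dominates.
    """
    grads = np.stack(per_worker_grads)      # shape (S, num_params)
    S = grads.shape[0]
    mean_grad = grads.mean(axis=0)
    mu_sq = float(mean_grad @ mean_grad)    # ||mu||^2 estimate
    # Simple unbiased estimate of per-worker gradient variance.
    sigma_sq = float(((grads - mean_grad) ** 2).sum() / (S - 1))
    return (sigma_sq + mu_sq) / (sigma_sq / S + mu_sq + 1e-12)

# Hypothetical usage: scale the base learning rate by r each step and
# advance a "scale-invariant" iteration counter by r, stopping once the
# counter reaches the original small-batch step budget.
base_lr, scale_invariant_steps, budget = 0.1, 0.0, 10_000
per_worker_grads = [np.random.randn(4) for _ in range(8)]  # placeholder gradients
r = adascale_gain(per_worker_grads)
lr = r * base_lr
scale_invariant_steps += r
```

In practice, an implementation would estimate the variance and mean-gradient terms with running averages rather than from a single step; the sketch above only shows how such a gain could translate the gradient's noise level into an adapted learning rate without introducing new hyperparameters.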