Normalization techniques are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance. We propose a simple and effective remedy, SGDP and AdamP: remove the radial component, i.e. the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus preserving the original convergence properties of GD optimizers. Given the ubiquity of momentum-based GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks such as classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains across these benchmarks. Source code is available at https://github.com/clovaai/AdamP.
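To make the core idea concrete, below is a minimal NumPy sketch of an SGD-with-momentum step that removes the radial (norm-increasing) component of the update before applying it, as described above. The function names (`project_out_radial`, `sgdp_step`), the plain heavy-ball momentum formulation, and the toy loop are illustrative assumptions for exposition only; they are not the released SGDP/AdamP implementation, which lives at the repository linked above.

```python
import numpy as np

def project_out_radial(update, weight, eps=1e-8):
    # Remove the component of `update` parallel to `weight` (the radial,
    # norm-increasing direction), keeping only the tangential component.
    w = weight.ravel()
    u = update.ravel()
    radial = (u @ w) / (w @ w + eps) * w
    return (u - radial).reshape(update.shape)

def sgdp_step(weight, grad, momentum_buf, lr=0.1, momentum=0.9):
    # One SGD-with-momentum step where the radial component of the update
    # is projected out before the step is taken (illustrative sketch).
    momentum_buf = momentum * momentum_buf + grad       # standard heavy-ball momentum
    update = project_out_radial(momentum_buf, weight)   # keep only the tangential part
    weight = weight - lr * update
    return weight, momentum_buf

# Toy usage: one scale-invariant weight vector driven by random gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(4,))
buf = np.zeros_like(w)
for _ in range(3):
    g = rng.normal(size=(4,))
    w, buf = sgdp_step(w, g, buf)
    print(np.linalg.norm(w))  # norm growth comes only from the second-order term
```

Because the applied update is orthogonal to the weight vector, the weight norm can only grow through the small second-order term lr² · ||update||², rather than through the first-order radial drift that momentum would otherwise accumulate; for a scale-invariant weight this changes only the effective step size, not the effective update direction.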