Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance. Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients. However, as noted by Kingma and Ba (2015), any number of $L^p$ normalizations are possible, with the RMS corresponding to the specific case of $p=2$. In our work, we theoretically and empirically characterize the influence of different $L^p$ norms on adaptive gradient methods for the first time. We show mathematically how the choice of $p$ influences the size of the steps taken, while leaving other desirable properties unaffected. We evaluate Adam with various $L^p$ norms on a suite of deep learning benchmarks, and find that $p > 2$ consistently leads to improved learning speed and final performance. The choices of $p=3$ or $p=6$ also match or outperform state-of-the-art methods in all of our experiments.
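As a rough illustration of the $L^p$ generalization described above, the following is a minimal NumPy sketch of a single update step. The function name `lp_adam_step`, the hyperparameter defaults, and the exact placement of the bias correction are our own assumptions for exposition and are not taken from the paper; only the replacement of the squared-gradient average by an average of $|g|^p$, with the denominator taken to the power $1/p$, reflects the idea discussed here.

```python
import numpy as np

def lp_adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, p=3):
    """One Adam-style step with the RMS (p=2) denominator generalized to an L^p norm.

    Note: hyperparameter defaults and bias-correction details are illustrative
    assumptions, not the paper's specification.
    """
    # First moment: exponential moving average of gradients, as in Adam.
    m = beta1 * m + (1 - beta1) * grad
    # p-th moment: exponential moving average of |g|^p; p = 2 recovers Adam's RMS.
    v = beta2 * v + (1 - beta2) * np.abs(grad) ** p
    # Bias correction, mirroring Adam's (assumed form).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Normalize by the L^p root of the averaged p-th powers of recent gradients.
    param = param - lr * m_hat / (v_hat ** (1.0 / p) + eps)
    return param, m, v
```

Setting $p=2$ in this sketch recovers the standard Adam update; larger $p$ changes only the normalizing denominator, which is the quantity whose effect on step size the abstract refers to.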