Normalization Layers (NLs) are widely used in modern deep-learning architectures. Despite their apparent simplicity, their effect on optimization is not yet fully understood. This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective. Concretely, the radial invariance of groups of parameters, such as filters in convolutional neural networks, allows the optimization steps to be translated onto the $L_2$ unit hypersphere. This formulation and the associated geometric interpretation shed new light on the training dynamics. Firstly, the first expression of the effective learning rate of Adam is derived from the framework. Then it is shown that, in the presence of NLs, performing Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Finally, the analysis highlights phenomena that previous variants of Adam act upon, and their importance in the optimization process is validated experimentally.
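The radial invariance the abstract relies on can be checked numerically: scaling a filter by any positive constant leaves the output of a following normalization layer essentially unchanged, so the loss only depends on the filter's direction on the $L_2$ unit hypersphere. Below is a minimal sketch, assuming PyTorch; the Conv2d/BatchNorm2d setup and the scaling factor are illustrative choices, not the paper's experimental code.

```python
# Minimal sketch of radial invariance: BatchNorm divides out the scale of the
# preceding convolution filters, so only the filter direction matters.
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)  # no bias: scaling acts purely radially
bn = nn.BatchNorm2d(8)                              # training mode: uses batch statistics
x = torch.randn(4, 3, 16, 16)

out_ref = bn(conv(x))

# Rescale the filters by an arbitrary positive factor and recompute.
with torch.no_grad():
    conv.weight.mul_(7.3)
out_scaled = bn(conv(x))

# Outputs match up to BatchNorm's eps term: the normalization cancels the
# radial component of the parameters.
print(torch.allclose(out_ref, out_scaled, atol=1e-3))  # True
```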