A fundamental property of deep learning normalization techniques, such as batch normalization, is that they make the pre-normalization parameters scale-invariant. The intrinsic domain of such parameters is the unit sphere, so their gradient optimization dynamics can be represented as spherical optimization with a varying effective learning rate (ELR), which has been studied previously. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere with a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail through both a theoretical examination of a toy example and a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with prior research on training both regular and scale-invariant neural networks. Finally, we demonstrate how the discovered regimes manifest in conventional training of normalized networks and how they can be leveraged to reach better optima.
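To make the setting concrete, the following is a minimal sketch (not the paper's implementation) of a fixed-ELR gradient step on the unit sphere: the gradient is projected onto the tangent space at the current point, a step of size `elr` is taken, and the result is renormalized. The function name `spherical_sgd_step` and the toy loss are illustrative assumptions.

```python
import numpy as np

def spherical_sgd_step(w, grad, elr):
    """One fixed-ELR gradient step constrained to the unit sphere.

    Hypothetical sketch: remove the radial component of the gradient
    (a scale-invariant loss has none, but we project for safety),
    step with the fixed effective learning rate, then renormalize.
    """
    g_tan = grad - np.dot(grad, w) * w      # tangent-space projection
    w_new = w - elr * g_tan
    return w_new / np.linalg.norm(w_new)    # back onto the sphere

# Toy scale-invariant objective: L(w) = 0.5 * ||w/||w|| - t||^2 for a
# unit target direction t; only the direction of w affects the loss.
t = np.array([0.0, 1.0])
w = np.array([1.0, 0.0])
for _ in range(200):
    grad = w - t                            # gradient at a unit-norm w
    w = spherical_sgd_step(w, grad, elr=0.1)
```

With a small ELR this iteration stays on the sphere and converges to the target direction; the abstract's other two regimes (chaotic equilibrium, divergence) appear as the fixed ELR is increased.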