A fundamental property of deep learning normalization techniques, such as batch normalization, is that they make the pre-normalization parameters scale-invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with a varying effective learning rate (ELR), as studied in prior work. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both through a theoretical examination of a toy example and through a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on the training of both regular and scale-invariant neural networks. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
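The setup described above can be sketched in a minimal toy form. The snippet below, an illustration rather than the paper's actual procedure, uses a hypothetical scale-invariant loss (it depends only on the direction of the weight vector, with an assumed target direction `t` standing in for the intrinsic optimum) and runs fixed-ELR gradient descent directly on the unit sphere: project the gradient onto the tangent space, step, and renormalize. A moderate ELR lands in the convergence regime; a much larger value would instead produce chaotic wandering or divergence.

```python
import numpy as np

# Toy scale-invariant loss: it depends only on the direction of w,
# so loss(c * w) == loss(w) for any c > 0. The target direction `t`
# is a hypothetical stand-in for the intrinsic optimum.
def loss(w, t):
    return 1.0 - (w @ t) / np.linalg.norm(w)

def grad(w, t):
    n = np.linalg.norm(w)
    return -t / n + (w @ t) * w / n**3

rng = np.random.default_rng(0)
t = rng.normal(size=5)
t /= np.linalg.norm(t)
w = rng.normal(size=5)

# Scale invariance: rescaling w leaves the loss unchanged.
assert abs(loss(w, t) - loss(3.7 * w, t)) < 1e-12

# Fixed-ELR training directly on the unit sphere: project the
# gradient onto the tangent space, take a step, renormalize.
elr = 0.1  # a moderate value, illustrating the convergence regime
u = w / np.linalg.norm(w)
for _ in range(200):
    g = grad(u, t)
    g_tan = g - (g @ u) * u        # tangent-space projection
    u = u - elr * g_tan
    u /= np.linalg.norm(u)         # retract back onto the sphere
```

With this ELR the iterates stay on the sphere and the loss decays toward zero; the same loop with a very large ELR overshoots the optimum each step, which is the qualitative distinction between the regimes the abstract refers to.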