Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry-induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic expressions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.
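As one concrete instance of the framework summarized above, the sketch below works through the case of a parameter block $w$ whose loss is scale-invariant (as arises, for example, for weights feeding into a normalization layer); the choice of scale symmetry here is an illustrative assumption, not the only symmetry covered, and all gradients and Hessians are taken with respect to the block $w$.

Scale invariance, $\mathcal{L}(\alpha w) = \mathcal{L}(w)$ for all $\alpha > 0$, gives, by differentiating at $\alpha = 1$, the geometric constraint
\[
  \langle \nabla \mathcal{L}(w),\, w \rangle = 0 ,
\]
so under gradient flow $\dot{w} = -\nabla \mathcal{L}(w)$ the squared norm is conserved:
\[
  \frac{d}{dt}\, \lVert w \rVert^2
  = 2\,\langle w, \dot{w} \rangle
  = -2\,\langle w, \nabla \mathcal{L}(w) \rangle
  = 0 .
\]
At a finite learning rate $\eta$, the first-order modified gradient flow obtained from backward error analysis of the discrete update $w_{t+1} = w_t - \eta \nabla \mathcal{L}(w_t)$ is
\[
  \dot{w} = -\nabla \widetilde{\mathcal{L}}(w),
  \qquad
  \widetilde{\mathcal{L}}(w) = \mathcal{L}(w) + \frac{\eta}{4}\,\lVert \nabla \mathcal{L}(w) \rVert^2 .
\]
Differentiating the constraint $\langle \nabla \mathcal{L}(w), w \rangle = 0$ with respect to $w$ gives $\nabla^2 \mathcal{L}(w)\, w = -\nabla \mathcal{L}(w)$, and therefore, along the modified flow,
\[
  \frac{d}{dt}\, \lVert w \rVert^2
  = -\eta\,\langle \nabla^2 \mathcal{L}(w)\, w,\, \nabla \mathcal{L}(w) \rangle
  = \eta\, \lVert \nabla \mathcal{L}(w) \rVert^2 ,
\]
so the conservation law is broken at finite $\eta$ and the norm instead obeys the integral expression $\lVert w(t) \rVert^2 = \lVert w(0) \rVert^2 + \eta \int_0^t \lVert \nabla \mathcal{L}(w(s)) \rVert^2\, ds$, an example of the exact integral expressions referred to in the abstract.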