Stochastic gradient descent (SGD), an algorithm widely used to train deep neural networks, has attracted continued study of the theoretical principles behind its success. A recent work uncovered a generic inverse variance-flatness (IVF) relation between the variance of the neural-network weights and the flatness of the loss landscape near solutions found by SGD [Feng & Tu, PNAS 118, 0027 (2021)]. To investigate this apparent violation of statistical principles, we deploy a stochastic decomposition to analyze the dynamical properties of SGD. The method constructs the true "energy" function that enters the Boltzmann distribution. This energy differs from the usual cost function and explains the IVF relation under SGD. We further verify the scaling relation identified in Feng and Tu's work. Our approach may bridge the gap between classical statistical mechanics and the emerging discipline of artificial intelligence, with the potential to yield better algorithms for the latter.
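As a minimal sketch of the kind of construction the abstract alludes to (the notation here, $L$ for the loss, $\Phi$ for the constructed energy, $D$ and $Q$ for the symmetric and antisymmetric parts of the decomposition, and $\epsilon$ for the noise strength, is generic to potential-decomposition treatments of SGD and is not taken verbatim from the paper), one may model SGD near a solution as a stochastic differential equation and decompose its drift so that the steady state is Boltzmann-like in $\Phi$ rather than in $L$:

\begin{align}
  % SGD near a solution modeled as a stochastic differential equation
  d\theta &= -\nabla L(\theta)\,dt + \sqrt{2\epsilon}\,\sigma(\theta)\,dW_t, \\
  % stochastic decomposition of the drift into a constructed potential \Phi,
  % a symmetric part D and an antisymmetric part Q
  -\nabla L(\theta) &= -\bigl[D(\theta) + Q(\theta)\bigr]\nabla\Phi(\theta), \\
  % the steady-state distribution is Boltzmann-like in \Phi, not in the loss L
  P_{\mathrm{ss}}(\theta) &\propto \exp\!\bigl[-\Phi(\theta)/\epsilon\bigr], \\
  % equipartition along a principal direction i then ties the weight variance
  % to the curvature of \Phi rather than to the flatness of L
  \langle \delta\theta_i^{2} \rangle &\approx \frac{\epsilon}{\partial_i^{2}\Phi .}
\end{align}

In such a picture, because $\Phi$ generally differs from $L$, the weight variance along a given direction tracks the curvature of $\Phi$, so a direction that is flat in $L$ need not exhibit a large variance; this is how a Boltzmann-type description can coexist with the IVF relation without contradicting equipartition.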