A fundamental open problem in deep learning theory is how to define and understand the stability of stochastic gradient descent (SGD) near a fixed point. The conventional literature relies on the convergence of the statistical moments of the parameters, especially the variance, to quantify stability. We revisit the definition of stability for SGD and use the \textit{convergence in probability} condition to define the \textit{probabilistic stability} of SGD. The proposed notion of stability directly addresses a fundamental question in deep learning theory: how does SGD select a meaningful solution for a neural network from an enormous number of solutions that may overfit badly? To this end, we show that only under the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning, convergence to low-rank saddles, and correct learning. When applied to a neural network, these phase diagrams imply that SGD prefers low-rank saddles when the underlying gradient is noisy, thereby improving learning performance. This result stands in sharp contrast to the conventional wisdom that SGD prefers flat minima to sharp ones, which we find insufficient to explain the experimental data. We also prove that the probabilistic stability of SGD can be quantified by the Lyapunov exponents of the SGD dynamics, which can be measured easily in practice. Our work potentially opens a new avenue for addressing the fundamental question of how the learning algorithm affects the learning outcome in deep learning.
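For concreteness, the following is a minimal sketch of how the two central quantities could be formalized near a fixed point $\theta^\ast$, assuming a linearized SGD update $\theta_{t+1} - \theta^\ast = (I - \eta \hat{H}_t)(\theta_t - \theta^\ast)$ with learning rate $\eta$ and a stochastic Hessian estimate $\hat{H}_t$; the notation here is illustrative rather than the paper's exact statement:
\begin{align*}
    &\text{probabilistic stability:} && \theta_t \xrightarrow{\;P\;} \theta^\ast, \quad \text{i.e.,}\ \lim_{t\to\infty} \Pr\!\big(\|\theta_t - \theta^\ast\| > \epsilon\big) = 0 \ \text{for every } \epsilon > 0;\\
    &\text{Lyapunov exponent:} && \Lambda = \lim_{t\to\infty} \frac{1}{t}\, \mathbb{E}\!\left[\log \big\|(I - \eta \hat{H}_{t-1}) \cdots (I - \eta \hat{H}_0)\big\|\right],
\end{align*}
so that, roughly speaking and under suitable conditions, $\Lambda < 0$ corresponds to probabilistic stability of the fixed point and $\Lambda > 0$ to its loss.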