The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of this algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD; in contrast, our paper considers an algorithm with momentum. More specifically, after proving the existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout stability and connectivity of SHB solutions.
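For concreteness, the SHB iteration studied above can be sketched as follows; this is a minimal illustration of SGD with Polyak's momentum on a toy quadratic objective, with hyperparameters chosen for illustration only (they are not taken from the paper).

```python
import numpy as np

def shb_step(w, v, grad, lr=0.1, momentum=0.9):
    """One stochastic heavy ball (SHB) step:
    v_{k+1} = momentum * v_k - lr * grad(w_k)
    w_{k+1} = w_k + v_{k+1}
    """
    v = momentum * v - lr * grad
    w = w + v
    return w, v

# Toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = shb_step(w, v, grad=w)
print(np.linalg.norm(w))  # small residual: the iterates spiral into the optimum
```

In training a finite-width network, `grad` would be a stochastic minibatch gradient of the loss with respect to the neuron parameters; the mean-field analysis tracks the empirical distribution of these parameters as the width grows.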