Training deep neural networks and other modern machine learning models usually requires solving non-convex optimisation problems that are high-dimensional and involve large-scale data. Momentum-based stochastic optimisation algorithms have become especially popular for this task in recent years. The stochasticity arises from data subsampling, which reduces computational cost. Moreover, both momentum and stochasticity are supposed to help the algorithm overcome local minimisers and, hopefully, converge globally. Theoretically, this combination of stochasticity and momentum is poorly understood. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the particle movement by an underdamped dynamical system and the data subsampling through a stochastic switching of the dynamical system. In our analysis, we investigate longtime limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time: intuitively, the momentum helps to overcome local minimisers in the initial phase of the algorithm, but prohibits fast convergence to a global minimiser later. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when the momentum is reduced over time and the subsampling rate goes to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In numerical experiments, we study our discretisation scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network to solve the CIFAR-10 image classification problem. Here, our algorithm achieves results competitive with stochastic gradient descent with momentum.
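To make the ingredients named above concrete (an underdamped system, stochastic switching between data subsamples, and momentum reduced over time), the following is a minimal sketch and not the scheme proposed in the paper. It assumes the dynamics q' = v, m(t) v' = -alpha v - grad Phi_i(q), where the mass schedule `m0 / (1 + t)**2`, the switching rate `rate0 * (1 + t)`, the friction `alpha`, and the least-squares test problem are all hypothetical choices made for illustration. The time step is a semi-implicit Euler step (friction treated implicitly, position updated with the new velocity), which is symplectic for the frictionless part and remains stable as the mass tends to zero.

```python
import numpy as np


def sgd_momentum_switching(A, b, q0, h=0.01, steps=5000, alpha=1.0,
                           m0=1.0, rate0=10.0, seed=0):
    """Simulate q' = v, m(t) v' = -alpha*v - grad Phi_i(q) with a random index i.

    Phi_i(q) = 0.5 * (A[i] @ q - b[i])**2 is the loss of data point i.
    All schedules and parameter names are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    q = np.array(q0, dtype=float)
    v = np.zeros_like(q)
    i = rng.integers(n)  # currently active data point
    for k in range(steps):
        t = k * h
        m = m0 / (1.0 + t) ** 2    # assumed schedule: momentum shrinks over time
        rate = rate0 * (1.0 + t)   # assumed schedule: switching rate grows over time
        # The index process jumps after an exponential waiting time; within one
        # step of length h a jump occurs with probability 1 - exp(-h * rate).
        if rng.random() < 1.0 - np.exp(-h * rate):
            i = rng.integers(n)
        grad = (A[i] @ q - b[i]) * A[i]
        # Solve m * (v_new - v) / h = -alpha * v_new - grad for v_new.
        v = (m * v - h * grad) / (m + h * alpha)
        # Position update with the new velocity (semi-implicit Euler).
        q = q + h * v
    return q


def full_loss(A, b, q):
    """Average loss over all data points."""
    return 0.5 * np.mean((A @ q - b) ** 2)


# Convex test problem: consistent linear least squares with known minimiser.
data_rng = np.random.default_rng(1)
A = data_rng.normal(size=(20, 2))
q_star = np.array([1.0, -2.0])
b = A @ q_star
q0 = np.zeros(2)

q_final = sgd_momentum_switching(A, b, q0)
print("initial loss:", full_loss(A, b, q0))
print("final loss:  ", full_loss(A, b, q_final))
```

In this sketch, letting `m` go to zero turns the velocity update into `v = -grad / alpha`, so the iteration reduces to plain stochastic gradient descent with step size `h / alpha`. This mirrors the momentum-to-no-momentum limit mentioned in the abstract.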