The representation of functions by artificial neural networks depends on a large number of parameters in a non-linear fashion. Suitable parameters are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous-time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime the optimization algorithm prefers 'flat' minima of the objective function, in a sense that differs from the flat-minimum selection of continuous-time SGD with homogeneous noise.
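As a loose illustration of the two noise regimes contrasted above (not a construction from the paper), the sketch below discretizes a one-dimensional continuous-time SGD model with Euler–Maruyama and compares homogeneous noise with a loss-dependent noise amplitude. The objective function, the square-root-of-loss scaling used for the 'machine learning' noise, and all parameter values are illustrative assumptions.

```python
# Minimal sketch: Euler--Maruyama simulation of dX_t = -f'(X_t) dt + amp(X_t) dW_t
# with either a constant (homogeneous) or a loss-dependent noise amplitude.
# The objective and the sqrt(f) scaling are illustrative assumptions, not the
# paper's construction.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Two minima with value 0: a sharp one near x = -1 and a flatter one near
    # x = +1, plus a weak confining term; the exact form is an arbitrary choice.
    return (1.0 - np.exp(-10.0 * (x + 1.0) ** 2)) * (1.0 - np.exp(-(x - 1.0) ** 2)) \
           + 0.01 * (x ** 2 - 1.0) ** 2

def grad_f(x, h=1e-5):
    # Central finite differences keep the sketch self-contained.
    return (f(x + h) - f(x - h)) / (2.0 * h)

def simulate(noise, x0, eta=1e-3, sigma=0.6, steps=50_000):
    """Euler--Maruyama discretization, vectorized over independent runs."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Homogeneous noise has a constant amplitude; the 'ml-type' variant
        # scales the amplitude with sqrt(f), so it vanishes at global minima.
        amp = sigma if noise == "homogeneous" else sigma * np.sqrt(f(x))
        x = x - eta * grad_f(x) + np.sqrt(eta) * amp * rng.standard_normal(x.shape)
    return x

x0 = rng.uniform(-2.0, 2.0, size=200)
for noise in ("homogeneous", "ml-type"):
    xf = simulate(noise, x0)
    frac_flat = np.mean(np.abs(xf - 1.0) < 0.5)
    print(f"{noise:12s}: fraction of runs ending near the flatter minimum (x ~ +1): {frac_flat:.2f}")
```

The script only reports where the trajectories end up under the two noise models; it is meant to make the modelling setup concrete, not to reproduce the paper's minimum-selection result.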