This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, among them Adagrad and Rmsprop, which lie at the core of most black-box deep learning methods. We adopt a non-convex landscape optimization point of view, consider a single time-scale parametrization, and cover the situation where these algorithms may or may not be used with mini-batches. Taking the stochastic-algorithms viewpoint, we establish the almost sure convergence of these methods, when run with a decreasing step-size, towards the set of critical points of the target function. Under a mild additional assumption on the noise, we also obtain convergence towards the set of minimizers of the function. Along the way, we derive a "convergence rate" for these methods, in the vein of the work of \cite{GhadimiLan}.
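For reference, a minimal sketch of the textbook update rules for the two algorithms named above (not necessarily the exact parametrization analyzed in the paper): given a stochastic gradient estimate $g_{n+1}$ of the target function at the current iterate $\theta_n$, both methods rescale the step componentwise by an accumulated second-moment term $v_n$,
\begin{align*}
v_{n+1} &= v_n + g_{n+1}^{2} \quad \text{(Adagrad)}
\qquad \text{or} \qquad
v_{n+1} = \alpha\, v_n + (1-\alpha)\, g_{n+1}^{2} \quad \text{(Rmsprop)},\\
\theta_{n+1} &= \theta_n - \frac{\gamma_{n+1}}{\sqrt{\varepsilon + v_{n+1}}}\, g_{n+1},
\end{align*}
where all operations are componentwise, $\alpha \in (0,1)$, $\varepsilon > 0$, and $(\gamma_n)_{n \ge 1}$ is the step-size sequence, taken decreasing in the setting considered here.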