Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work provides a series of theoretical analyses of their statistical properties, supported by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the \textit{update} is an increasing and bounded function of time and does not diverge. These results suggest that, contrary to what is believed in the current literature, divergence of the variance is not the reason the Adam optimizer requires a warm-up stage.
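As an informal illustration of the claim (not part of the paper's formal analysis), the following Python sketch simulates Adam-style updates on i.i.d. standard normal gradients across many independent runs and tracks the empirical variance of the update magnitude over time. All parameter choices (\texttt{beta1}, \texttt{beta2}, learning rate, number of runs and steps) are illustrative assumptions.

\begin{verbatim}
import numpy as np

# Assumed simulation setup: Adam moment estimates driven by i.i.d.
# N(0, 1) gradients; we measure the variance of |update| across
# independent runs at each step t.
rng = np.random.default_rng(0)
runs, steps = 2000, 500
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

m = np.zeros(runs)
v = np.zeros(runs)
var_of_update_magnitude = []

for t in range(1, steps + 1):
    g = rng.normal(0.0, 1.0, size=runs)           # gradient sample per run
    m = beta1 * m + (1 - beta1) * g               # first moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2          # second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam update
    var_of_update_magnitude.append(np.abs(update).var())

# In this simulation the variance of |update| grows with t while
# remaining bounded, consistent with the behavior described above.
print(var_of_update_magnitude[::100])
\end{verbatim}

At $t = 1$ the bias-corrected ratio equals $g/|g| = \pm 1$, so the update magnitude is constant and its variance is zero; as $t$ grows the ratio acquires spread, and under this Gaussian-gradient assumption the empirical variance increases toward a finite limit rather than diverging.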