Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according to a per-component unweighted mean of all historical gradients and computes a bias-corrected momentum term jointly between the numerator and denominator. We prove that Expectigrad cannot diverge on any instance of the optimization problem known to cause Adam to diverge. We also establish a regret bound in the general stochastic nonconvex setting that suggests Expectigrad is less susceptible to gradient variance than existing methods are. Testing Expectigrad on several high-dimensional machine learning tasks, we find that it often compares favorably with state-of-the-art methods while requiring little hyperparameter tuning.
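To make the update rule described above concrete, the following is a minimal NumPy sketch of an Expectigrad-style step. It assumes (as the abstract alone does not fully specify) that the stepsize denominator tracks the unweighted arithmetic mean of squared gradients rather than an EMA, and that a bias-corrected momentum term with decay rate `beta` is applied jointly to the normalized update; the class name, hyperparameter defaults, and implementation details are illustrative, not the paper's exact algorithm.

```python
import numpy as np

class ExpectigradSketch:
    """Illustrative Expectigrad-style optimizer (a sketch, not the paper's exact method).

    Assumptions not confirmed by the abstract alone:
      - the denominator uses a running arithmetic mean of squared gradients,
      - bias-corrected momentum is applied to the whole normalized step,
        i.e. jointly to the numerator and denominator.
    """

    def __init__(self, lr=1e-3, beta=0.9, eps=1e-8):
        self.lr, self.beta, self.eps = lr, beta, eps
        self.sum_sq = None    # running sum of squared gradients
        self.count = 0        # number of gradients averaged so far
        self.momentum = None  # EMA of the normalized update
        self.t = 0            # step counter for bias correction

    def step(self, params, grad):
        if self.sum_sq is None:
            self.sum_sq = np.zeros_like(params)
            self.momentum = np.zeros_like(params)
        self.t += 1
        self.count += 1
        self.sum_sq += grad ** 2
        mean_sq = self.sum_sq / self.count                  # unweighted per-component mean
        normalized = grad / (self.eps + np.sqrt(mean_sq))   # EMA-free stepsize normalization
        self.momentum = self.beta * self.momentum + (1.0 - self.beta) * normalized
        m_hat = self.momentum / (1.0 - self.beta ** self.t) # bias-corrected momentum
        return params - self.lr * m_hat


# Usage example: minimize f(x) = ||x||^2 with exact gradients 2x.
opt = ExpectigradSketch(lr=0.1)
x = np.array([3.0, -2.0])
for _ in range(200):
    x = opt.step(x, 2.0 * x)
print(x)  # approaches the origin
```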