Injecting artificial noise into gradient descent (GD) is commonly employed to improve the performance of machine learning models. Usually, uncorrelated noise is used in such perturbed gradient descent (PGD) methods. It is, however, not known if this is optimal or whether other types of noise could provide better generalization performance. In this paper, we zoom in on the problem of correlating the perturbations of consecutive PGD steps. We consider a variety of objective functions for which we find that GD with anticorrelated perturbations ("Anti-PGD") generalizes significantly better than GD and standard (uncorrelated) PGD. To support these experimental findings, we also derive a theoretical analysis that demonstrates that Anti-PGD moves to wider minima, while GD and PGD remain stuck in suboptimal regions or even diverge. This new connection between anticorrelated noise and generalization opens the field to novel ways to exploit noise for training machine learning models.
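To make the comparison concrete, the following is a minimal sketch (not the authors' code) of the three update rules mentioned above on a toy quadratic objective. It assumes the usual way of anticorrelating consecutive perturbations, namely injecting the noise increment ξ_{k+1} − ξ_k in place of i.i.d. noise ξ_{k+1}; the function names (`loss_grad`, `train`) and the toy loss are illustrative choices, not part of the paper.

```python
# Sketch of GD, PGD (i.i.d. noise), and Anti-PGD (anticorrelated noise)
# on a toy quadratic loss 0.5 * ||x||^2. Illustrative only.

import numpy as np

def loss_grad(x):
    """Gradient of the toy quadratic loss 0.5 * ||x||^2."""
    return x

def train(rule, steps=1000, lr=0.1, sigma=0.1, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    xi_prev = np.zeros(dim)          # previous perturbation, used by Anti-PGD
    for _ in range(steps):
        xi = sigma * rng.standard_normal(dim)
        if rule == "gd":             # plain gradient descent, no noise
            noise = 0.0
        elif rule == "pgd":          # uncorrelated (i.i.d.) perturbations
            noise = xi
        elif rule == "anti-pgd":     # anticorrelated perturbations: the noise
            noise = xi - xi_prev     # injected in consecutive steps is negatively correlated
        else:
            raise ValueError(rule)
        x = x - lr * loss_grad(x) + noise
        xi_prev = xi
    return x

for rule in ("gd", "pgd", "anti-pgd"):
    print(rule, np.linalg.norm(train(rule)))
```

On this convex toy problem all three rules behave similarly; the abstract's claims about wider minima and better generalization concern nonconvex objectives studied in the paper.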