We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surfaces on the high-dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results than stochastic gradient descent (SGD), require less tuning, and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.
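To make the setting concrete, the following is a minimal sketch of iterate averaging on a noisy quadratic. It is not the algorithms proposed in this work: the function name `sgd_with_iterate_averaging`, all parameter values, and the use of independent Gaussian gradient noise (a simplification standing in for the Gaussian process perturbation model) are assumptions made purely for illustration. It runs plain SGD with a large, constant learning rate and averages the iterates only every few steps.

```python
import random


def quadratic_risk(curvatures, weights):
    """True risk 0.5 * sum_i h_i * w_i^2, minimised at the origin."""
    return 0.5 * sum(h * w * w for h, w in zip(curvatures, weights))


def sgd_with_iterate_averaging(dim=50, steps=2000, lr=0.5, noise_std=1.0,
                               avg_start=1000, avg_every=10, seed=0):
    """Run noisy SGD on a quadratic and keep a running average of the iterates.

    All defaults are hypothetical values chosen for illustration only.
    Returns (risk of final iterate, risk of averaged iterate).
    """
    rng = random.Random(seed)
    # Curvatures spread evenly over [0.1, 1.0] (an arbitrary illustrative spectrum).
    curvatures = [0.1 + 0.9 * i / (dim - 1) for i in range(dim)]
    weights = [1.0] * dim
    average = [0.0] * dim
    n_averaged = 0

    for step in range(steps):
        # Batch gradient = true gradient h_i * w_i plus a Gaussian perturbation.
        weights = [w - lr * (h * w + rng.gauss(0.0, noise_std))
                   for w, h in zip(weights, curvatures)]
        # Average infrequently: only every `avg_every` steps after `avg_start`.
        if step >= avg_start and (step - avg_start) % avg_every == 0:
            n_averaged += 1
            average = [a + (w - a) / n_averaged
                       for a, w in zip(average, weights)]

    return quadratic_risk(curvatures, weights), quadratic_risk(curvatures, average)


if __name__ == "__main__":
    final_risk, averaged_risk = sgd_with_iterate_averaging()
    print(f"final iterate risk:    {final_risk:.4f}")
    print(f"averaged iterate risk: {averaged_risk:.4f}")
```

In this toy setting the final iterate keeps fluctuating around the minimum at a scale set by the learning rate, while the averaged iterate cancels much of that fluctuation. Samples taken a few steps apart are already only weakly correlated along the high-curvature directions, which gives some intuition for why less frequent averaging can suffice.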