We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian Process perturbation model between the true and batch risk surfaces on a high-dimensional quadratic. We derive three phenomena from our theoretical results: (1) the importance of combining iterate averaging with large learning rates and regularisation for improved regularisation; (2) a justification for less frequent averaging; and (3) the expectation that adaptive gradient methods work equally well or better with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results than SGD, require less tuning, and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets across a variety of modern and classical network architectures.
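To make the idea of iterate averaging concrete, the following is a minimal sketch and not the algorithms proposed in the paper. It runs plain SGD with a constant learning rate on a diagonal quadratic and keeps a running mean of the later iterates. Several things here are assumptions made purely for illustration: the function name `sgd_with_iterate_averaging`, all parameter values, the diagonal Hessian spectrum, and the use of i.i.d. Gaussian gradient noise as a simplified stand-in for the Gaussian Process perturbation between the true and batch risk surfaces.

```python
import numpy as np


def sgd_with_iterate_averaging(dim=100, steps=2000, avg_start=1000,
                               lr=0.1, noise_std=1.0, seed=0):
    """Noisy SGD on the quadratic 0.5 * sum(h * w**2), with tail averaging.

    Returns the true (noise-free) loss of the final iterate and of the
    averaged iterate. All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    h = np.linspace(0.1, 1.0, dim)   # assumed diagonal Hessian eigenvalues
    w = np.ones(dim)                 # starting point
    w_avg = np.zeros(dim)            # running mean of the averaged iterates
    n_avg = 0

    for t in range(steps):
        # Batch gradient = true gradient + i.i.d. Gaussian noise (assumption).
        grad = h * w + noise_std * rng.standard_normal(dim)
        w = w - lr * grad
        if t >= avg_start:
            # Incremental mean over the iterates seen since avg_start.
            n_avg += 1
            w_avg += (w - w_avg) / n_avg

    def true_loss(v):
        return 0.5 * float(np.sum(h * v ** 2))

    return true_loss(w), true_loss(w_avg)


if __name__ == "__main__":
    final_loss, avg_loss = sgd_with_iterate_averaging()
    print(f"final iterate loss:    {final_loss:.4f}")
    print(f"averaged iterate loss: {avg_loss:.4f}")
```

In this toy setting the final iterate keeps fluctuating around the minimum because the learning rate is constant, while the averaged iterate cancels much of that noise, so its true loss is typically much lower. The sketch says nothing about the adaptive variants or the regularisation effects discussed in the abstract.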