Tail averaging improves on Polyak averaging's non-asymptotic behaviour by excluding a number of leading iterates of stochastic optimization from its calculations. In practice, with a finite number of optimization steps and a learning rate that cannot be annealed to zero, tail averaging can get much closer to a local minimum point of the training loss than either the individual iterates or the Polyak average. However, the number of leading iterates to ignore is an important hyperparameter, and starting averaging too early or too late leads to inefficient use of resources or suboptimal solutions. Our work focusses on improving generalization, which makes setting this hyperparameter even more difficult, especially in the presence of other hyperparameters and overfitting. Furthermore, before averaging starts, the loss is only weakly informative of the final performance, which makes early stopping unreliable. To alleviate these problems, we propose an anytime variant of tail averaging intended to improve generalization rather than pure optimization; it has no hyperparameters and approximates the optimal tail at all optimization steps. Our algorithm is based on two running averages with adaptive lengths bounded in terms of the optimal tail length, one of which achieves approximate optimality with some regularity. Requiring only additional storage for two sets of weights and periodic evaluation of the loss, the proposed two-tailed averaging algorithm is a practical and widely applicable method for improving generalization.
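To make the idea concrete, here is a minimal sketch of a two-running-average scheme on a 1-D parameter. The class name, method names, and the exact switching rule (adopt the shorter average whenever it evaluates at least as well, then restart it) are illustrative assumptions, not the paper's precise algorithm; they only demonstrate the mechanism of two tails with adaptive lengths and periodic loss evaluation.

```python
# Hedged sketch of two-tailed averaging on a single scalar weight.
# All names and the reset rule below are illustrative assumptions.

class TwoTailedAverage:
    """Maintain two running tail averages of optimization iterates.

    `long_avg` is the average reported to the user; `short_avg` is a
    candidate tail started later. When the short average evaluates at
    least as well as the long one, it replaces the long average and is
    itself restarted from scratch.
    """

    def __init__(self):
        self.long_avg, self.long_n = 0.0, 0
        self.short_avg, self.short_n = 0.0, 0

    def update(self, w):
        # Incorporate the latest iterate into both running averages.
        self.long_n += 1
        self.long_avg += (w - self.long_avg) / self.long_n
        self.short_n += 1
        self.short_avg += (w - self.short_avg) / self.short_n

    def maybe_switch(self, loss_fn):
        # Called periodically: if the shorter tail is at least as good,
        # adopt it and restart the short average (illustrative rule).
        if self.short_n and loss_fn(self.short_avg) <= loss_fn(self.long_avg):
            self.long_avg, self.long_n = self.short_avg, self.short_n
            self.short_avg, self.short_n = 0.0, 0

    def best(self):
        return self.long_avg


# Toy usage: a transient followed by noisy iterates around the optimum
# of loss(w) = (w - 1)^2. The early transient contaminates a plain
# Polyak average; the two-tailed scheme sheds it automatically.
loss = lambda w: (w - 1.0) ** 2
tta = TwoTailedAverage()
iterates = [10.0, 8.0, 6.0, 4.0, 2.0] + [1.0 + 0.1 * (-1) ** i for i in range(20)]
for t, w in enumerate(iterates, 1):
    tta.update(w)
    if t % 5 == 0:  # periodic evaluation of the loss
        tta.maybe_switch(loss)
```

In this toy run the two-tailed average lands near the optimum at 1.0, while the plain average over all iterates is pulled toward the large early transient.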