The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.
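To make the mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of how late-phase weights could be realized in PyTorch: a small ensemble of K low-dimensional multiplicative parameters is trained alongside shared base weights during the late phase, and averaged in weight space at the end to recover a single model. All names here (LatePhaseLinear, n_late, gains) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatePhaseLinear(nn.Module):
    """Linear layer whose output is scaled by one of K low-dimensional
    multiplicative late-phase gain vectors (an assumed parameterization)."""
    def __init__(self, d_in, d_out, n_late=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)           # shared base weights
        # K per-unit gains: low-dimensional late-phase weights that interact
        # multiplicatively with the remaining (base) parameters.
        self.gains = nn.Parameter(torch.ones(n_late, d_out))

    def forward(self, x, k):
        # In the late phase of training, each minibatch uses one ensemble member k.
        return self.base(x) * self.gains[k]

    def average_late_phase(self):
        # At the end of learning: collapse the ensemble by a spatial average
        # in weight space, leaving a single model.
        with torch.no_grad():
            mean_gain = self.gains.mean(dim=0, keepdim=True)
            self.gains.copy_(mean_gain.expand_as(self.gains))

# Usage sketch: sample a random member per step in the late phase, then average.
layer = LatePhaseLinear(8, 4, n_late=4)
x = torch.randn(2, 8)
k = torch.randint(0, 4, ()).item()    # pick an ensemble member
y = layer(x, k)                       # forward pass through member k
layer.average_late_phase()            # single averaged model at the end
```

Because only the small gain vectors are replicated K times while the base weights stay shared, this kind of parameterization keeps the added memory and compute cost low, which is the motivation given in the abstract for using low-dimensional late-phase weight models.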