It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks. However, recent papers have questioned this claim, arguing that this effect is simply a consequence of suboptimal hyperparameter tuning or insufficient compute budgets when the batch size is large. In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses. Our results confirm that the noise in stochastic gradients can enhance generalization. We study how the optimal learning rate schedule changes as the epoch budget grows, and we provide a theoretical account of our observations based on the stochastic differential equation perspective of SGD dynamics.
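For readers unfamiliar with the stochastic differential equation (SDE) perspective mentioned above, the following is a minimal sketch of a common formulation from the literature; the paper's own derivation may differ in its details. With learning rate $\epsilon$, batch size $B$, full-batch loss $L(\theta)$, and per-example gradient covariance $\Sigma(\theta)$, one SGD step is modeled as

\[
\theta_{k+1} = \theta_k - \epsilon \nabla L(\theta_k) + \epsilon\, \xi_k, \qquad \mathrm{Cov}(\xi_k) \approx \tfrac{1}{B}\,\Sigma(\theta_k),
\]

which, in the small learning rate limit, corresponds to the SDE

\[
d\theta = -\nabla L(\theta)\, dt + \sqrt{\tfrac{\epsilon}{B}}\, R(\theta)\, dW_t, \qquad R(\theta) R(\theta)^\top = \Sigma(\theta).
\]

Under this approximation the strength of the gradient noise scales with the ratio $\epsilon / B$, which is why, at a fixed learning rate, small or moderately large batches inject more noise than very large batches and can therefore act as an implicit regularizer.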