Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its ability to converge to a global minimum remains shrouded in mystery. In this paper we propose to study the behavior of the loss function on fixed mini-batches along SGD trajectories. We show that the loss function on a fixed batch appears to be remarkably convex-like. In particular, for ResNet the loss for any fixed mini-batch can be accurately modeled by a quadratic function, and a very low loss value can be reached in just one step of gradient descent with a sufficiently large learning rate. We propose a simple model that allows us to analyze the relationship between the gradients of stochastic mini-batches and the full batch. Our analysis allows us to discover an equivalence between iterate averaging and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA) we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points that are many steps apart, significantly improves accuracy compared to the baseline. We validate our findings on ImageNet and other datasets using the ResNet architecture.
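As a concrete illustration of the averaging schemes mentioned above, the following is a minimal PyTorch sketch of (i) a standard EMA update of model weights and (ii) the simpler two-point scheme that averages two checkpoints taken many steps apart. The helper names, the decay value, and the mixing coefficient alpha are illustrative assumptions and are not taken from the paper.

```python
import copy
import torch


def ema_update(ema_model, model, decay=0.999):
    """One standard EMA step over the parameters of `model` into `ema_model`.

    `decay` is an assumed typical value; buffers (e.g. BatchNorm statistics)
    are not handled in this sketch.
    """
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1.0 - decay)


def average_two_checkpoints(model_a, model_b, alpha=0.5):
    """Average the parameters of two checkpoints taken many SGD steps apart.

    Hypothetical helper illustrating the two-point averaging discussed in the
    abstract; equal weighting (alpha=0.5) is an assumption.
    """
    averaged = copy.deepcopy(model_a)
    with torch.no_grad():
        for p_avg, p_a, p_b in zip(averaged.parameters(),
                                   model_a.parameters(),
                                   model_b.parameters()):
            p_avg.copy_(alpha * p_a + (1.0 - alpha) * p_b)
    return averaged
```

In practice one would save a checkpoint partway through training and another at the end, then evaluate the model returned by `average_two_checkpoints`; this is only a sketch of the idea, not the paper's exact procedure.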