Stochastic gradient descent (SGD) plays a fundamental role in nearly all applications of deep learning. However, its efficiency and remarkable ability to converge to a global minimum remain shrouded in mystery. The loss function defined on a large network with a large amount of data is known to be non-convex. However, relatively little has been explored about the behavior of the loss function on individual batches. Remarkably, we show that for ResNet the loss for any fixed mini-batch, when measured along the SGD trajectory, appears to be accurately modeled by a quadratic function. In particular, a very low loss value can be reached in just one step of gradient descent with a large enough learning rate. We propose a simple model and a geometric interpretation that allow us to analyze the relationship between the gradients of stochastic mini-batches and the full batch, and how the learning rate affects the trade-off between improvement on an individual mini-batch and on the full batch. Our analysis allows us to discover an equivalence between iterate averaging and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging (SWA) we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points a few steps apart, also significantly improves accuracy compared to the baseline. We validate our findings on ImageNet and other datasets using the ResNet architecture.
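As a concrete illustration (not the paper's exact procedure), the sketch below shows a hypothetical PyTorch training loop that maintains both an EMA of the weights and the simpler two-point average of iterates a few steps apart mentioned above; the names `model`, `loader`, `loss_fn`, `ema_decay`, and `gap` are placeholders, and BatchNorm buffers are left untouched for brevity.

```python
import copy
import torch

def train_with_averaging(model, loader, loss_fn, lr=0.1, ema_decay=0.99, gap=5):
    """Hypothetical sketch: plain SGD plus two iterate-averaging schemes."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ema_model = copy.deepcopy(model)   # running EMA of the weights
    snapshot = None                    # iterate saved `gap` steps ago

    for step, (x, y) in enumerate(loader):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

        # EMA update: theta_ema <- decay * theta_ema + (1 - decay) * theta
        with torch.no_grad():
            for p_ema, p in zip(ema_model.parameters(), model.parameters()):
                p_ema.mul_(ema_decay).add_(p, alpha=1 - ema_decay)

        # Remember an iterate every `gap` steps for the two-point average.
        if step % gap == 0:
            snapshot = copy.deepcopy(model)

    # Two-point average: midpoint of the final iterate and the last snapshot.
    two_point_model = copy.deepcopy(model)
    if snapshot is not None:
        with torch.no_grad():
            for p_avg, p_old in zip(two_point_model.parameters(),
                                    snapshot.parameters()):
                p_avg.add_(p_old).mul_(0.5)

    return ema_model, two_point_model
```

Either averaged model would then be evaluated on held-out data in place of the raw final iterate; the two-point variant only requires storing one extra copy of the weights.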