In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, curvature is estimated from a local quadratic model, using only noisy gradient approximations. This yields a new stochastic first-order method (Step-Tuned SGD), which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.
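To make the underlying idea concrete, here is a minimal sketch of a stochastic Barzilai-Borwein-style step-size update on a toy least-squares problem. This is an illustration of the general principle, not the authors' exact Step-Tuned SGD schedule; the helper `minibatch_grad`, the batch size, and the safeguard bounds on the step-size are all assumptions made for the example. The key point it demonstrates is that the gradient is re-evaluated at the new iterate on the *same* mini-batch, so the gradient difference estimates local curvature rather than sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: f(x) = 0.5/n * ||A x - b||^2, minimized with mini-batch gradients.
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def minibatch_grad(x, idx):
    """Noisy gradient estimate: gradient of the loss restricted to rows `idx`."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)

x = np.zeros(d)
gamma = 1e-2                          # initial step-size
for _ in range(30 * (n // 32)):       # roughly 30 epochs of batch-size-32 steps
    idx = rng.integers(0, n, size=32)
    g = minibatch_grad(x, idx)
    x_new = x - gamma * g
    # Re-evaluate the gradient at the new point on the SAME mini-batch:
    # the difference then reflects curvature, not batch-to-batch noise.
    y = minibatch_grad(x_new, idx) - g
    s = x_new - x
    sy, yy = s @ y, y @ y
    if sy > 0 and yy > 0:             # accept the BB-2 ratio only when well defined
        gamma = float(np.clip(sy / yy, 1e-4, 0.5))  # safeguard bounds (illustrative)
    x = x_new

print("final loss:", 0.5 * np.mean((A @ x - b) ** 2))
```

The safeguard clipping stands in for the more careful step-size control a practical method would need: raw Barzilai-Borwein ratios computed from noisy gradients can be arbitrarily large or negative, so some form of bounding is required for the almost sure convergence guarantees mentioned above.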