We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations. Our algorithm uses Hessian-vector products to "correct" a bias term in the momentum of SGD with momentum. This leads to better gradient estimates in a manner analogous to variance-reduction methods. In contrast to prior work, we do not require excessively large batch sizes (indeed, we place no restrictions on the batch size at all), and both our algorithm and its analysis are much simpler. We validate our results on a variety of large-scale deep learning benchmarks and architectures, where we see improvements over SGD and Adam.
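To make the mechanism concrete, below is a minimal PyTorch sketch of one natural form of such a Hessian-corrected momentum update; it is an illustration under stated assumptions, not the paper's exact algorithm. The function name hessian_corrected_momentum, the momentum coefficient a, the step size lr, and the toy objective are all illustrative. The intuition is that $\nabla f(x_t) \approx \nabla f(x_{t-1}) + \nabla^2 f(x_t)(x_t - x_{t-1})$, so the Hessian-vector product "transports" the stale momentum estimate from the previous iterate to the current one, correcting the bias that plain momentum accumulates.

import torch

def hessian_corrected_momentum(loss_fn, x0, steps=100, lr=0.1, a=0.1):
    """SGD with momentum, where a Hessian-vector product 'transports' the
    stale momentum estimate to the current iterate (illustrative sketch):
        m_t     = (1 - a) * (m_{t-1} + H(x_t) @ (x_t - x_{t-1})) + a * g_t
        x_{t+1} = x_t - lr * m_t
    """
    x = x0.clone().requires_grad_(True)
    m, x_prev = None, None
    for _ in range(steps):
        loss = loss_fn(x)
        # create_graph=True keeps the gradient differentiable, so a second
        # backward pass can produce the Hessian-vector product.
        (g,) = torch.autograd.grad(loss, x, create_graph=True)
        if m is None:
            m = g.detach()  # plain stochastic gradient on the first step
        else:
            v = (x - x_prev).detach()
            # H(x) @ v = d/dx <g(x), v>  (double backprop)
            (hv,) = torch.autograd.grad((g * v).sum(), x)
            # transport the old estimate, then average in the new gradient
            m = (1 - a) * (m + hv) + a * g.detach()
        x_prev = x.detach().clone()
        with torch.no_grad():
            x -= lr * m  # descent step along the corrected momentum
    return x.detach()

# toy usage on a small non-convex objective (illustrative)
f = lambda z: (z ** 2).sum() + torch.sin(3 * z).sum()
x_final = hessian_corrected_momentum(f, torch.randn(10))

Note the contrast with classical variance-reduction methods: those use a second gradient evaluation at the previous iterate on the same sample, whereas here a single Hessian-vector product per step plays that role, which is why no large (or indeed any particular) batch size is needed.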