We introduce SketchySGD, a stochastic quasi-Newton method that uses sketching to approximate the curvature of the loss function. Quasi-Newton methods are among the most effective algorithms in traditional optimization, where they converge much faster than first-order methods such as SGD. However, for contemporary deep learning, quasi-Newton methods are considered inferior to first-order methods like SGD and Adam owing to their higher per-iteration complexity and fragility in the presence of inexact gradients. SketchySGD circumvents these issues through a novel combination of subsampling, randomized low-rank approximation, and dynamic regularization. In the convex case, we show that SketchySGD with a fixed stepsize converges to a small ball around the optimum at a faster rate than SGD for ill-conditioned problems. In the non-convex case, SketchySGD converges linearly under two additional assumptions, interpolation and the Polyak-Łojasiewicz condition, the latter of which holds with high probability for wide neural networks. Numerical experiments on image and tabular data demonstrate the improved reliability and speed of SketchySGD for deep learning, compared to standard optimizers such as SGD and Adam and to existing quasi-Newton methods.
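The abstract describes the method only at a high level. The following is a minimal, illustrative Python sketch of one sketch-and-precondition SGD step in the spirit of the ingredients listed above (subsampled Hessian-vector products, a randomized Nyström low-rank approximation, and a regularized preconditioner). It is written for a toy least-squares problem; all names (`nystrom_approx`, `preconditioned_step`), the choice of `rho`, and the ranks and batch sizes are hypothetical illustrations, not the paper's exact algorithm or hyperparameters.

```python
# Illustrative sketch only: subsampled HVPs + randomized Nystrom low-rank
# approximation + regularized preconditioning on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, ill-conditioned least-squares problem: min_w 0.5/n * ||A w - b||^2.
n, d, rank = 2000, 100, 10
A = rng.standard_normal((n, d)) * np.logspace(0, -3, d)
w_true = rng.standard_normal(d)
b = A @ w_true + 0.01 * rng.standard_normal(n)

def grad(w, idx):
    """Stochastic gradient on the minibatch of rows `idx`."""
    Ai = A[idx]
    return Ai.T @ (Ai @ w - b[idx]) / len(idx)

def hvp(V, idx):
    """Minibatch Hessian-vector products H_S @ V, with H_S = A_S^T A_S / |S|."""
    Ai = A[idx]
    return Ai.T @ (Ai @ V) / len(idx)

def nystrom_approx(idx, rank):
    """Randomized Nystrom approximation of the subsampled Hessian.

    Returns U (d x rank, orthonormal) and eigenvalue estimates `lam` so that
    H_S is approximated by U @ diag(lam) @ U.T.  Only HVPs are required.
    """
    Omega, _ = np.linalg.qr(rng.standard_normal((d, rank)))  # test matrix
    Y = hvp(Omega, idx)                                      # the sketch
    nu = 1e-8 * np.linalg.norm(Y)                            # stability shift
    Y_nu = Y + nu * Omega
    M = Omega.T @ Y_nu
    C = np.linalg.cholesky(0.5 * (M + M.T))
    B = np.linalg.solve(C, Y_nu.T).T                         # B = Y_nu C^{-T}
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    lam = np.maximum(s**2 - nu, 0.0)
    return U, lam

def preconditioned_step(w, lr, rho, batch, hess_batch, rank):
    """One SGD step preconditioned by (U diag(lam) U^T + rho * I)^{-1}."""
    g = grad(w, batch)
    U, lam = nystrom_approx(hess_batch, rank)
    g_par = U.T @ g
    # Low-rank part gets (lam + rho)^{-1}; the orthogonal complement is
    # scaled by 1 / (smallest retained eigenvalue + rho).
    v = U @ (g_par / (lam + rho)) + (g - U @ g_par) / (lam[-1] + rho)
    return w - lr * v

w = np.zeros(d)
rho = 1e-3  # regularizer; in practice it would be tied to curvature estimates
for t in range(200):
    batch = rng.choice(n, size=64, replace=False)        # gradient minibatch
    hess_batch = rng.choice(n, size=256, replace=False)  # Hessian minibatch
    w = preconditioned_step(w, lr=0.5, rho=rho, batch=batch,
                            hess_batch=hess_batch, rank=rank)

print("final loss:", 0.5 * np.mean((A @ w - b) ** 2))
```

Because the curvature model is a rank-`rank` sketch plus a regularizer, each step costs only a handful of Hessian-vector products and O(d * rank) extra arithmetic, which is what keeps the per-iteration overhead close to first-order methods in this kind of scheme.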