In this paper, we introduce Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian with a diagonal matrix. Importantly, the update and storage of the diagonal Hessian approximation are as efficient as those of adaptive first-order optimization methods, with linear complexity in both time and memory. To handle nonconvexity, we replace the Hessian with its rectified absolute value, which is guaranteed to be positive definite. Experiments on three vision and language tasks show that Apollo achieves significant improvements over other stochastic optimization methods, including SGD and variants of Adam, in terms of both convergence speed and generalization performance. The implementation of the algorithm is available at https://github.com/XuezheMax/apollo.
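For concreteness, the sketch below illustrates the kind of update the abstract describes: a diagonal Hessian approximation maintained under a weak secant condition, rectified via its absolute value with a positive floor, and used as a preconditioner. This is a minimal sketch inferred from the abstract, not the paper's exact algorithm; the function name apollo_step, the rectification floor sigma, the state dictionary, and the toy quadratic are illustrative assumptions, and details such as bias correction and step-size coupling follow the official implementation only loosely.

```python
import numpy as np

def apollo_step(x, grad, state, lr=0.1, beta=0.9, sigma=1.0):
    """One diagonal quasi-Newton step in the spirit of Apollo.

    A simplified sketch based on the abstract; the rectification floor
    `sigma` and other details differ from the official implementation
    at https://github.com/XuezheMax/apollo.
    """
    t = state["t"] + 1

    # Bias-corrected exponential moving average of the stochastic gradient.
    m_raw = beta * state["m_raw"] + (1.0 - beta) * grad
    m = m_raw / (1.0 - beta ** t)

    # Weak secant condition, restricted to a diagonal matrix: correct B so
    # that B * dx better matches the observed gradient change y.
    B, dx = state["B"], state["dx"]
    y = m - state["m"]
    coeff = (np.dot(dx, B * dx) - np.dot(dx, y)) / (np.sum(dx ** 4) + 1e-16)
    B = B - coeff * dx ** 2

    # Rectified absolute value of the Hessian approximation: flooring |B|
    # at sigma makes the preconditioner positive definite, so the step
    # remains a descent direction even on nonconvex losses.
    D = np.maximum(np.abs(B), sigma)

    # Diagonal quasi-Newton update.
    dx = -lr * m / D
    state.update(t=t, m_raw=m_raw, m=m, B=B, dx=dx)
    return x + dx


# Toy usage: minimize the separable quadratic f(x) = 0.5 * sum(c * x**2),
# whose exact Hessian diagonal is c.
c = np.array([1.0, 10.0])
x = np.array([5.0, 5.0])
state = {"t": 0, "m_raw": np.zeros(2), "m": np.zeros(2),
         "B": np.zeros(2), "dx": np.zeros(2)}
for _ in range(500):
    x = apollo_step(x, c * x, state)
print(x)  # approaches the minimizer at the origin
```

Because the Hessian approximation is restricted to a diagonal, each step costs linear time and memory in the number of parameters, which is the efficiency the abstract attributes to the method relative to adaptive first-order optimizers.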