We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero as the iterations proceed, which would lead to a computational blow-up in the per-iteration cost. We address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme enjoys local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the iteration, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still enjoys a superlinear convergence~rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
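To make the averaging idea concrete, the following is a minimal sketch of a stochastic Newton iteration with (weighted) Hessian averaging. It is an illustration of the general scheme, not the paper's exact algorithm: the names `grad`, `hess_oracle`, and `weight` are hypothetical placeholders, and the particular weight rules shown are illustrative assumptions (uniform weights correspond to the uniform averaging scheme; polynomially growing weights merely exemplify assigning more mass to recent Hessians).

```python
import numpy as np

def averaged_newton(x0, grad, hess_oracle, steps=100, weight=lambda t: 1.0):
    """Stochastic Newton iteration with (weighted) Hessian averaging.

    grad(x)        -- gradient of the objective at x
    hess_oracle(x) -- stochastic estimate of the Hessian at x
    weight(t)      -- weight w_t for the Hessian drawn at step t;
                      weight(t) = 1 recovers uniform averaging.
    """
    x = x0.astype(float).copy()
    H_bar = np.zeros((x.size, x.size))  # running weighted average of Hessian estimates
    w_sum = 0.0
    for t in range(1, steps + 1):
        w = weight(t)
        w_sum += w
        # incremental update: H_bar <- H_bar + (w / w_sum) * (H_t - H_bar),
        # so H_bar is the weighted average of all estimates seen so far
        H_bar += (w / w_sum) * (hess_oracle(x) - H_bar)
        # Newton step using the averaged Hessian instead of the latest noisy one
        x -= np.linalg.solve(H_bar, grad(x))
    return x

# Toy usage on a strongly convex quadratic with an artificially noisy Hessian oracle
rng = np.random.default_rng(0)
A = np.diag([1.0, 4.0, 9.0])                       # SPD Hessian of f(x) = 0.5 x^T A x
grad = lambda x: A @ x

def hess_oracle(x):
    E = rng.standard_normal(A.shape)
    return A + 0.1 * (E + E.T) / 2                 # symmetric stochastic perturbation

x_uniform = averaged_newton(np.ones(3), grad, hess_oracle, steps=50)
x_weighted = averaged_newton(np.ones(3), grad, hess_oracle, steps=50,
                             weight=lambda t: t**2)  # favors recent Hessians
```

In this sketch, the averaged matrix `H_bar` concentrates around the true Hessian as more estimates accumulate, while each iteration still draws only a single stochastic estimate, so the per-iteration oracle cost stays constant.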