We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given oracle access to a stochastic estimate of the Hessian matrix. This oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear convergence unless the stochastic noise is gradually reduced to zero over the iterations, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log(t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates reach a local neighborhood of the solution. This leads to a distortion that may substantially delay superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly matching (up to a logarithmic factor) that of uniform Hessian averaging.
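For concreteness, here is a minimal sketch of the two averaging updates described above, in illustrative notation not fixed by the abstract: $\hat H_s$ denotes the oracle's stochastic Hessian estimate at iteration $s$, $\bar H_t$ the averaged estimate used in place of the latest one, and $w_t \in (0,1]$ a weight sequence (the specific weights of the universal scheme are not reproduced here):
\[
\bar H_t \;=\; \frac{1}{t+1}\sum_{s=0}^{t} \hat H_s \;=\; \Big(1-\tfrac{1}{t+1}\Big)\bar H_{t-1} + \tfrac{1}{t+1}\,\hat H_t
\qquad \text{(uniform averaging)},
\]
\[
\bar H_t \;=\; (1-w_t)\,\bar H_{t-1} + w_t\,\hat H_t
\qquad \text{(weighted averaging)},
\]
with the averaged estimate then driving a Newton-type step such as $x_{t+1} = x_t - \bar H_t^{-1}\nabla f(x_t)$. Uniform averaging is recovered as the special case $w_t = 1/(t+1)$; assigning larger weights to recent Hessians corresponds, in this recursion, to a sequence $w_t$ that decays more slowly than $1/(t+1)$.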