In this paper, we study the well-known "Heavy Ball" (HB) method for convex and nonconvex optimization, introduced by Polyak in 1964, and establish its convergence under a variety of situations. Traditionally, most algorithms use "full-coordinate updating," that is, at each step every component of the argument is updated. However, when the dimension of the argument is very high, it is more efficient to update some, but not all, components of the argument at each iteration; we refer to this as "batch updating" in this paper. When gradient-based algorithms are used together with batch updating, in principle it suffices to compute only those components of the gradient corresponding to the components of the argument that are to be updated. However, if a method such as back propagation is used to compute these components, computing only some components of the gradient offers little savings over computing the entire gradient. Therefore, to achieve a noticeable reduction in CPU usage at each step, one can instead use first-order differences to approximate the gradient. The resulting estimates are biased and also have unbounded variance. Thus some delicate analysis is required to ensure that the HB algorithm converges when batch updating is used in place of full-coordinate updating, and/or approximate gradients are used in place of true gradients. In this paper, we not only establish the almost sure convergence of the iterates to the stationary point(s) of the objective function, but also derive upper bounds on the rate of convergence. To the best of our knowledge, this is the first paper that combines all of these features.
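To make the setting concrete, the following minimal Python sketch illustrates the combination described above: a Heavy Ball iteration in which only a randomly chosen subset of coordinates is updated at each step, with the corresponding gradient components approximated by forward first-order differences. This is not the paper's algorithm or notation; all names (e.g., heavy_ball_batch, fd_increment) and the constant step size and momentum are illustrative assumptions.

```python
import numpy as np

def heavy_ball_batch(f, x0, step=1e-2, momentum=0.9, batch_size=2,
                     fd_increment=1e-4, n_iters=5000, rng=None):
    """Heavy Ball with batch (coordinate-subset) updating and
    forward-difference gradient approximations (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    x_prev = x.copy()
    for _ in range(n_iters):
        # choose the coordinates to update at this iteration
        coords = rng.choice(x.size, size=batch_size, replace=False)
        fx = f(x)
        x_new = x.copy()
        for i in coords:
            # first-order (forward) difference estimate of the i-th
            # gradient component; biased, variance grows as fd_increment -> 0
            e = np.zeros_like(x)
            e[i] = fd_increment
            g_i = (f(x + e) - fx) / fd_increment
            # Heavy Ball update restricted to the selected coordinate
            x_new[i] = x[i] - step * g_i + momentum * (x[i] - x_prev[i])
        x_prev, x = x, x_new
    return x

if __name__ == "__main__":
    # usage: minimize a simple quadratic in R^5; iterates should
    # move toward the origin, its unique stationary point
    f = lambda z: 0.5 * np.dot(z, z)
    print(heavy_ball_batch(f, x0=np.ones(5)))
```

In practice, a diminishing step-size sequence (rather than the constant step used here) is what the stochastic-approximation style convergence analysis relies on; the constant values above are only to keep the sketch short.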