Tracking the Median of Gradients with a Stochastic Proximal Point Method

Several applications of stochastic optimization benefit from a robust estimate of the gradient. Examples include distributed learning with corrupted nodes, training data with large outliers, learning under privacy constraints, and heavy-tailed noise arising from the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first derive iterative methods, based on the stochastic proximal point method, for computing the median gradient and generalizations thereof. We then propose an algorithm that estimates the median gradient across iterations, and find that several well-known methods are particular cases of this framework. For instance, we observe that different forms of clipping yield online estimators of the median of gradients, in contrast to (heavy-ball) momentum, which corresponds to an online estimator of the mean. Finally, we provide a theoretical framework for an algorithm that computes the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise.
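The contrast between clipping and momentum can be illustrated with a minimal sketch. It is not the authors' exact algorithm. It assumes the Euclidean norm, so that "median" means the geometric median, and it applies one proximal point step per sample to the loss ||m - g|| (which gives a clipped update) and to the loss 0.5 * ||m - g||^2 (which gives an exponential moving average). The function names, the 5% outlier model, and the 1/sqrt(t) step size are hypothetical choices made for illustration only.

```python
import math
import random


def clipped_median_step(m, g, tau):
    """Proximal point step on ||m - g||_2: move m toward g by at most tau."""
    diff = [gi - mi for gi, mi in zip(g, m)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= tau:
        return list(g)  # the sample is within reach, so the step lands on it
    scale = tau / dist  # otherwise the step is clipped to length tau
    return [mi + scale * d for mi, d in zip(m, diff)]


def momentum_mean_step(m, g, tau):
    """Proximal point step on 0.5 * ||m - g||_2^2: an exponential moving average."""
    beta = tau / (1.0 + tau)
    return [mi + beta * (gi - mi) for mi, gi in zip(m, g)]


def run_demo(seed=0, steps=5000):
    """Track a noisy, occasionally corrupted gradient with both estimators."""
    rng = random.Random(seed)
    true_grad = [1.0, -2.0]  # hypothetical "clean" gradient
    m_med = [0.0, 0.0]
    m_mean = [0.0, 0.0]
    for t in range(1, steps + 1):
        g = [x + rng.gauss(0.0, 0.1) for x in true_grad]
        if rng.random() < 0.05:  # assumed corruption model: 5% large outliers
            g = [x + 100.0 for x in g]
        tau = 1.0 / math.sqrt(t)  # assumed decaying step size
        m_med = clipped_median_step(m_med, g, tau)
        m_mean = momentum_mean_step(m_mean, g, tau)
    return true_grad, m_med, m_mean


if __name__ == "__main__":
    true_grad, m_med, m_mean = run_demo()
    print("clipped (median) error:", math.dist(m_med, true_grad))
    print("momentum (mean) error: ", math.dist(m_mean, true_grad))
```

In this sketch the clipped tracker moves by at most tau per sample, so a single outlier can shift it only slightly, whereas the averaged tracker is pulled in proportion to the outlier's magnitude.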
