We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients may be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate that does not diverge too far from the majority of cheap stochastic sub-processes, each run for a single pass over its own partition of the data. In addition to formal guarantees, we also provide an empirical analysis of robustness to perturbations of the experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement and trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems.
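As a rough illustration of the divide-and-select idea described above (not the exact procedure or selection rule analyzed here), the sketch below partitions the data, runs one cheap single-pass SGD sub-process per partition, and then keeps the candidate whose median distance to the other candidates is smallest, i.e., the one that does not diverge too far from the majority. The function names, the squared-loss sub-process, and the median-distance selection rule are all illustrative assumptions.

```python
import numpy as np

def sgd_single_pass(X, y, w0, lr=0.01):
    """One pass of plain SGD on squared loss; a stand-in for any cheap sub-process."""
    w = w0.copy()
    for xi, yi in zip(X, y):
        grad = (xi @ w - yi) * xi  # gradient of 0.5 * (x.w - y)^2
        w -= lr * grad
    return w

def robust_merge(candidates):
    """Pick the candidate whose median distance to the others is smallest
    (one plausible notion of 'close to the majority')."""
    W = np.stack(candidates)                                   # shape (k, d)
    dists = np.linalg.norm(W[:, None] - W[None, :], axis=-1)   # pairwise distances
    scores = np.median(dists, axis=1)                          # robust distance to the rest
    return W[np.argmin(scores)]

def scalable_robust_learner(X, y, k=10, seed=None):
    """Split the data into k disjoint chunks, run one sub-process per chunk
    (trivially parallelizable), then robustly select a final candidate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    chunks = np.array_split(idx, k)
    w0 = np.zeros(X.shape[1])
    candidates = [sgd_single_pass(X[c], y[c], w0) for c in chunks]
    return robust_merge(candidates)
```

The selection step touches only the k candidate vectors rather than per-step gradients, which is why this style of procedure avoids the per-iteration cost of robust gradient aggregation.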