Thanks to their practical efficiency and the random nature of the data, stochastic first-order methods are standard for training large-scale machine learning models. Yet this randomness may cause a particular run of an algorithm to produce a highly suboptimal objective value, whereas theoretical guarantees are usually stated only for the expected objective value. It is therefore essential to guarantee theoretically that algorithms achieve a small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have high-probability complexity bounds whose dependence on the confidence level is either negative-power, or logarithmic but only under an additional assumption of sub-Gaussian (light-tailed) noise, an assumption that may fail in practice, e.g., in several NLP tasks. In this paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive these results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis covers generalized smooth objectives with H\"older-continuous gradients, and for both methods we provide an extension to strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all regimes, while the second one is optimal in the non-smooth setting.
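For context, a standard form of the gradient clipping step underlying such methods is sketched below; the symbols $\gamma_k$ (stepsize), $\lambda_k$ (clipping level), and $\nabla f(x^k,\xi^k)$ (stochastic gradient) are illustrative notation, and the specific stepsize and clipping-level schedules proposed in the paper are not reproduced here.
\[
  \operatorname{clip}\big(\nabla f(x^k,\xi^k),\lambda_k\big)
  = \min\!\left\{1,\ \frac{\lambda_k}{\|\nabla f(x^k,\xi^k)\|_2}\right\}\nabla f(x^k,\xi^k),
  \qquad
  x^{k+1} = x^k - \gamma_k\,\operatorname{clip}\big(\nabla f(x^k,\xi^k),\lambda_k\big).
\]
Clipping bounds the norm of each stochastic gradient by $\lambda_k$, which is what allows high-probability guarantees even when the noise is heavy-tailed rather than sub-Gaussian.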