Recently, several studies have considered the stochastic optimization problem in a heavy-tailed noise regime, i.e., the difference between the stochastic gradient and the true gradient is assumed to have a finite $p$-th moment (say, upper bounded by $\sigma^{p}$ for some $\sigma\geq0$) where $p\in(1,2]$, which not only generalizes the traditional finite-variance assumption ($p=2$) but has also been observed in practice for several different tasks. Under this challenging assumption, much new progress has been made for both convex and nonconvex problems; however, most existing works only consider smooth objectives. In contrast, the problem has not been fully explored or well understood when the objective is nonsmooth. This paper aims to fill this crucial gap by providing a comprehensive analysis of stochastic nonsmooth convex optimization with heavy-tailed noises. We revisit a simple clipping-based algorithm, which so far has only been proved to converge in expectation and, moreover, only under the additional strong convexity assumption. Under appropriate choices of parameters, for both convex and strongly convex functions, we not only establish the first high-probability rates but also give refined in-expectation bounds compared with existing works. Remarkably, all of our results are optimal (or nearly optimal up to logarithmic factors) with respect to the time horizon $T$, even when $T$ is unknown in advance. Additionally, we show how to make the algorithm parameter-free with respect to $\sigma$; in other words, the algorithm can still guarantee convergence without any prior knowledge of $\sigma$.
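As an illustrative sketch of the setting (the symbols $g_t$, $\tau_t$, and $\eta_t$ are introduced here for exposition and the paper's exact clipping schedule may differ): the heavy-tailed assumption states that the stochastic gradient $g_t$ at the iterate $x_t$ satisfies $\mathbb{E}\big[\|g_t-\nabla f(x_t)\|^{p}\big]\leq\sigma^{p}$ for some $p\in(1,2]$, and a generic clipping-based update takes the form $\hat{g}_t=\min\{1,\tau_t/\|g_t\|\}\,g_t$ followed by $x_{t+1}=x_t-\eta_t\hat{g}_t$, where $\tau_t>0$ is a clipping threshold and $\eta_t>0$ a step size.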