In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted growing interest due to its good practical performance but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works either restrict attention to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Beyond this important question, many theoretical problems remain open. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results for smooth optimization have been developed. Additionally, the existing results are all limited to a non-composite objective and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the broader settings of composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates, both in expectation and in high probability, that accommodates general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.
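To make the object of these guarantees concrete, the following is a minimal sketch (not from the paper) of projected SGD on a Lipschitz convex toy objective, returning the last iterate $x_T$ rather than the averaged iterate; the function names, the toy objective, and the standard step-size choice $\eta_t = c/\sqrt{t}$ are illustrative assumptions.

```python
import numpy as np

def projected_sgd_last_iterate(grad_oracle, project, x0, T, c=0.1, seed=0):
    """Run projected SGD for T steps and return the *last* iterate x_T
    (not a running average), with step sizes eta_t = c / sqrt(t)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad_oracle(x, rng)                 # stochastic (sub)gradient at x
        x = project(x - (c / np.sqrt(t)) * g)   # gradient step followed by projection
    return x                                     # last iterate: the quantity the bounds control

# Illustrative example: minimize E|a^T x| (Lipschitz, convex) over the unit ball.
d = 5

def grad_oracle(x, rng):
    a = rng.standard_normal(d)
    return np.sign(a @ x) * a                   # subgradient of |a^T x|

def project(x):
    """Projection onto the Euclidean unit ball (a compact domain for the toy example)."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x_T = projected_sgd_last_iterate(grad_oracle, project, np.ones(d), T=10_000)
print(np.linalg.norm(x_T))                      # distance to the optimum x = 0
```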