In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and non-convex optimization with sub-Gaussian noise. In previous works on convex optimization, the convergence is either shown only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The same approach applies to the non-convex case. For SGD, we demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ rate when $T$ is unknown, where $1-\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. Additionally, we demonstrate that our techniques can be used to obtain a high probability bound for AdaGrad-Norm (Ward et al., 2019) that removes the bounded gradients assumption from previous works. Furthermore, our technique for AdaGrad-Norm extends to the standard per-coordinate AdaGrad algorithm (Duchi et al., 2011), providing the first noise-adapted high probability convergence for AdaGrad.
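As a minimal illustration of the two algorithms analyzed (not the paper's exact step-size constants, which are specified in the body), the sketch below shows plain SGD with a fixed step size on the order of $1/\sqrt{T}$ and the AdaGrad-Norm update of Ward et al. (2019), which scales a single step size by the accumulated squared norms of the stochastic gradients. The names `grad_fn`, `eta`, and `b0` are illustrative placeholders.

```python
import numpy as np

def sgd_known_T(grad_fn, x0, T, eta):
    """SGD with a fixed step size eta, used when the horizon T is known.

    grad_fn(x) is assumed to return an unbiased stochastic gradient with
    sub-Gaussian noise; eta is typically chosen on the order of 1/sqrt(T).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_fn(x)
    return x

def adagrad_norm(grad_fn, x0, T, eta=1.0, b0=1e-8):
    """AdaGrad-Norm (Ward et al., 2019): one adaptive step size driven by the
    running sum of squared gradient norms; no bounded-gradient assumption is
    needed in the high probability analysis described above.
    """
    x = np.asarray(x0, dtype=float)
    b2 = b0 ** 2
    for _ in range(T):
        g = grad_fn(x)
        b2 += np.dot(g, g)                 # accumulate ||g_t||^2
        x = x - (eta / np.sqrt(b2)) * g    # adaptive step x_{t+1} = x_t - eta/b_{t+1} * g_t
    return x
```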