We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowing $\sigma^2$. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer, but its rate is slowed down by a factor proportional to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{\kappa}} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
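To make the exponential step-size schedule concrete, the following Python sketch runs $T$ iterations of SGD with a geometrically decaying step-size. It is an illustrative sketch under stated assumptions, not the paper's exact algorithm: `grad_fn(x, t)` is a hypothetical stochastic-gradient oracle, `L` is an estimate of the smoothness constant, and the decay rule $\alpha = (\beta/T)^{1/T}$ with $\beta = 1$ is one common parameterization of exponentially decreasing step-sizes.

```python
import numpy as np

def sgd_exp_step(grad_fn, x0, T, L, beta=1.0):
    """Sketch of SGD with exponentially decreasing step-sizes.

    grad_fn(x, t) -- assumed stochastic-gradient oracle (not from the paper)
    x0            -- initial iterate
    T             -- number of iterations
    L             -- estimate of the smoothness constant
    beta          -- illustrative constant controlling the final step-size
    """
    alpha = (beta / T) ** (1.0 / T)   # geometric decay factor, one common choice
    eta0 = 1.0 / L                    # initial step-size from the smoothness estimate
    x = np.array(x0, dtype=float)
    for t in range(T):
        eta_t = eta0 * alpha ** (t + 1)   # step-size shrinks geometrically with t
        x = x - eta_t * grad_fn(x, t)     # standard SGD update
    return x
```

The geometric decay keeps the step-size large (close to $1/L$) during the early, bias-dominated phase and shrinks it toward roughly $\beta/(LT)$ by the end, which is the mechanism behind adapting to the noise level $\sigma^2$ without knowing it.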