In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b:= \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online), matching our lower bounds for both nonconvex finite-sum and online problems. In addition, we show that for nonconvex functions satisfying the Polyak-\L{}ojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log \frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, showing that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.
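To make the update rule concrete, the sketch below illustrates one way the PAGE estimator described above could be implemented on a toy least-squares finite sum. It is a minimal illustration, not the paper's implementation: the problem instance, batch sizes, step size, and the helper `grad_minibatch` are illustrative assumptions, and the probability is set by the simple rule $p = b'/(b+b')$ with large batch $b$ and small batch $b'$.

```python
import numpy as np

# Toy finite-sum problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_minibatch(x, idx):
    """Average gradient of the sampled component functions at x (illustrative helper)."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - y[idx]) / len(idx)

def page(T=500, b=64, b_prime=8, eta=0.05):
    """One possible PAGE loop: illustrative batch sizes and step size, not tuned values."""
    p = b_prime / (b + b_prime)  # simple choice of the switching probability
    x = np.zeros(d)
    g = grad_minibatch(x, rng.choice(n, size=b, replace=False))  # initial estimate on a large batch
    for _ in range(T):
        x_new = x - eta * g
        if rng.random() < p:
            # With probability p: vanilla minibatch SGD-style gradient on the larger batch.
            g = grad_minibatch(x_new, rng.choice(n, size=b, replace=False))
        else:
            # With probability 1 - p: reuse the previous gradient with a cheap
            # correction computed on a small batch (much lower cost per step).
            idx = rng.choice(n, size=b_prime, replace=False)
            g = g + grad_minibatch(x_new, idx) - grad_minibatch(x, idx)
        x = x_new
    return x

x_hat = page()
print("final full-gradient norm:", np.linalg.norm(A.T @ (A @ x_hat - y) / n))
```

The key design point visible in the sketch is that the expensive large-batch gradient is recomputed only occasionally, while most iterations pay only for a small-batch difference term, which is what drives the low per-iteration cost of the estimator.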