In this paper, we propose a novel stochastic gradient estimator---ProbAbilistic Gradient Estimator (PAGE)---for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p$, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p$. We give a simple formula for the optimal choice of $p$. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds in both the finite-sum and online regimes, which establish that PAGE is an optimal method. In addition, we show that for nonconvex functions satisfying the Polyak-\L{}ojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, and the results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.
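To make the estimator concrete, below is a minimal NumPy sketch of a PAGE-style loop on a toy least-squares finite sum. The problem, the \texttt{grad\_batch} oracle, the step size, the batch sizes, and the displayed choice of $p$ are illustrative assumptions chosen in the spirit of the paper's simple formula; this is not the paper's exact pseudocode or tuning.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum problem: f(x) = (1/n) * sum_i 0.5 * ||A_i x - b_i||^2
n, d = 100, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_batch(x, idx):
    """Average gradient of the sampled component functions at x."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

def page_sgd(steps=500, eta=0.05, batch=20, small_batch=4):
    """PAGE-style loop (sketch): with probability p take a fresh minibatch
    gradient; otherwise reuse the previous estimator and correct it with a
    cheap gradient difference on a small minibatch."""
    # Assumed choice of p, in the spirit of the paper's simple formula.
    p = small_batch / (batch + small_batch)
    x = np.zeros(d)
    g = grad_batch(x, np.arange(n))  # initial full-batch gradient
    for _ in range(steps):
        x_new = x - eta * g
        if rng.random() < p:
            idx = rng.choice(n, size=batch, replace=False)
            g = grad_batch(x_new, idx)
        else:
            idx = rng.choice(n, size=small_batch, replace=False)
            g = g + grad_batch(x_new, idx) - grad_batch(x, idx)
        x = x_new
    return x

if __name__ == "__main__":
    x_hat = page_sgd()
    print("final full-gradient norm:",
          np.linalg.norm(grad_batch(x_hat, np.arange(n))))
\end{verbatim}

The low-probability branch is where the computational saving comes from: it touches only a small minibatch yet keeps the estimator anchored to the previous gradient, which is what allows the matching upper bounds described above.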