Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD, enhanced with a number of additional tricks such as minibatching, momentum, and adaptive stepsizes. One of the tricks that works so well in practice that it is used as the default in virtually all widely used machine learning software is {\em random reshuffling (RR)}. However, the practical benefits of RR had until very recently eluded a satisfactory theoretical explanation. Motivated by recent developments due to Mishchenko, Khaled and Richt\'{a}rik (2020), in this work we provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems. First, we show that RR-SVRG converges linearly with the rate $\mathcal{O}(\kappa^{3/2})$ in the strongly convex case, and that this rate improves further to $\mathcal{O}(\kappa)$ in the big data regime (when $n > \mathcal{O}(\kappa)$), where $\kappa$ is the condition number. This improves upon the previous best rate of $\mathcal{O}(\kappa^2)$ known for a variance-reduced RR method in the strongly convex case, due to Ying, Yuan and Sayed (2020). Second, we obtain the first sublinear rate for general convex problems. Third, we establish similar fast rates for Cyclic-SVRG and Shuffle-Once-SVRG. Finally, we develop and analyze a more general variance reduction scheme for RR, which allows for less frequent updates of the control variate. We corroborate our theoretical results with suitably chosen experiments on synthetic and real datasets.
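For concreteness, here is a minimal sketch of the update being analyzed, with notation assumed for illustration rather than taken verbatim from the paper: at the start of epoch $t$, RR-SVRG fixes a reference point $w_t$ (the control variate anchor) and samples a permutation $\pi = (\pi_0, \ldots, \pi_{n-1})$ of $\{1, \ldots, n\}$; it then performs $n$ inner steps
\[
x_t^{i+1} = x_t^i - \gamma \left( \nabla f_{\pi_i}(x_t^i) - \nabla f_{\pi_i}(w_t) + \nabla f(w_t) \right), \qquad i = 0, 1, \ldots, n-1,
\]
where $f = \tfrac{1}{n}\sum_{j=1}^n f_j$ is the finite-sum objective and $\gamma > 0$ is the stepsize, and sets $x_{t+1}^0 = x_t^n$. In the basic variant the control variate is refreshed every epoch ($w_{t+1} = x_{t+1}^0$), whereas the more general scheme mentioned above allows $w$ to be updated only once every several epochs.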