Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention recently and, for strongly convex and smooth functions, it was shown to converge faster than SGD if 1) the stepsize is small, 2) the gradients are bounded, and 3) the number of epochs is large. We remove these 3 assumptions, improve the dependence on the condition number from $\kappa^2$ to $\kappa$ (resp. from $\kappa$ to $\sqrt{\kappa}$) and, in addition, show that RR has a different type of variance. We argue through theory and experiments that the new variance type gives an additional justification of the superior performance of RR. To go beyond strong convexity, we present several results for non-strongly convex and non-convex objectives. We show that in all cases, our theory improves upon existing literature. Finally, we prove fast convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only once, at the beginning of the optimization process. Our theory for strongly-convex objectives tightly matches the known lower bounds for both RR and SO and substantiates the common practical heuristic of shuffling once or only a few times. As a byproduct of our analysis, we also get new results for the Incremental Gradient algorithm (IG), which does not shuffle the data at all.
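To make the three data-ordering schemes concrete, below is a minimal sketch (not code from the paper) of RR alongside its Shuffle-Once (SO) and Incremental Gradient (IG) variants for a finite-sum objective $f = \frac{1}{n}\sum_i f_i$; the function and parameter names (`random_reshuffling`, `grads`, `stepsize`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def random_reshuffling(grads, x0, stepsize, n_epochs, rng=None, shuffle="every"):
    """Sketch of permutation-based gradient methods for f = (1/n) * sum_i f_i.

    grads   : list of per-component gradient functions grad_i(x)
    shuffle : "every" -> RR (fresh permutation each epoch),
              "once"  -> SO (one permutation, reused every epoch),
              "never" -> IG (fixed order 0..n-1, no shuffling).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    n = len(grads)
    perm = rng.permutation(n) if shuffle == "once" else np.arange(n)
    for _ in range(n_epochs):
        if shuffle == "every":
            perm = rng.permutation(n)       # reshuffle the data each epoch
        for i in perm:                      # one full pass over the data = one epoch
            x = x - stepsize * grads[i](x)  # gradient step on component f_i
    return x

# Tiny usage example on least squares, f_i(x) = 0.5 * (a_i @ x - b_i)**2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grads = [lambda x, a=a_i, b_=b_i: a * (a @ x - b_) for a_i, b_i in zip(A, b)]
x_rr = random_reshuffling(grads, np.zeros(5), stepsize=0.05, n_epochs=50, rng=rng)
```

The only difference between the three variants is when (if ever) the permutation is redrawn, which is exactly the distinction the analysis in the paper addresses.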