We theoretically analyzed the convergence behavior of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster convergence than using a constant batch size, not only with a constant learning rate but also with a decaying learning rate such as cosine-annealing or polynomial decay. The convergence rate improves from $O(T^{-1}+C)$ with a constant batch size to $O(T^{-1})$ with an increasing batch size, where $T$ denotes the total number of iterations and $C$ is a constant. Using principal component analysis and low-rank matrix completion, we investigated, both theoretically and numerically, how an increasing batch size affects computational time as quantified by stochastic first-order oracle (SFO) complexity. An increasing batch size was found to reduce the SFO complexity of RSGD and, furthermore, to offer the advantages of both small and large constant batch sizes.
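To make the setting concrete, the following is a minimal sketch of RSGD on the unit sphere for the leading principal component, using a geometrically increasing batch size. The objective, the growth schedule (`b0`, `growth`), the learning rate, and the projection-plus-normalization retraction are illustrative assumptions for this toy example, not the paper's exact algorithm or experimental configuration.

```python
import numpy as np

def rsgd_pca_sphere(X, T=200, lr=0.1, b0=8, growth=1.05, seed=0):
    """Sketch: RSGD on the unit sphere for the leading principal
    component, with a geometrically increasing batch size.

    X: (n_samples, d) data matrix (assumed centered).
    lr: constant learning rate (a decaying schedule could be used instead).
    b0, growth: initial batch size and per-iteration growth factor (assumed schedule).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                      # start on the sphere
    for t in range(T):
        b = min(n, int(b0 * growth ** t))       # increasing batch size
        idx = rng.choice(n, size=b, replace=False)
        Xb = X[idx]
        # Euclidean gradient of f(w) = -||Xb w||^2 / (2b)
        egrad = -(Xb.T @ (Xb @ w)) / b
        # Riemannian gradient: project onto the tangent space at w
        rgrad = egrad - (w @ egrad) * w
        # Retraction: take the tangent step, then renormalize onto the sphere
        w = w - lr * rgrad
        w /= np.linalg.norm(w)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))
    X = rng.standard_normal((2000, 5)) @ A      # correlated synthetic data
    X -= X.mean(axis=0)
    w = rsgd_pca_sphere(X)
    # compare with the exact leading eigenvector of the sample covariance
    C = X.T @ X / X.shape[0]
    v = np.linalg.eigh(C)[1][:, -1]
    print("alignment |<w, v>| =", abs(w @ v))
```

Under this kind of schedule, each iteration's gradient estimate becomes less noisy as training progresses, which is the mechanism behind the improvement from $O(T^{-1}+C)$ to $O(T^{-1})$ discussed above; the SFO complexity then depends on how fast the batch size grows.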