In decentralized learning, a network of nodes cooperates to minimize an overall objective function, usually the finite sum of their local objectives plus a non-smooth regularization term added for better generalization. The decentralized stochastic proximal gradient (DSPG) method is commonly used to train such models, but its convergence rate is slowed by the variance of the stochastic gradients. In this paper, we propose a novel algorithm, DPSVRG, which accelerates decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator at each node that periodically tracks the local full gradient and uses it to correct the stochastic gradient at every iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and by controlling the bounds of the error sequences, we prove that DPSVRG converges at the rate $O(1/T)$ for general convex objectives plus a non-smooth term, where $T$ is the number of iterations, whereas DSPG converges at the rate $O(\frac{1}{\sqrt{T}})$. Our experiments on different applications, network topologies, and learning models demonstrate that DPSVRG converges much faster than DSPG, and that its loss decreases smoothly over the training epochs.
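To make the correction step concrete, a minimal sketch of one local update in the standard SVRG style is given below, under assumed notation not fixed by the abstract: $x_i^t$ is node $i$'s iterate, $\tilde{x}_i$ its most recent snapshot point, $f_{i,\xi}$ a sampled component of the local objective $f_i$, $g$ the non-smooth regularizer, and $\eta$ the step size:
\[
v_i^t = \nabla f_{i,\xi}(x_i^t) - \nabla f_{i,\xi}(\tilde{x}_i) + \nabla f_i(\tilde{x}_i),
\qquad
x_i^{t+1} = \operatorname{prox}_{\eta g}\!\left(x_i^t - \eta\, v_i^t\right).
\]
Here the snapshot $\tilde{x}_i$ and the local full gradient $\nabla f_i(\tilde{x}_i)$ are refreshed only periodically, which is what keeps the variance of $v_i^t$ decaying; the consensus (neighbor-mixing) step of the decentralized algorithm is omitted from this sketch, as the abstract does not specify its form.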