Decentralized training removes the central server, yielding a communication-efficient approach that can significantly improve training efficiency; however, it often underperforms centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly narrowing the empirical performance gap. However, the theoretical reasons for their effectiveness, and whether MGS can fully eliminate this gap, remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. (1) Optimization error reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. (2) Gap to centralization: even as the number of gossip steps approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{nm})$ for centralized vs. $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ for decentralized). Furthermore, we provide the first unified analysis of how factors such as the learning rate, data heterogeneity, node count, per-node sample size, and communication topology affect the generalization of MGS in non-convex settings without the bounded-gradient assumption, filling a critical theoretical gap in decentralized training. Finally, experiments on CIFAR datasets support our theoretical findings.
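For intuition, here is a minimal sketch of the MGS update under the standard gossip formulation (the symbols $X_t$, $G_t$, $W$, $Q$, and $\eta$ are our illustrative notation, not fixed by this abstract): each round applies a local SGD step on every node followed by $Q$ gossip averaging steps with a doubly stochastic mixing matrix $W$,
$$X_{t+1} = W^{Q}\left(X_t - \eta\, G_t\right),$$
where row $i$ of $X_t$ holds node $i$'s parameters and row $i$ of $G_t$ its stochastic gradient; $Q = 1$ recovers plain decentralized SGD, and larger $Q$ contracts the rows of $X_t$ toward their average at a rate governed by the spectral gap of $W$. Note also that $T^{\frac{2c\beta}{2c\beta+2}} = T^{\frac{c\beta}{c\beta+1}}$, so the two bounds above share the same dependence on $T$; their ratio then reduces to $m^{\frac{2c\beta+1}{2c\beta+2}}$, which grows with $m$ and makes the residual gap to centralized training explicit.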