This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(N^{-1}+m^{-1}+\lambda^2)$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the number of workers, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}(N^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+\lambda^{1+\alpha} + \phi_{\mathcal{S}})$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and it explains why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
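To make the quantities in the bound concrete, the following is a minimal sketch (not the paper's released code) of one D-SGD iteration and of the spectral gap $1-\lambda$ of a gossip matrix, illustrated here for a ring topology; the function and variable names (gossip_matrix_ring, spectral_gap, dsgd_step) are illustrative assumptions, and the update order shown is one common D-SGD variant.

```python
# Minimal sketch, assuming a ring topology with a doubly stochastic gossip matrix.
import numpy as np

def gossip_matrix_ring(m):
    """Doubly stochastic mixing matrix for a ring of m workers
    (each worker averages itself with its two neighbours)."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 1 / 3
        W[i, (i - 1) % m] = 1 / 3
        W[i, (i + 1) % m] = 1 / 3
    return W

def spectral_gap(W):
    """Return 1 - lambda, where lambda is the second-largest eigenvalue
    magnitude of the gossip matrix W; smaller gap = poorer connectivity."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eigvals[1]

def dsgd_step(X, W, grads, lr):
    """One D-SGD iteration (one common variant): each worker takes a local
    stochastic gradient step, then parameters are mixed by gossip averaging.
    X: (m, d) worker parameters; grads: (m, d) local stochastic gradients."""
    return W @ (X - lr * grads)

m = 16
W = gossip_matrix_ring(m)
print("spectral gap of the ring:", spectral_gap(W))  # shrinks as m grows
```

The consensus model analyzed in the paper corresponds to the average of the worker parameters (the mean of the rows of `X` above), and the bound degrades as the spectral gap printed here approaches zero.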