This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(n^{-1}+ m^{-1} +\lambda^2)$-stable in expectation in the non-convex, non-smooth setting, where $n$ is the local sample size on each worker, $m$ is the number of workers, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}(n^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+ \lambda^{1+\alpha} + \phi_\mathcal{S})$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and it explains why consensus control in the initial training phase ensures better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet support our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at \url{https://github.com/Raiden-Zhu/Generalization-of-DSGD}.
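For concreteness, the spectral gap $1-\lambda$ in the bounds above is a property of the doubly stochastic gossip matrix $W$ used by D-SGD, with $\lambda$ the second-largest eigenvalue magnitude of $W$. The following NumPy sketch (not part of the paper's released code; the ring topology and its uniform $1/3$ weights are illustrative assumptions) shows how the gap can be computed, and how a sparse ring yields a much smaller gap than a fully connected topology.

\begin{verbatim}
import numpy as np

def spectral_gap(W):
    """Spectral gap 1 - lambda of a symmetric doubly stochastic gossip
    matrix W, where lambda is the second-largest eigenvalue magnitude."""
    eig_mags = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    lam = eig_mags[1]  # largest magnitude after the Perron eigenvalue 1
    return 1.0 - lam

def ring_gossip_matrix(m):
    """Illustrative ring of m workers: each worker averages equally
    (weight 1/3) with itself and its two neighbours."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = 1.0 / 3
        W[i, (i - 1) % m] = 1.0 / 3
        W[i, (i + 1) % m] = 1.0 / 3
    return W

# Sparser topology => smaller spectral gap (lambda closer to 1).
print(spectral_gap(ring_gossip_matrix(16)))    # small gap for a ring
print(spectral_gap(np.full((16, 16), 1 / 16))) # gap = 1, fully connected
\end{verbatim}

Under this reading of the bounds, topologies with $\lambda$ close to $1$ (e.g., large rings) would inflate the $\lambda^{1+\alpha}$ term, while well-connected topologies keep it small.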