Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from degradation in model quality: the training and test performance of models trained in a decentralized fashion is generally worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology, and data partitioning. We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as its centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we provide practical training guidelines and exemplify their effectiveness in the data-center setup as an important first step.
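To make the tracked quantity concrete, the following minimal sketch computes one common definition of consensus distance, the root-mean-square deviation of the workers' parameter vectors from their average; the flattened-parameter layout and the function name `consensus_distance` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def consensus_distance(local_params: np.ndarray) -> float:
    """Consensus distance under the assumed definition
    sqrt((1/n) * sum_i ||x_i - x_bar||^2), where x_bar is the
    average of the n workers' (flattened) parameter vectors.

    local_params: array of shape (n_workers, d), one row per worker.
    """
    x_bar = local_params.mean(axis=0)                      # virtual averaged ("centralized") model
    sq_dev = np.sum((local_params - x_bar) ** 2, axis=1)   # per-worker squared deviation
    return float(np.sqrt(sq_dev.mean()))

# Illustrative usage: 8 hypothetical workers with 10-dimensional models.
rng = np.random.default_rng(0)
workers = rng.normal(size=(8, 10))
print(consensus_distance(workers))  # larger values indicate more drift from consensus
```

A value of zero would correspond to all workers holding identical parameters (the centralized regime); the theory above concerns how small this quantity must stay during training for decentralized convergence to match that regime.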