The convergence speed of machine learning models trained with Federated Learning is significantly affected by heterogeneous data partitions, even more so in a fully decentralized setting without a central server. In this paper, we show that the impact of label distribution skew, an important type of data heterogeneity, can be significantly reduced by carefully designing the underlying communication topology. We present D-Cliques, a novel topology that reduces gradient bias by grouping nodes into sparsely interconnected cliques such that the label distribution in each clique is representative of the global label distribution. We also show how to adapt the updates of decentralized SGD to obtain unbiased gradients and to implement an effective momentum with D-Cliques. Our extensive empirical evaluation on MNIST and CIFAR10 demonstrates that our approach achieves a convergence speed similar to that of a fully-connected topology, which yields the best convergence in a data-heterogeneous setting, while significantly reducing the number of edges and messages. In a 1000-node topology, D-Cliques requires 98% fewer edges and 96% fewer total messages, with further possible gains from using a small-world topology across cliques.
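To make the clique-construction idea concrete, the following is a minimal, hypothetical sketch (not the paper's exact procedure): nodes, each described by its per-label sample counts, are greedily grouped so that every clique's aggregate label distribution stays close to the global one. The function name `build_cliques`, the `clique_size` parameter, and the L1-distance criterion are illustrative assumptions.

```python
import numpy as np

def build_cliques(label_counts, clique_size=10, seed=0):
    """Greedily group nodes into cliques with near-global label distributions.

    label_counts: (n_nodes, n_labels) array of per-node sample counts per label.
    Returns a list of cliques, each a list of node indices.
    """
    rng = np.random.default_rng(seed)
    n_nodes, _ = label_counts.shape
    global_dist = label_counts.sum(axis=0) / label_counts.sum()
    remaining = list(rng.permutation(n_nodes))
    cliques = []
    while remaining:
        clique = [remaining.pop()]  # seed the clique with an arbitrary node
        while remaining and len(clique) < clique_size:
            agg = label_counts[clique].sum(axis=0)
            # pick the remaining node whose addition keeps the clique's label
            # distribution closest (in L1 distance) to the global distribution
            best = min(
                remaining,
                key=lambda i: np.abs(
                    (agg + label_counts[i]) / (agg + label_counts[i]).sum()
                    - global_dist
                ).sum(),
            )
            clique.append(best)
            remaining.remove(best)
        cliques.append(clique)
    return cliques

# Toy usage: 100 nodes, 10 labels, each node holding samples of only 2 labels
# (an extreme label distribution skew).
rng = np.random.default_rng(1)
counts = np.zeros((100, 10), dtype=int)
for node in range(100):
    for lbl in rng.choice(10, size=2, replace=False):
        counts[node, lbl] = 50
print(build_cliques(counts)[:2])
```

With cliques formed this way, only the sparse intra-clique and inter-clique edges need to be materialized, which is where the reported reduction in edges and messages comes from.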