One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the Decentralized Stochastic Gradient Descent (D-SGD) algorithm under data heterogeneity. We exhibit the key role played by a new quantity, called \emph{neighborhood heterogeneity}, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
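For reference, D-SGD alternates a local stochastic gradient step with an averaging step over each agent's neighbors in the communication topology. A standard form of the update is sketched below; the notation (mixing matrix $W$, step size $\eta$, local objectives $F_j$) is ours and is not taken verbatim from the analysis in the paper:
\[
\theta_i^{(t+1)} \;=\; \sum_{j=1}^{n} W_{ij}\Big(\theta_j^{(t)} - \eta\, \nabla F_j\big(\theta_j^{(t)};\, \xi_j^{(t)}\big)\Big),
\]
where $W$ is a doubly stochastic mixing matrix supported on the edges of the topology, $\eta$ is the step size, and $\nabla F_j(\cdot;\xi_j^{(t)})$ is a stochastic gradient of agent $j$'s local objective computed on the sample $\xi_j^{(t)}$.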