One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
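To make the algorithm referred to in the abstract concrete, below is a minimal sketch of one synchronous D-SGD round: each agent takes a local stochastic gradient step and then averages its model with its neighbors' according to a mixing matrix encoding the communication topology. The function name `d_sgd_step`, the array shapes, and the adapt-then-combine ordering are illustrative assumptions, not specifics taken from the paper.

```python
import numpy as np

def d_sgd_step(x, grads, W, lr):
    """One synchronous D-SGD round (illustrative sketch, not the paper's code).

    x     : (n_agents, dim) array of current local models
    grads : (n_agents, dim) array of stochastic gradients, one per agent
    W     : (n_agents, n_agents) doubly stochastic mixing matrix encoding
            the communication topology (W[i, j] > 0 iff agent j sends to agent i)
    lr    : learning rate
    """
    # Local stochastic gradient step at every agent.
    x_local = x - lr * grads
    # Gossip averaging: each agent mixes its model with those of its neighbors.
    return W @ x_local
```

Under this sketch, the topology enters only through `W`; choosing its sparsity pattern (and weights) is what the data-dependent topology learning step, solved with a Frank-Wolfe algorithm in the paper, is about.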