In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with a few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers, with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically, while using the same communication volume per step as RelaySum. We prove that the convergence of RelaySGD, which builds on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at http://github.com/epfml/relaysgd.
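The relaying idea can be illustrated with a minimal sketch, not taken from the paper's code: workers on a chain graph (a spanning tree) each hold a local value, and along every tree edge they relay a running partial sum and count of everything received from the other side. All variable names here are illustrative. After a number of steps equal to the tree's diameter, every worker recovers the exact network average, whereas gossip averaging would only approach it asymptotically.

```python
# Sketch of RelaySum-style averaging on a chain of n workers.
# Each worker i holds a scalar x[i]; per tree edge (i, j) it relays
# (partial_sum, count) of its own subtree toward j.

n = 5
x = [1.0, 2.0, 3.0, 4.0, 10.0]          # heterogeneous local values
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# m[(i, j)] = (partial_sum, count) most recently sent from worker i to j
m = {(i, j): (0.0, 0) for i in range(n) for j in neighbors[i]}

for _ in range(n - 1):                   # diameter of the chain
    new_m = {}
    for i in range(n):
        for j in neighbors[i]:
            # relay own value plus everything received from the other side
            s = x[i] + sum(m[(k, i)][0] for k in neighbors[i] if k != j)
            c = 1 + sum(m[(k, i)][1] for k in neighbors[i] if k != j)
            new_m[(i, j)] = (s, c)
    m = new_m

# After `diameter` steps, each worker combines its value with all
# incoming relays to obtain the exact uniform average.
averages = []
for i in range(n):
    s = x[i] + sum(m[(j, i)][0] for j in neighbors[i])
    c = 1 + sum(m[(j, i)][1] for j in neighbors[i])
    averages.append(s / c)

print(averages)  # every worker holds the exact mean, 4.0
```

The per-step communication is one message per edge, matching gossip averaging, but the relayed partial sums make the averaging exact after finitely many steps rather than asymptotically.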