Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. Our experiments show a 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and a 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.
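To make the idea of iterative averaging in small groups concrete, the following is a minimal simulation sketch and not the authors' implementation. It assumes a simplified setting that the abstract does not specify: peers sit on a full grid, no peer fails, and in each round every peer averages its value with the other peers in its group along one grid axis. The names `grouped_averaging_round` and `simulate` are hypothetical and exist only for illustration.

```python
import numpy as np


def grouped_averaging_round(values, axis):
    """Replace each peer's value with the mean of its group along one grid axis."""
    group_mean = values.mean(axis=axis, keepdims=True)
    # Every peer in a group ends the round holding the same group mean.
    return np.broadcast_to(group_mean, values.shape).copy()


def simulate(grid_shape=(4, 4, 4), seed=0):
    """Run one averaging round per grid axis and track the worst-case error.

    Assumption (illustrative only): each grid cell is one peer that holds a
    single scalar, and groups are formed along a different axis each round.
    """
    rng = np.random.default_rng(seed)
    values = rng.normal(size=grid_shape)
    target = float(values.mean())  # the global average all peers should reach
    errors = [float(np.abs(values - target).max())]
    for axis in range(values.ndim):
        values = grouped_averaging_round(values, axis)
        errors.append(float(np.abs(values - target).max()))
    return target, values, errors


if __name__ == "__main__":
    target, values, errors = simulate()
    for round_index, err in enumerate(errors):
        print(f"after round {round_index}: max deviation from global mean = {err:.3e}")
```

In this idealized grid, each peer only ever communicates with the few peers in its current group, yet every peer holds the global average once all axes have been used. With unreliable peers or incomplete groups the result would only be approximate, which is the regime the convergence claim in the abstract addresses.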