As datasets and models become increasingly large, distributed training has become a necessary component for training deep neural networks in reasonable amounts of time. However, distributed training can have substantial communication overhead that hinders its scalability. One strategy for reducing this overhead is to perform multiple unsynchronized SGD steps independently on each worker between synchronization steps, a technique known as local SGD. We conduct a comprehensive empirical study of local SGD and related methods on a large-scale image classification task. We find that performing local SGD comes at a price: lower communication costs (and thereby faster training) are accompanied by lower accuracy. This finding contrasts with the smaller-scale experiments in prior work, suggesting that local SGD encounters challenges at scale. We further show that incorporating the slow momentum framework of Wang et al. (2020) consistently improves accuracy without requiring additional communication, hinting at future directions for potentially escaping this trade-off.
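To make the two ideas in the abstract concrete, the following is a minimal single-process sketch of local SGD with an outer slow-momentum step on a toy quadratic objective. The objective, hyperparameter names (`num_workers`, `local_steps`, `inner_lr`, `slow_lr`, `slow_momentum`), and their values are illustrative assumptions, not the paper's experimental setup; with `slow_momentum = 0` and `slow_lr = 1` the outer update reduces to plain parameter averaging, i.e. vanilla local SGD.

```python
# Minimal simulation of local SGD with a slow-momentum outer update.
# Workers are simulated sequentially; a real implementation would run them
# in parallel and replace the averaging step with an all-reduce.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
A = rng.standard_normal((dim, dim))
A = A.T @ A / dim + np.eye(dim)      # well-conditioned PSD matrix
b = rng.standard_normal(dim)

def grad(x, noise_scale=0.1):
    """Stochastic gradient of f(x) = 0.5 x^T A x - b^T x."""
    return A @ x - b + noise_scale * rng.standard_normal(dim)

num_workers = 8        # K workers (simulated here, illustrative value)
local_steps = 4        # H unsynchronized SGD steps between averaging rounds
inner_lr = 0.05
slow_lr = 1.0          # outer (slow) learning rate
slow_momentum = 0.5    # outer momentum coefficient

x_global = np.zeros(dim)   # synchronized parameters
slow_buf = np.zeros(dim)   # slow momentum buffer

for round_idx in range(50):
    # Each worker starts from the synchronized parameters and runs
    # `local_steps` of plain SGD without communicating.
    local_params = []
    for _ in range(num_workers):
        x = x_global.copy()
        for _ in range(local_steps):
            x -= inner_lr * grad(x)
        local_params.append(x)

    # Synchronization: average the workers' parameters.
    x_avg = np.mean(local_params, axis=0)

    # Slow-momentum outer update: treat the averaged displacement as a
    # pseudo-gradient and apply a momentum step on top of it.
    pseudo_grad = (x_global - x_avg) / inner_lr
    slow_buf = slow_momentum * slow_buf + pseudo_grad
    x_global = x_global - slow_lr * inner_lr * slow_buf

    loss = 0.5 * x_global @ A @ x_global - b @ x_global
    print(f"round {round_idx:2d}  loss {loss:.4f}")
```

In this sketch, communication happens only once per round rather than once per SGD step, which is the source of the speedup the abstract refers to; the outer momentum step reuses the same communicated average, so it adds no extra communication.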