Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its high computational cost, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the performance of parallel SGD is still bottlenecked by high communication costs even with a fast connection among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to perform communication every few iterations, using a constant averaging period. In this paper, we show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution. Specifically, we observe that reducing the variance of model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations to reduce the initial large variance and synchronize less frequently in the later phase of the training process. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters, and thus better convergence compared with the Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method on several image classification benchmarks and show that ADPSGD indeed achieves smaller training losses and higher test accuracies with less communication than CPSGD. Compared with gradient-quantization SGD, we show that our algorithm achieves faster convergence with only half of the communication. Compared with full-communication SGD, our ADPSGD achieves 1.14x to 1.27x speedups with a 100Gbps connection among computing nodes, and the speedups increase to 1.46x to 1.95x with a 10Gbps connection.
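To make the idea of adaptive periodic parameter averaging concrete, the following is a minimal single-machine sketch in NumPy. The toy least-squares objective, the worker count, the learning rate, and the linearly increasing period schedule (`period_at`) are illustrative assumptions only; the paper's actual schedule and distributed implementation are not reproduced here.

```python
# Minimal sketch of periodic parameter averaging SGD with a growing
# averaging period, simulated on one machine with NumPy.
# Objective, hyperparameters, and the schedule are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 4, 10, 200                          # workers, dimension, iterations
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
x = np.tile(rng.standard_normal(d), (K, 1))   # each worker's model copy

def grad(xk):
    """Stochastic gradient of 0.5*||A x - b||^2 from one sampled row."""
    i = rng.integers(len(A))
    return A[i] * (A[i] @ xk - b[i])

def period_at(t, p0=1, growth=0.1):
    """Assumed schedule: averaging period grows with iteration t,
    so workers synchronize often early and rarely later."""
    return max(1, int(p0 + growth * t))

next_sync = 0
for t in range(T):
    for k in range(K):                        # one local SGD step per worker
        x[k] -= 0.01 * grad(x[k])
    if t >= next_sync:                        # periodic parameter averaging
        x[:] = x.mean(axis=0)                 # stands in for an all-reduce
        next_sync = t + period_at(t)

loss = 0.5 * np.mean((A @ x.mean(axis=0) - b) ** 2)
print(f"final loss: {loss:.4f}")
```

In a real cluster the `x.mean(axis=0)` line would be replaced by an all-reduce over the workers' parameters; the point of the sketch is only that the interval between such averaging steps increases as training progresses, spending the fixed communication budget where the parameter variance is largest.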