Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training of deep models. However, SGD only utilizes the first-order gradient in model parameter updates, so training may still take days or weeks. Recent studies have successfully exploited approximate second-order information to speed up the training process, among which the Kronecker-Factored Approximate Curvature (KFAC) method emerges as one of the most efficient approximation algorithms for training deep models. Yet, when leveraging GPU clusters to train models with distributed KFAC (D-KFAC), each iteration incurs extensive computation and introduces extra communication. In this work, we propose D-KFAC with smart parallelism of computing and communication tasks (SPD-KFAC) to reduce the iteration time. Specifically, 1) we first characterize the performance bottlenecks of D-KFAC, 2) we design and implement a pipelining mechanism for Kronecker factor computation and communication with dynamic tensor fusion, and 3) we develop a load-balancing placement for inverting multiple matrices on GPU clusters. We conduct real-world experiments on a 64-GPU cluster with a 100 Gb/s InfiniBand interconnect. Experimental results show that our proposed SPD-KFAC training scheme achieves a 10%-35% improvement over state-of-the-art algorithms.
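As background on the Kronecker factorization referenced above, a minimal sketch of the standard per-layer KFAC update (the symbols below are illustrative and not introduced in this abstract; conventions such as the vec ordering and the damping split vary across implementations): for a layer with weight matrix \(W_l\), input activations \(a_{l-1}\), and back-propagated pre-activation gradients \(g_l\), the layer's Fisher block is approximated as
\[
F_l \approx A_{l-1} \otimes G_l, \qquad A_{l-1} = \mathbb{E}\left[a_{l-1} a_{l-1}^{\top}\right], \qquad G_l = \mathbb{E}\left[g_l g_l^{\top}\right],
\]
so that, under the column-stacking vec convention and with a simple per-factor damping \(\gamma\), the preconditioned update becomes
\[
F_l^{-1}\,\mathrm{vec}\!\left(\nabla_{W_l} L\right) \approx \mathrm{vec}\!\left((G_l + \gamma I)^{-1}\,(\nabla_{W_l} L)\,(A_{l-1} + \gamma I)^{-1}\right).
\]
The Kronecker factors \(A_{l-1}\) and \(G_l\) are the tensors whose computation and communication SPD-KFAC pipelines with dynamic tensor fusion, and their damped inverses are the multiple matrices distributed across GPUs by the load-balancing placement.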