Distributed deep learning workloads include throughput-intensive training tasks on GPU clusters, where distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forcing workers to wait for gradient synchronization via a centralized parameter server or directly among decentralized workers. We present CrossoverScheduler, an algorithm that enables the communication cycles of one distributed training application to be filled by other applications through pipelining communication and computation. With CrossoverScheduler, the running performance of distributed training can be significantly improved without sacrificing convergence rate or model accuracy. We achieve this by introducing Crossover Synchronization, which allows multiple distributed deep learning applications to time-share the same GPU alternately. A prototype of CrossoverScheduler is built and integrated with Horovod. Experiments on a variety of distributed tasks show that CrossoverScheduler achieves a 20% speedup for image classification tasks on the ImageNet dataset.
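The following is a minimal sketch of the crossover idea described above, assuming a simple two-job setup: while one job performs network-bound gradient synchronization, the other job's computation fills the otherwise idle GPU time. All names here (run_job, gpu_lock, compute_step, allreduce_step) are illustrative placeholders, not part of the CrossoverScheduler or Horovod API.

```python
import threading

gpu_lock = threading.Lock()          # only one job computes on the GPU at a time

def compute_step(job, step):
    with gpu_lock:                   # forward + backward pass occupies the GPU
        print(f"{job}: compute step {step}")

def allreduce_step(job, step):
    # Gradient synchronization is network-bound, so it runs outside the GPU
    # lock; the other job's compute_step can fill this idle GPU time.
    print(f"{job}: allreduce step {step}")

def run_job(job, steps=3):
    for step in range(steps):
        compute_step(job, step)      # compute phase (holds the GPU)
        allreduce_step(job, step)    # communication phase (releases the GPU)

threads = [threading.Thread(target=run_job, args=(name,)) for name in ("jobA", "jobB")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch the two jobs naturally alternate: whenever one job releases the GPU lock to synchronize gradients, the other job acquires it for its next compute step, which is the time-sharing behavior that Crossover Synchronization provides for real distributed training applications.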