This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training. Unlike most deep learning compilers that focus on training or inference on a single device, DisCo optimizes a DNN model for distributed training over multiple GPU machines. Existing single-device compilation strategies do not work well in distributed training, mainly due to the communication inefficiency they incur. DisCo jointly generates optimized computation operator fusion and communication tensor fusion strategies to enable highly efficient distributed training. A GNN-based simulator is built to effectively estimate the per-iteration training time achieved by candidate operator/tensor fusion strategies. Driven by the simulator, a backtracking search algorithm navigates the large strategy space efficiently to identify good operator/tensor fusion strategies that minimize distributed training time. We compare DisCo with existing DL fusion schemes and show that it achieves training speed-up close to the ideal case of full computation-communication overlap.
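To make the simulator-driven backtracking search concrete, the following is a minimal Python sketch, not DisCo's implementation: the cost function `estimate_iter_time` is a toy stand-in for the GNN-based simulator, and the per-group fuse/no-fuse decisions are a simplification of the joint operator/tensor fusion strategy space described above.

```python
# Hypothetical sketch of a simulator-guided backtracking search over fusion
# decisions. All names here are illustrative, not DisCo's actual API.
from typing import List

def estimate_iter_time(decisions: List[int]) -> float:
    """Toy stand-in for the GNN-based simulator: predicts per-iteration
    training time for a (possibly partial) fusion strategy."""
    # Assume each un-fused group adds overhead that hurts overlap.
    return 1.0 + 0.1 * sum(1 for d in decisions if d == 0)

def backtrack(decisions: List[int], num_groups: int,
              best: List[float], best_plan: List[List[int]]) -> None:
    """Enumerate fuse/no-fuse choices per group, pruning branches whose
    simulated time already exceeds the best complete plan found so far."""
    est = estimate_iter_time(decisions)
    if est >= best[0]:
        return  # prune: this partial plan is already worse than the incumbent
    if len(decisions) == num_groups:
        best[0], best_plan[0] = est, decisions[:]  # record new incumbent
        return
    for choice in (1, 0):  # 1 = fuse this group, 0 = keep it separate
        backtrack(decisions + [choice], num_groups, best, best_plan)

best, best_plan = [float("inf")], [[]]
backtrack([], num_groups=4, best=best, best_plan=best_plan)
print(best_plan[0], best[0])
```

The pruning step relies on the estimated time of a partial plan being a lower bound on the time of any completion, which holds for this toy cost model; the actual system's simulator and search would handle a far richer strategy space.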