Bulk synchronous parallel (BSP) is the de facto paradigm for distributed DNN training in today's production clusters. However, due to its global synchronization nature, its performance can be significantly degraded by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contention. Existing solutions, whether system-level optimizations that strengthen BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations that replace BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) scheme that realizes communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, DS-Sync takes network bottlenecks into account and improves communication efficiency by dividing workers into non-overlapping groups that synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly under the non-convex and smooth conditions of DNNs. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to a $94\%$ improvement in end-to-end training time over existing solutions while maintaining the same accuracy.
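To make the divide-and-shuffle idea concrete, below is a minimal PyTorch sketch (our own illustration, not the paper's implementation) of one DS-Sync-style step: workers are deterministically partitioned into non-overlapping groups, gradients are all-reduced only within each group, and the partition is reshuffled at the next iteration. The function name `ds_sync_step`, the `group_size` parameter, and the seed-based shuffle are illustrative assumptions.

```python
# Minimal sketch of the divide-and-shuffle idea, assuming torch.distributed is
# already initialized and world_size is divisible by group_size. This is an
# illustration of the concept described above, not the authors' code.
import torch
import torch.distributed as dist


def ds_sync_step(model, iteration, world_size, group_size, seed=0):
    """Average gradients only within this iteration's group of workers."""
    # Deterministic shuffle of ranks: every worker seeds the same generator,
    # so all workers agree on the partition without extra communication.
    gen = torch.Generator().manual_seed(seed + iteration)
    perm = torch.randperm(world_size, generator=gen).tolist()

    # Divide the shuffled ranks into non-overlapping groups.
    groups = [perm[i:i + group_size] for i in range(0, world_size, group_size)]

    # torch.distributed requires every rank to create every group in the same
    # order; keep a handle only for the group this rank belongs to.
    my_rank = dist.get_rank()
    my_group = None
    for ranks in groups:
        pg = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group = pg

    # Synchronize gradients within the group only, avoiding the global barrier.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=my_group)
            p.grad.div_(group_size)
```

A production implementation would cache or pre-build the process groups rather than calling `new_group` every iteration, since communicator construction is costly; the sketch keeps the per-iteration call only to make the shuffle explicit.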