Recent advances in deep learning build on growing model sizes and the necessary scaling of compute power. Training such large-scale models requires an intricate combination of data, operator, and pipeline parallelism in complex distributed systems. We show how to use OneFlow's Split, Broadcast, and Partial Sum (SBP) tensor formulations to enable new distributed training methods with asymptotically optimal communication overheads. Using these insights, we develop AutoDDL, a distributed training framework that combines an exhaustive performance model with automated configuration search to find distributions with near-optimal communication overheads. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1\% and 10\% for Transformer and up to 17.7\% and 71.5\% for VGG on the two different systems, respectively.
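To make the SBP terminology concrete, the following minimal sketch (not taken from the paper) illustrates OneFlow's split, broadcast, and partial-sum annotations on a two-device placement; the tensor shapes and the two-GPU setup are assumptions chosen purely for illustration.

\begin{verbatim}
# Minimal sketch of OneFlow SBP annotations (illustrative only).
# Run with e.g.: python -m oneflow.distributed.launch --nproc_per_node 2 sbp_sketch.py
import oneflow as flow

P = flow.placement("cuda", ranks=[0, 1])  # two GPUs in one placement group

# Data parallelism: activations split along the batch dim, weights replicated.
x = flow.randn(4, 8).to_global(placement=P, sbp=flow.sbp.split(0))
w = flow.randn(8, 16).to_global(placement=P, sbp=flow.sbp.broadcast)
y = flow.matmul(x, w)        # output keeps sbp = split(0)

# Operator parallelism: splitting both operands along the contraction dim
# yields local partial products whose sbp is partial_sum, i.e. a reduction
# (extra communication) is needed before the next layer consumes them.
x2 = x.to_global(placement=P, sbp=flow.sbp.split(1))
w2 = w.to_global(placement=P, sbp=flow.sbp.split(0))
y2 = flow.matmul(x2, w2)     # sbp = partial_sum
\end{verbatim}

The choice of SBP signature per operator determines the communication pattern, which is the degree of freedom AutoDDL's performance model and configuration search optimize over.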