Transformer models have achieved state-of-the-art performance across a wide range of application domains and are gradually becoming the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs remains challenging due to the large number of parallelism choices. Existing DL systems either rely on manual effort to craft distributed training plans or apply parallelism combinations within a very limited search space. In this paper, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such a rarely seen, huge search space, we 1) use a decision tree to decompose and prune the space based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron automatically performs distributed training under different GPU memory budgets. Among all evaluated scenarios, Galvatron consistently achieves higher system throughput than previous work with limited parallelism.
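To make the layer-wise dynamic programming idea concrete, the following is a minimal sketch, not Galvatron's actual implementation: it assumes a small, already-pruned set of candidate hybrid-parallelism strategies (the strategy names, per-layer time costs, and memory figures are hypothetical placeholders) and selects one strategy per Transformer layer to minimize estimated iteration time under a GPU memory budget.

```python
# Minimal sketch of a layer-wise dynamic programming search over hybrid
# parallelism strategies. All strategies, costs, and memory numbers are
# illustrative assumptions, not values from the paper.

from typing import Dict, List, Tuple

# Candidate strategies remaining after decision-tree pruning (hypothetical):
# (name, estimated time per layer in ms, memory per layer in MB)
STRATEGIES: List[Tuple[str, float, int]] = [
    ("dp8",         12.0, 900),  # pure data parallelism
    ("tp4-dp2",      9.5, 650),  # tensor parallel within node, data parallel across
    ("pp2-tp2-dp2",  8.0, 500),  # pipeline + tensor + data parallelism
]

def search(num_layers: int, memory_budget_mb: int) -> Tuple[float, List[str]]:
    """Return (minimal total time, per-layer strategy plan) within the budget."""
    # dp maps "memory used so far" -> (best total time, chosen strategies)
    dp: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for _ in range(num_layers):
        nxt: Dict[int, Tuple[float, List[str]]] = {}
        for used, (time, plan) in dp.items():
            for name, cost, mem in STRATEGIES:
                new_used = used + mem
                if new_used > memory_budget_mb:
                    continue  # exceeds the GPU memory budget
                cand = (time + cost, plan + [name])
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        dp = nxt
        if not dp:
            return float("inf"), []  # no feasible plan under this budget
    best_used = min(dp, key=lambda m: dp[m][0])
    return dp[best_used]

if __name__ == "__main__":
    total_time, plan = search(num_layers=4, memory_budget_mb=3000)
    print(f"estimated time: {total_time} ms, plan: {plan}")
```

In this toy version, the cost and memory of each strategy are fixed per layer; the actual system would estimate them per layer and per device configuration, which is exactly what makes the decision-tree pruning and dynamic programming search worthwhile.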