Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization, so as to amortize their steep cost, is a challenging task requiring a careful balance of compute, memory, and network resources. Moreover, each model exposes a plethora of tuning knobs that drastically affect performance, with optimal values often depending on the underlying cluster's characteristics, which necessitates a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow for jointly studying the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with a case study of training a Transformer-1T model on a cluster with variable compute, memory, and network resources. Our case study demonstrates COMET's utility in identifying promising architectural optimization directions and guiding system designers in configuring key model and cluster parameters.