As deep learning models and input data scale at an unprecedented rate, moving to distributed training platforms has become essential to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack for distributed training, necessitating a modeling and simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure with the capability to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, enabling simulation of target systems at scale, and (iii) we enhance the memory system modeling to accurately capture in-network collective communication and disaggregated memory systems. With these capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.