With growing model sizes, deep neural networks (DNNs) are increasingly trained across massive fleets of GPU accelerators, which demands a proper parallelization plan that transforms a DNN model into fine-grained tasks and then schedules them onto GPUs for execution. Due to the large search space, contemporary parallelization plan generators often rely on empirical rules that couple transformation and scheduling, and fall short in exploring more flexible schedules that yield better memory usage and compute efficiency. This tension is exacerbated by emerging models of increasing structural complexity and size. SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans. It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data-dependency preserving. This principled approach decouples multiple seemingly intertwined factors and enables the composition of highly flexible parallelization plans. As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5× speedup over state-of-the-art solutions such as DeepSpeed, Megatron, and Alpa, for emerging DNN models like Swin-Transformer and AlphaFold2, as well as well-optimized models like GPT-3.
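The three-phase formulation above can be sketched in miniature. This is a hypothetical illustration of the decoupling idea, not the actual SuperScaler API: every function, class, and parameter name here is invented. Phase 1 splits model operators into fine-grained tasks, phase 2 assigns each task a (device, time-step) slot independently of how the split was done, and phase 3 inserts the communication needed to preserve data dependencies across devices.

```python
from dataclasses import dataclass

# All names below are illustrative assumptions, not SuperScaler's real interface.

@dataclass(frozen=True)
class Task:
    op: str      # operator this task came from (e.g. a sharded matmul)
    shard: int   # which fine-grained shard of that operator

# Phase 1: model transformation -- split each operator into fine-grained tasks.
def transform(model_ops, num_shards):
    return [Task(op, s) for op in model_ops for s in range(num_shards)]

# Phase 2: space-time scheduling -- map each task to a device (space) and a
# time step, chosen independently of the transformation above.
def schedule(tasks, num_devices):
    plan = {}
    for i, t in enumerate(tasks):
        plan[t] = (i % num_devices, i // num_devices)  # (device, time step)
    return plan

# Phase 3: data-dependency preserving -- add communication wherever shards of
# the same operator landed on different devices.
def add_dependencies(plan):
    devices_by_op = {}
    for t, (dev, _) in plan.items():
        devices_by_op.setdefault(t.op, set()).add(dev)
    return [(op, "all-reduce")
            for op, devs in devices_by_op.items() if len(devs) > 1]

tasks = transform(["matmul", "softmax"], num_shards=2)
plan = schedule(tasks, num_devices=2)
comms = add_dependencies(plan)
```

Because the schedule is computed separately from the transformation, swapping in a different `schedule` (e.g. placing all shards of one operator on the same device) changes memory and compute trade-offs without touching phases 1 or 3; this is the flexibility the decoupling buys.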