In recent years, the number of parameters in a single deep learning (DL) model has been growing much faster than GPU memory capacity. Practitioners without access to large numbers of GPUs resort to heterogeneous training systems that store model parameters in CPU memory. Existing heterogeneous systems are based on parallelization plans defined at the scope of the whole model: they apply one consistent parallel training method to all operators in the computation. Consequently, engineers must expend considerable effort to incorporate a new type of model parallelism and patch its compatibility with the other parallelisms. For example, Mixture-of-Experts (MoE) is still incompatible with ZeRO-3 in DeepSpeed. Moreover, current systems suffer efficiency problems at small scale, since they are designed and tuned for large-scale training. In this paper, we propose Elixir, a new parallel heterogeneous training system designed for both efficiency and flexibility. Elixir utilizes the memory and computing resources of both the GPU and the CPU. For flexibility, Elixir generates parallelization plans at the granularity of individual operators, so any new type of model parallelism can be incorporated by assigning a parallel pattern to an operator. For efficiency, Elixir implements a hierarchical distributed memory management scheme that accelerates inter-GPU communication and CPU-GPU data transfers. As a result, Elixir can train a 30B OPT model on a single A100 with 40GB of CUDA memory while reaching 84% of the efficiency of PyTorch GPU training. Thanks to its super-linear scalability, its training efficiency matches that of PyTorch GPU training on multiple GPUs. In addition, large MoE models can be trained 5.3x faster than dense models of the same size. Elixir is now integrated into ColossalAI and is available on its main branch.