Large-scale deep learning models contribute to significant performance improvements on a wide variety of downstream tasks. Current data and model parallelism approaches utilize model replication and partitioning techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP makes trade-offs between memory consumption and hardware utilization, automatically generating the distributed computation graph that maximizes the overall system throughput. In addition, OSDP introduces operator splitting to further reduce peak memory footprints during training with negligible overhead, which enables the training of larger models as well as higher throughput. Extensive experiments on multiple kinds of large-scale models demonstrate that OSDP outperforms the state-of-the-art in multiple regards. Our code is available at https://github.com/Youhe-Jiang/OptimalShardedDataParallel.
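To make the memory/throughput trade-off described above concrete, the following is a minimal, hypothetical sketch (not the actual OSDP interface): each operator can either be replicated (higher memory, no extra communication) or sharded (lower memory, extra collective-communication cost), and a planner searches for the per-operator decisions that minimize estimated step time under a device memory budget. All names (`Op`, `plan`, `shard_factor`, the cost numbers) are illustrative assumptions.

```python
# Hypothetical illustration of a shard-vs-replicate planner; not the OSDP API.
from dataclasses import dataclass
from itertools import product

@dataclass
class Op:
    name: str
    param_mem_gb: float         # parameter + optimizer-state memory if replicated
    compute_ms: float           # forward + backward compute time
    comm_ms_if_sharded: float   # extra all-gather / reduce-scatter time if sharded

def plan(ops, mem_budget_gb, shard_factor=8):
    """Brute-force search over shard/replicate decisions (illustrative only)."""
    best = None
    for choice in product([False, True], repeat=len(ops)):
        # Sharding divides an operator's state across `shard_factor` devices.
        mem = sum(op.param_mem_gb / (shard_factor if sharded else 1)
                  for op, sharded in zip(ops, choice))
        if mem > mem_budget_gb:
            continue  # violates the device memory constraint
        step_ms = sum(op.compute_ms + (op.comm_ms_if_sharded if sharded else 0.0)
                      for op, sharded in zip(ops, choice))
        if best is None or step_ms < best[0]:
            best = (step_ms, {op.name: sharded for op, sharded in zip(ops, choice)})
    return best

# Toy model description and device memory budget (made-up numbers).
ops = [
    Op("embedding", param_mem_gb=6.0, compute_ms=2.0, comm_ms_if_sharded=1.5),
    Op("attention", param_mem_gb=4.0, compute_ms=5.0, comm_ms_if_sharded=1.0),
    Op("mlp",       param_mem_gb=8.0, compute_ms=6.0, comm_ms_if_sharded=2.0),
]
print(plan(ops, mem_budget_gb=10.0))
```

In this toy setting, operators are only sharded when replication would exceed the memory budget, mirroring the trade-off the system navigates automatically at a much larger scale.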