Large-scale deep learning models contribute to significant performance improvements on a wide variety of downstream tasks. Current data and model parallelism approaches utilize model replication and partitioning techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP makes trade-offs between memory consumption and hardware utilization, automatically generating the distributed computation graph that maximizes the overall system throughput. In addition, OSDP introduces operator splitting to further reduce peak memory footprints during training with negligible overhead, which enables the training of larger models as well as higher throughput. Extensive experiments on multiple kinds of large-scale models demonstrate that OSDP outperforms the state-of-the-art in multiple regards. Our code is available at https://github.com/Youhe-Jiang/OptimalShardedDataParallel.
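To make the memory/throughput trade-off described above concrete, the following is a minimal, hypothetical sketch (not the actual OSDP interface): each operator can either be replicated (higher memory, no extra communication) or sharded (lower memory, extra collective-communication cost), and a planner searches for the per-operator decisions that minimize estimated step time under a device memory budget. All names (`Op`, `plan`, `shard_factor`, the cost numbers) are illustrative assumptions.

```python
# Hypothetical illustration of a shard-vs-replicate planner; not the OSDP API.
from dataclasses import dataclass
from itertools import product

@dataclass
class Op:
    name: str
    param_mem_gb: float         # parameter + optimizer-state memory if replicated
    compute_ms: float           # forward + backward compute time
    comm_ms_if_sharded: float   # extra all-gather / reduce-scatter time if sharded

def plan(ops, mem_budget_gb, shard_factor=8):
    """Brute-force search over shard/replicate decisions (illustrative only)."""
    best = None
    for choice in product([False, True], repeat=len(ops)):
        # Sharding divides an operator's state across `shard_factor` devices.
        mem = sum(op.param_mem_gb / (shard_factor if sharded else 1)
                  for op, sharded in zip(ops, choice))
        if mem > mem_budget_gb:
            continue  # violates the device memory constraint
        step_ms = sum(op.compute_ms + (op.comm_ms_if_sharded if sharded else 0.0)
                      for op, sharded in zip(ops, choice))
        if best is None or step_ms < best[0]:
            best = (step_ms, {op.name: sharded for op, sharded in zip(ops, choice)})
    return best

# Toy model description and device memory budget (made-up numbers).
ops = [
    Op("embedding", param_mem_gb=6.0, compute_ms=2.0, comm_ms_if_sharded=1.5),
    Op("attention", param_mem_gb=4.0, compute_ms=5.0, comm_ms_if_sharded=1.0),
    Op("mlp",       param_mem_gb=8.0, compute_ms=6.0, comm_ms_if_sharded=2.0),
]
print(plan(ops, mem_budget_gb=10.0))
```

In this toy setting, operators are only sharded when replication would exceed the memory budget, mirroring the trade-off the system navigates automatically at a much larger scale.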