Decentralized Training of Foundation Models in Heterogeneous Environments

Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).
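To make the idea of searching for a good allocation concrete, the following is a minimal sketch of an evolutionary search over device orderings. It is not the paper's cost model or algorithm: it assumes a toy setting with one device per pipeline stage, a hypothetical pairwise `delay` matrix, and a cost equal to the summed delay between consecutive stages. All function names here are illustrative.

```python
import random


def pipeline_cost(order, delay):
    """Toy cost (an assumption): summed delay between consecutive pipeline stages."""
    return sum(delay[a][b] for a, b in zip(order, order[1:]))


def mutate(order, rng):
    """Return a neighbouring allocation by swapping two stage positions."""
    child = list(order)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(delay, pop_size=20, generations=200, seed=0):
    """Elitist evolutionary search over permutations of device indices."""
    rng = random.Random(seed)
    n = len(delay)
    # Start from random allocations of devices to stages.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        # Keep the cheapest allocations among parents and children.
        population = sorted(
            population + children, key=lambda o: pipeline_cost(o, delay)
        )[:pop_size]
    best = population[0]
    return best, pipeline_cost(best, delay)


def toy_delays():
    """Hypothetical network: devices 0-2 share one region, 3-5 another."""
    region = [0, 0, 0, 1, 1, 1]
    return [
        [0 if i == j else (1 if region[i] == region[j] else 10) for j in range(6)]
        for i in range(6)
    ]


if __name__ == "__main__":
    best_order, best_cost = evolve(toy_delays())
    print(best_order, best_cost)
```

In this toy network, any chain through all six devices must cross regions at least once, so an allocation is good when it keeps same-region devices in adjacent stages. A realistic cost model would also need to account for data-parallel gradient synchronization and per-device compute, which this sketch leaves out.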
