Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters that feature fast, homogeneous interconnects and run carefully designed software systems supporting both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much larger pool of decentralized, heterogeneous compute connected by lower-bandwidth links? Previous work on the heterogeneous, decentralized setting focuses on relatively small models that can be trained in a purely data-parallel manner. State-of-the-art schemes for model-parallel foundation model training, such as Megatron, consider only the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates the different computational "tasklets" in foundation model training to a group of decentralized GPU devices connected by a slow, heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments covering different scenarios for learning over geo-distributed devices, simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8× faster than prior state-of-the-art training systems (Megatron).
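To make the scheduling idea concrete, the following is a minimal sketch of an evolutionary search over device-to-stage assignments. It is an illustration only and not the paper's cost model or algorithm: the function names `comm_cost` and `evolve`, the swap mutation, the "slowest link" cost terms, and the bandwidth numbers are all assumptions chosen for brevity.

```python
import random
from itertools import combinations


def comm_cost(perm, bandwidth, stage_size, grad_bytes, act_bytes):
    """Toy communication cost in seconds (an assumed model, not the paper's).

    perm is an ordering of device ids; consecutive chunks of stage_size
    devices form one pipeline stage.
    """
    stages = [perm[i:i + stage_size] for i in range(0, len(perm), stage_size)]
    # Data-parallel term: gradient sync is limited by the slowest link
    # between any two devices that share a stage.
    dp = 0.0
    for stage in stages:
        for a, b in combinations(stage, 2):
            dp = max(dp, grad_bytes / bandwidth[a][b])
    # Pipeline term: devices are paired position-wise across neighbouring
    # stages; the slowest such link bounds the activation transfer.
    pp = 0.0
    for left, right in zip(stages, stages[1:]):
        for a, b in zip(left, right):
            pp = max(pp, act_bytes / bandwidth[a][b])
    return dp + pp


def evolve(bandwidth, stage_size, grad_bytes=1e9, act_bytes=1e7,
           pop_size=20, generations=200, seed=0):
    """Elitist evolutionary search: keep the best half, mutate by swapping."""
    rng = random.Random(seed)
    n = len(bandwidth)  # assumed to be divisible by stage_size
    identity = list(range(n))
    population = [identity] + [rng.sample(identity, n)
                               for _ in range(pop_size - 1)]

    def score(p):
        return comm_cost(p, bandwidth, stage_size, grad_bytes, act_bytes)

    for _ in range(generations):
        population.sort(key=score)
        survivors = population[:pop_size // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(n), 2)  # swap two devices
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    return min(population, key=score)


# Hypothetical setup: four devices in two regions, with fast links inside a
# region and slow links across regions (bytes per second, made-up values).
region = [0, 1, 0, 1]
bw = [[1e10 if region[a] == region[b] else 1e8 for b in range(4)]
      for a in range(4)]
best = evolve(bw, stage_size=2)
print(best, comm_cost(best, bw, 2, 1e9, 1e7))
```

Because the best half of the population always survives, the returned assignment is never worse than the naive ordering the search starts from. In this toy setup the heavy gradient traffic is cheapest when each stage stays inside one region, leaving only the lighter activation traffic to cross the slow links.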