In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. However, training such models requires a range of techniques to work around the limited compute and memory of devices such as GPUs. Commonly used techniques include pipeline parallelism, tensor parallelism, and activation checkpointing. While existing work has focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint scheduling (Herrmann et al. 2019; Beaumont et al. 2021), no method has been proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate memory and computation overhead estimation, which is often time-consuming and misleading. Existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor. To address these challenges, we introduce a system that jointly optimizes distributed execution and gradient checkpointing plans. Additionally, we provide an easy-to-use symbolic profiler that generates memory and computation statistics for any PyTorch model at minimal time cost. Our approach allows users to parallelize model training on the given hardware with minimal code changes. The source code is publicly available as part of Colossal-AI at https://github.com/hpcaitech/ColossalAI.
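To illustrate the idea of profiling without physical execution, the following is a minimal sketch (not the paper's symbolic profiler) that estimates per-layer activation memory using PyTorch "meta" tensors, which carry shapes and dtypes but allocate no storage; the model, shapes, and hook logic here are purely illustrative assumptions.

```python
# Minimal sketch: shape-only "symbolic" profiling with PyTorch meta tensors.
# No real GPU memory is allocated and no real compute is performed.
import torch
import torch.nn as nn

# Hypothetical toy model; any module that supports the meta device would do.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to("meta")
x = torch.empty(8, 1024, device="meta")  # placeholder batch, no data allocated

stats = []

def record_activation(mod, inp, out):
    # Outputs live on the meta device: numel() * element_size() gives the
    # activation footprint this layer would produce on a real device.
    stats.append((type(mod).__name__, out.numel() * out.element_size()))

for layer in model:
    layer.register_forward_hook(record_activation)

model(x)  # only shape/dtype propagation happens here

for name, nbytes in stats:
    print(f"{name}: ~{nbytes / 1024:.1f} KiB of activations")
```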