In recent years, large-scale models have demonstrated state-of-the-art performance across various domains. However, training such models requires a range of techniques to work around the limited compute and memory of devices such as GPUs. Commonly used techniques include pipeline parallelism, tensor parallelism, and activation checkpointing. While existing work has focused on finding efficient distributed execution plans (Zheng et al. 2022) and activation checkpoint scheduling (Herrmann et al. 2019; Beaumont et al. 2021), no method has been proposed to optimize these two plans jointly. Moreover, ahead-of-time compilation relies heavily on accurate memory and computation overhead estimation, which is often time-consuming and misleading. Existing training systems and machine learning pipelines either physically execute each operator or estimate memory usage with a scaled input tensor. To address these challenges, we introduce a system that jointly optimizes distributed execution and gradient checkpointing plans. Additionally, we provide an easy-to-use symbolic profiler that generates memory and computation statistics for any PyTorch model at minimal time cost. Our approach allows users to parallelize model training on the given hardware with minimal code changes. The source code is publicly available as part of Colossal-AI at https://github.com/hpcaitech/ColossalAI.
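To illustrate the idea of profiling without physical execution, the following is a minimal sketch (not the paper's symbolic profiler) that estimates per-layer activation memory using PyTorch "meta" tensors, which carry shapes and dtypes but allocate no storage; the model, shapes, and hook logic here are purely illustrative assumptions.

```python
# Minimal sketch: shape-only "symbolic" profiling with PyTorch meta tensors.
# No real GPU memory is allocated and no real compute is performed.
import torch
import torch.nn as nn

# Hypothetical toy model; any module that supports the meta device would do.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to("meta")
x = torch.empty(8, 1024, device="meta")  # placeholder batch, no data allocated

stats = []

def record_activation(mod, inp, out):
    # Outputs live on the meta device: numel() * element_size() gives the
    # activation footprint this layer would produce on a real device.
    stats.append((type(mod).__name__, out.numel() * out.element_size()))

for layer in model:
    layer.register_forward_hook(record_activation)

model(x)  # only shape/dtype propagation happens here

for name, nbytes in stats:
    print(f"{name}: ~{nbytes / 1024:.1f} KiB of activations")
```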