Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce the training memory footprint. This paper presents GACT, an ACT framework that supports a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge of operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss.
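For readers unfamiliar with the ACT idea, the following is a minimal conceptual sketch, not the GACT API: activations saved for the backward pass are compressed on the fly and decompressed only when backward needs them, yielding an approximate gradient. The sketch uses PyTorch's `torch.autograd.graph.saved_tensors_hooks`, with a simple fp32-to-fp16 cast standing in for GACT's adaptive per-tensor quantization; the model, shapes, and hook names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pack(t):
    # Compress a tensor as autograd saves it for backward:
    # cast fp32 tensors to fp16, leave everything else untouched.
    # (A fuller ACT implementation would skip parameters and use
    # adaptive low-bit quantization instead of a plain cast.)
    if t.dtype == torch.float32:
        return ("fp16", t.to(torch.float16))
    return ("raw", t)

def unpack(packed):
    # Decompress when the backward pass needs the tensor again.
    tag, t = packed
    return t.to(torch.float32) if tag == "fp16" else t

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(64, 512)

# Every tensor autograd saves inside this context is stored compressed,
# so the resulting gradient is an approximation: the core ACT trade-off
# of memory savings against gradient accuracy.
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).sum()
loss.backward()
```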