Larger deep learning models usually lead to higher model quality, at the cost of an ever-increasing GPU memory footprint. Although tensor checkpointing techniques have been proposed to enable training under a restricted GPU memory budget, the dynamics of input tensors remain unexploited for optimizing performance while reducing the GPU memory footprint. Specifically, due to diverse datasets and subsequent data augmentation, the input tensor size per mini-batch varies during training, leading to a changing GPU memory footprint. However, leveraging such input tensor dynamics in checkpointing poses two challenges. First, the checkpointing plan needs to be determined at runtime because of the input tensor dynamics. Second, the checkpointing plan needs to be applied on the fly without significantly degrading training performance. In this paper, we propose Mimose, an input-aware tensor checkpointing planner that respects the GPU memory budget while enabling efficient model training. Mimose builds a lightweight yet accurate prediction model of GPU memory usage online, without pre-analyzing the model. It generates a tensor checkpointing plan based on per-layer memory predictions and applies it to the training process on the fly. It also adopts a caching strategy to avoid regenerating the plan for repeated input sizes. Our experiments show that Mimose achieves superior training throughput compared to state-of-the-art memory planners under the same GPU memory budgets.
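To make the described workflow concrete, the following is a minimal, hypothetical sketch (not Mimose's actual implementation; names such as MemoryAwareModel, plan_cache, and the linear per-layer memory model are illustrative assumptions). It shows how a checkpointing plan could be derived from per-layer memory estimates that depend on the current input size, applied via PyTorch's torch.utils.checkpoint, and cached so repeated input sizes reuse the same plan.

```python
# Hypothetical sketch: per-layer memory estimates -> greedy checkpointing plan -> plan cache.
# Names and the linear memory model are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class MemoryAwareModel(nn.Module):
    def __init__(self, layers, memory_budget_bytes):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.budget = memory_budget_bytes
        self.plan_cache = {}  # input size -> set of layer indices to checkpoint
        # Assumed linear per-layer model: mem(layer, n) ~= coeff[layer] * n + bias[layer],
        # fitted online from observed GPU memory usage (fitting omitted here).
        self.coeff = [1.0] * len(layers)
        self.bias = [0.0] * len(layers)

    def _make_plan(self, n):
        """Greedily checkpoint the most memory-hungry layers until the estimate fits the budget."""
        est = [self.coeff[i] * n + self.bias[i] for i in range(len(self.layers))]
        plan, total = set(), sum(est)
        for i in sorted(range(len(est)), key=lambda i: est[i], reverse=True):
            if total <= self.budget:
                break
            total -= est[i]  # checkpointed activations are recomputed in backward, not stored
            plan.add(i)
        return plan

    def forward(self, x):
        n = x.shape[1]  # e.g. the sequence length of this mini-batch
        if n not in self.plan_cache:  # reuse the cached plan for repeated input sizes
            self.plan_cache[n] = self._make_plan(n)
        plan = self.plan_cache[n]
        for i, layer in enumerate(self.layers):
            if i in plan and self.training:
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```

In this sketch the plan is recomputed only when a new input size appears, mirroring the caching strategy described above; the greedy selection stands in for whatever planning policy the runtime planner actually uses.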