DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation

The further development of deep neural networks is hampered by limited GPU memory, so optimizing GPU memory usage is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as this is an emerging domain, several challenges remain: 1) The efficiency of recomputation is limited for both static and dynamic methods. 2) Swapping requires offloading parameters manually, which incurs a large time cost. 3) No existing dynamic, fine-grained method combines tensor swapping with tensor recomputation. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler on the combination of tensor swapping and tensor recomputation without user oversight. In DELTA, we propose a filter algorithm to select the optimal tensors to be released from GPU memory and present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to offset the time cost caused by swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method by a large margin, but also achieves convergence results comparable to the baseline with an acceptable time delay. DELTA also reaches 2.04$\times$ the maximum batch size of the baseline when training ResNet-50 and 2.25$\times$ when training ResNet-101. In addition, comparisons between the swapping cost and the recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler over tensor swapping and tensor recomputation, which refutes the argument in some related work that swapping should be the first and best choice.
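
The abstract names a filter step (which tensors to release) and a director step (swap or recompute each one) but gives no details, so the following is only a minimal illustrative sketch of that two-stage idea, not DELTA's actual algorithms. Everything in it is an assumption made for illustration: the `TensorInfo` fields, the `size * idle time` ranking heuristic, the `LINK_BYTES_PER_MS` bandwidth constant, and the rule that only the part of a swap not hidden by idle time counts as cost.

```python
from dataclasses import dataclass

# Hypothetical host<->GPU bandwidth (12 GB/s expressed in bytes per millisecond).
LINK_BYTES_PER_MS = 12_000_000.0


@dataclass
class TensorInfo:
    name: str
    size_bytes: int      # memory freed if the tensor is released
    recompute_ms: float  # estimated time to recompute it from its inputs
    idle_ms: float       # estimated time until the tensor is needed again


def swap_round_trip_ms(t: TensorInfo) -> float:
    """Estimated time to offload the tensor and later prefetch it back."""
    return 2 * t.size_bytes / LINK_BYTES_PER_MS


def filter_candidates(tensors, bytes_needed):
    """Filter step (assumed heuristic): greedily pick large, long-idle tensors
    until the requested amount of memory would be freed."""
    ranked = sorted(tensors, key=lambda t: t.size_bytes * t.idle_ms, reverse=True)
    chosen, freed = [], 0
    for t in ranked:
        if freed >= bytes_needed:
            break
        chosen.append(t)
        freed += t.size_bytes
    return chosen


def direct(t: TensorInfo) -> str:
    """Director step (assumed rule): transfer time overlapped with the idle
    window is treated as free; swap only if the exposed remainder is no more
    expensive than recomputing."""
    exposed_swap_ms = max(0.0, swap_round_trip_ms(t) - t.idle_ms)
    return "swap" if exposed_swap_ms <= t.recompute_ms else "recompute"


def plan_release(tensors, bytes_needed):
    """Map each tensor selected by the filter to the action chosen by the director."""
    return {t.name: direct(t) for t in filter_candidates(tensors, bytes_needed)}


if __name__ == "__main__":
    live = [
        TensorInfo("conv1_out", 96_000_000, recompute_ms=4.0, idle_ms=30.0),
        TensorInfo("relu_out", 48_000_000, recompute_ms=0.5, idle_ms=2.0),
        TensorInfo("fc_out", 1_000_000, recompute_ms=1.0, idle_ms=1.0),
    ]
    print(plan_release(live, bytes_needed=120_000_000))
```

In this toy setup the large, long-idle activation is swapped because its transfer fits inside the idle window, while the cheap-to-recompute activation is recomputed. That mirrors the abstract's point that neither action is always the better choice.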
