Checkpointing enables training deep learning models under restricted memory budgets by freeing intermediate activations from memory and recomputing them on demand. Previous checkpointing techniques statically plan these recomputations offline and assume static computation graphs. We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and general, is parameterized by eviction policy, and supports dynamic models. We prove that DTR can train an $N$-layer linear feedforward network on an $\Omega(\sqrt{N})$ memory budget with only $\mathcal{O}(N)$ tensor operations. DTR closely matches the performance of optimal static checkpointing in simulated experiments. We incorporate a DTR prototype into PyTorch just by interposing on tensor allocations and operator calls and collecting lightweight metadata on tensors.
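The mechanism the abstract describes, interposing on tensor allocations and operator calls, keeping lightweight per-tensor metadata, evicting greedily under some eviction policy, and rematerializing evicted tensors on demand, can be illustrated with a minimal sketch. This is not the authors' PyTorch prototype: the class and method names, the metadata fields, and the cost / (staleness × size) heuristic below are illustrative assumptions standing in for whichever eviction policy DTR is parameterized with.

```python
# Minimal sketch of a greedy online rematerializing runtime (illustrative only).
import time
from dataclasses import dataclass, field

@dataclass
class TensorMeta:
    compute: object            # zero-argument closure that recomputes the tensor from its parents
    cost: float                # measured time spent computing this tensor
    size: int                  # bytes occupied while resident in memory
    last_access: float = field(default_factory=time.monotonic)
    value: object = None       # the resident buffer, or None once evicted

class RematerializingRuntime:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.tensors = []      # every tensor the runtime has interposed on

    def register(self, compute, cost, size, value):
        # Interposed allocation: record lightweight metadata and make room under the budget.
        self._evict_until(size)
        t = TensorMeta(compute=compute, cost=cost, size=size, value=value)
        self.tensors.append(t)
        self.used += size
        return t

    def _score(self, t, now):
        # One possible greedy eviction policy (an assumption, not DTR's exact heuristic):
        # prefer evicting tensors that are cheap to recompute, large, and stale.
        staleness = max(now - t.last_access, 1e-9)
        return t.cost / (staleness * t.size)

    def _evict_until(self, needed):
        while self.used + needed > self.budget:
            now = time.monotonic()
            resident = [t for t in self.tensors if t.value is not None]
            if not resident:
                raise MemoryError("budget too small to hold the requested tensor")
            victim = min(resident, key=lambda t: self._score(t, now))
            victim.value = None
            self.used -= victim.size

    def access(self, t):
        # Interposed operator input: rematerialize on demand if the tensor was evicted.
        if t.value is None:
            self._evict_until(t.size)
            t.value = t.compute()   # may recursively access (and rematerialize) parents
            self.used += t.size
        t.last_access = time.monotonic()
        return t.value
```

In this sketch the eviction policy is isolated in `_score`, mirroring the claim that DTR is parameterized by eviction policy: swapping in a different heuristic changes the checkpointing behavior without touching the interposition logic.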