As deep learning models grow ever larger, training them takes longer and consumes more resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and incurs non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by a failure and exploits the replicas of the model state that data parallelism already maintains for failure recovery. When such replicas are unavailable, we propose a logging-based approach that records intermediate data and replays the computation to recover the lost state after a failure. The re-computation is distributed across multiple machines to further accelerate recovery. We also log intermediate data selectively, exploring the trade-off between recovery time and the storage overhead of the logged data. Evaluations show that SWIFT significantly reduces failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods, without degrading final model accuracy. SWIFT also achieves up to a 1.16x speedup in total training time over state-of-the-art methods.
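To make the logging-and-replay idea concrete, the following is a minimal sketch in PyTorch-style Python. It is not the authors' implementation; the names (`IntermediateLog`, `train_step`, `recover_stage`, `log_every`) and the placeholder loss are hypothetical, and it only illustrates the general pattern of selectively logging a worker's intermediate inputs during failure-free training and replaying the computation on a replacement worker to rebuild the lost parameters and optimizer state.

```python
# Conceptual sketch only (hypothetical names, not SWIFT's actual implementation).
# During normal training, a worker logs the intermediate tensors it consumes;
# after a failure, a replacement worker replays the forward/backward computation
# from the logged inputs instead of rolling the whole job back to a checkpoint.
import torch
from collections import deque


class IntermediateLog:
    """Bounded buffer of (iteration, tensor) records kept since the last point
    at which the lost state is otherwise recoverable (e.g., a replica)."""
    def __init__(self, max_entries: int):
        self.entries = deque(maxlen=max_entries)

    def append(self, iteration: int, tensor: torch.Tensor):
        self.entries.append((iteration, tensor.detach().clone()))

    def records_since(self, iteration: int):
        return [(it, t) for it, t in self.entries if it >= iteration]


def train_step(module, optimizer, inputs, log, iteration, log_every=1):
    """One training step that selectively logs its inputs.
    `log_every` controls the recovery-time vs. storage trade-off."""
    if iteration % log_every == 0:
        log.append(iteration, inputs)
    loss = module(inputs).sum()      # placeholder loss for the sketch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss


def recover(module, optimizer, log, last_safe_iteration):
    """Replay logged computation on a replacement worker. Assumes `module` and
    `optimizer` are initialized to their state at `last_safe_iteration`."""
    for _, inputs in log.records_since(last_safe_iteration):
        module(inputs).sum().backward()
        optimizer.step()
        optimizer.zero_grad()
```

In this sketch, a larger `log_every` stores fewer intermediate tensors but forces more re-computation during recovery, which is the trade-off the abstract refers to; distributing the replay loop across several machines would further shorten recovery.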