As deep learning models grow ever larger, training them takes longer and consumes more resources, making fault tolerance increasingly critical. Existing state-of-the-art methods such as CheckFreq and Elastic Horovod must back up a copy of the model state (i.e., parameters and optimizer states) in memory, which is costly for large models and incurs non-trivial overhead. This paper presents SWIFT, a novel recovery design for distributed deep neural network training that significantly reduces failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, SWIFT resolves the inconsistencies in the model state caused by a failure and exploits the replicas of the model state that data parallelism already maintains for failure recovery. When such replicas are unavailable, we propose a logging-based approach that records intermediate data and replays the computation to recover the lost state after a failure. The re-computation is distributed across multiple machines to further accelerate recovery. We also log intermediate data selectively, exploring the trade-off between recovery time and the storage overhead of the logged data. Evaluations show that SWIFT significantly reduces failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods, without degrading final model accuracy. SWIFT also achieves up to a 1.16x speedup in total training time over state-of-the-art methods.
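To make the logging-and-replay idea concrete, the following is a minimal sketch in PyTorch-style Python. It is not the authors' implementation; the names (`IntermediateLog`, `train_step`, `recover_stage`, `log_every`) and the placeholder loss are hypothetical, and it only illustrates the general pattern of selectively logging a worker's intermediate inputs during failure-free training and replaying the computation on a replacement worker to rebuild the lost parameters and optimizer state.

```python
# Conceptual sketch only (hypothetical names, not SWIFT's actual implementation).
# During normal training, a worker logs the intermediate tensors it consumes;
# after a failure, a replacement worker replays the forward/backward computation
# from the logged inputs instead of rolling the whole job back to a checkpoint.
import torch
from collections import deque


class IntermediateLog:
    """Bounded buffer of (iteration, tensor) records kept since the last point
    at which the lost state is otherwise recoverable (e.g., a replica)."""
    def __init__(self, max_entries: int):
        self.entries = deque(maxlen=max_entries)

    def append(self, iteration: int, tensor: torch.Tensor):
        self.entries.append((iteration, tensor.detach().clone()))

    def records_since(self, iteration: int):
        return [(it, t) for it, t in self.entries if it >= iteration]


def train_step(module, optimizer, inputs, log, iteration, log_every=1):
    """One training step that selectively logs its inputs.
    `log_every` controls the recovery-time vs. storage trade-off."""
    if iteration % log_every == 0:
        log.append(iteration, inputs)
    loss = module(inputs).sum()      # placeholder loss for the sketch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss


def recover(module, optimizer, log, last_safe_iteration):
    """Replay logged computation on a replacement worker. Assumes `module` and
    `optimizer` are initialized to their state at `last_safe_iteration`."""
    for _, inputs in log.records_since(last_safe_iteration):
        module(inputs).sum().backward()
        optimizer.step()
        optimizer.zero_grad()
```

In this sketch, a larger `log_every` stores fewer intermediate tensors but forces more re-computation during recovery, which is the trade-off the abstract refers to; distributing the replay loop across several machines would further shorten recovery.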