Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation reveals critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC, as well as takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.