Checkpoint/restart (C/R) provides fault tolerance, enables long-running applications, and gives computing centers the scheduling flexibility to support diverse workloads with different priorities. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et al., is a transparent checkpointing tool that has been selected for its MPI-agnostic and network-agnostic approach. However, because it was originally written as proof-of-concept code, MANA was not ready to use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs exposed by the top applications at NERSC, adding new features to address system changes, and evaluating C/R overhead at scale. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers, and HPC end users alike.