HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes. Mathematical solvers, in particular iterative linear solvers, are key building blocks in numerous large-scale scientific applications. Consequently, supporting the recovery of distributed solvers is necessary for scaling scientific applications to exascale platforms. Previous recovery methods for iterative solvers are based either on Checkpoint-Restart (CR), which incurs high fault tolerance overhead, or on intrinsic fault tolerance, which requires extra computation time to converge after failures. Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-running computations. ESR has been shown to reconstruct the computation state exactly while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundant copies in the RAM of multiple nodes, incurring high memory and network overheads. Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience. Our mechanism, called in-NVRAM ESR, is based on a novel implementation of MPI One-Sided Communication (OSC) over RDMA, and provides full resilience while significantly reducing both the memory footprint and the time overhead in comparison with the original ESR design (in-RAM ESR).