Near-zero Downtime Recovery from Transient-error-induced Crashes

As systems scale, transient errors caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming extreme-scale high-performance computing (HPC) systems. However, because such errors remain rare compared with fault-free execution, desirable solutions are low- or no-overhead: they must not compromise performance under fault-free conditions, yet they must recover from faults quickly enough to minimize downtime. In this paper, we present IterPro, a lightweight compiler-assisted resilience technique that quickly and accurately recovers processes from transient-error-induced crashes. IterPro repairs corrupted process state on the fly when an error occurs, enabling applications to continue executing instead of being terminated. IterPro also exploits side effects introduced by induction-variable-based code optimization techniques to improve its recovery capability. To this end, two new code transformation passes are introduced to expose these side effects for resilience purposes. We evaluated IterPro with 4 scientific workloads as well as the NPB benchmark suite. During normal execution, IterPro incurs almost zero runtime overhead and a small, fixed 27MB memory overhead. Meanwhile, IterPro recovers on average 83.55% of crash-causing errors within dozens of milliseconds, with negligible downtime. With such an effective recovery mechanism, IterPro could substantially reduce the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.
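The idea of repairing a corrupted induction variable from a redundant one can be illustrated with a toy sketch. This is not IterPro's implementation, which the abstract describes only at a high level; it assumes a loop whose index `i` has a strength-reduced companion `ptr = base + stride * i`, and it uses Python's `IndexError` as a stand-in for a crash such as a segmentation fault. All names (`recover_index`, `sum_with_recovery`, `flip_at`, `flip_bit`) are hypothetical.

```python
def recover_index(ptr, base, stride):
    """Recompute the loop index i from the redundant variable ptr = base + stride * i."""
    offset = ptr - base
    if offset < 0 or offset % stride != 0:
        return None  # the redundant copy looks corrupted too, so give up
    return offset // stride


def sum_with_recovery(data, base=0x1000, stride=8, flip_at=3, flip_bit=40):
    """Sum data while simulating a single bit flip in the loop index i."""
    total = 0
    i = 0
    ptr = base          # redundant induction variable, kept equal to base + stride * i
    recovered = 0
    injected = False
    while i < len(data):
        if not injected and i == flip_at:
            i ^= 1 << flip_bit      # simulated transient error in i
            injected = True
        try:
            value = data[i]         # IndexError stands in for a crash on a bad access
        except IndexError:
            fixed = recover_index(ptr, base, stride)
            if fixed is None:
                raise               # unrecoverable in this toy model
            i = fixed               # repair the corrupted state in place
            recovered += 1
            continue                # retry the faulting access
        total += value
        i += 1
        ptr += stride
    return total, recovered


if __name__ == "__main__":
    print(sum_with_recovery(list(range(10))))
```

In this sketch the loop finishes with the correct sum after one repair, because the uncorrupted `ptr` carries enough information to rebuild `i`. A real system would have to identify which register or memory location was corrupted and which redundant value is trustworthy, which this toy model simply assumes.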
