Scientific applications have long embraced the Message Passing Interface (MPI) as the environment of choice for executing on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and to enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to close the gap between current practice and the ideal, more asynchronous recovery model, in which the fault tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).