As HPC machines grow in size, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this limitation is becoming increasingly constraining. The introduction of ULFM (User-Level Failure Mitigation) gave MPI a way to survive a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library exposes resiliency features to the application transparently, thus removing any integration effort. Upon a fault, the failed nodes are discarded and execution continues with only the non-failed ones. We also propose a hierarchical implementation of the solution to reduce the overhead of the repair process when scaling towards a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and does not limit the scalability properties of MPI. Moreover, we integrated the solution into real-world applications and injected faults to further demonstrate its robustness.