The increasing size of HPC architectures makes faults an ever more frequent occurrence. This issue is especially relevant since MPI, the de facto standard for inter-process communication, lacks proper fault-management functionality. Past efforts produced extensions to the MPI standard that enable fault management, including ULFM. While ULFM provides powerful tools to handle faults, it still faces limitations such as the collectiveness of its repair procedure. In this paper, we overcome those limitations and achieve fault-aware, non-collective communicator creation and repair. We integrate our solution into an existing fault-resiliency framework and measure the overhead introduced in the application code. Our experimental campaign shows that the solution is scalable and introduces limited overhead, and that non-collective repair is a viable option for ULFM-based applications.