The growing size of HPC architectures makes faults an increasingly frequent occurrence. This is especially relevant because MPI, the de facto standard for inter-process communication, lacks proper fault management functionality. Past efforts produced extensions to the MPI standard that enable fault management, the most important being ULFM (User-Level Failure Mitigation). In this paper, we introduce support for non-collective communicator creation (MPI_Comm_create_group) in ULFM to improve its fault management capabilities. We integrate our solution into the Legio library and measure the overhead it introduces in the application. The proposed solution removes the possibility of the execution deadlocking after a fault and can serve as inspiration for further improvements to ULFM's repair capabilities.