MPI is the de facto standard for parallel computation on a cluster of computers. Yet resilience for MPI remains an issue for large-scale computations, especially for long-running computations that exceed the maximum time a resource manager allocates to a job. Transparent checkpointing (with no modification of the underlying binary executable) is an important component of any strategy for software resilience and for chaining resource allocations. However, low runtime overhead is critical for community acceptance of a transparent checkpointing solution. ("Runtime overhead" is the additional running time of an application that takes no checkpoints when it runs under the checkpointing package, compared to running without it.) This work presents a collective-vector-clock algorithm for transparent checkpointing of MPI. The algorithm is built on the software of the mature MANA project for transparent checkpointing of MPI. MANA's existing two-phase-commit algorithm produces very high runtime overhead compared to "native" execution: MANA was found to incur runtime overheads as high as 37% on some real-world programs and up to 800% on some typical micro-benchmarks, especially on workloads that use collective communication intensively. The new algorithm replaces two-phase commit. It is a novel variation on vector clock algorithms: it uses a vector of logical clocks, with one clock for each distinct group of MPI processes underlying the MPI communicators in the application. This contrasts with the traditional vector of logical clocks, which has one clock per individual process. Micro-benchmarks show a runtime overhead of essentially zero for many MPI processes. Two real-world applications, VASP and GROMACS, show a runtime overhead ranging mostly from 0% to 7% (VASP) and 1% to 14% (GROMACS), even before further analysis and optimization of other sources of overhead.
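The idea of keeping one logical clock per distinct group of MPI processes, instead of one per process, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the MANA implementation: the class `CollectiveClock` and the helpers `targets` and `lagging` are hypothetical names, no real MPI calls are made, and the checkpoint-time rule shown (a process is behind if its clock for a group is below the maximum seen by any member of that group) is an assumed reading of how such clocks could be compared, since the abstract does not spell out the protocol.

```python
from collections import defaultdict


class CollectiveClock:
    """Hypothetical per-process state: one logical clock per process group."""

    def __init__(self, rank):
        self.rank = rank
        # Key: frozenset of ranks in the group underlying a communicator.
        # Two communicators over the same set of ranks share one clock.
        self.clocks = defaultdict(int)

    def on_collective(self, group_ranks):
        """Advance the clock of the group on which a collective is called."""
        key = frozenset(group_ranks)
        if self.rank not in key:
            raise ValueError("rank is not a member of this group")
        self.clocks[key] += 1
        return self.clocks[key]


def targets(processes):
    """Assumed rule: the target for each group is the maximum clock value
    that any process has recorded for that group."""
    target = {}
    for p in processes:
        for group, count in p.clocks.items():
            target[group] = max(target.get(group, 0), count)
    return target


def lagging(processes, target):
    """Map each rank that is behind to the groups in which it is behind."""
    behind = {}
    for p in processes:
        for group, goal in target.items():
            current = p.clocks.get(group, 0)
            if p.rank in group and current < goal:
                behind.setdefault(p.rank, []).append((sorted(group), current, goal))
    return behind


if __name__ == "__main__":
    world, sub = [0, 1, 2, 3], [0, 1]
    procs = [CollectiveClock(r) for r in range(4)]
    # Ranks 0-2 have entered one collective on the world group; rank 3 has not.
    for r in (0, 1, 2):
        procs[r].on_collective(world)
    # Rank 0 has entered two collectives on the sub-group; rank 1 only one.
    procs[0].on_collective(sub)
    procs[0].on_collective(sub)
    procs[1].on_collective(sub)
    print(lagging(procs, targets(procs)))
```

In this toy run the vector has two entries (one per group) rather than four (one per rank), which is the contrast with a traditional vector clock that the abstract draws. Comparing the entries only when a checkpoint is requested, rather than synchronizing around every collective call, is an assumption about where the reduced overhead would come from.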