High-performance computing (HPC) systems continue to grow in computing power and energy efficiency. Nevertheless, their total energy consumption keeps rising, and finding ways to limit or reduce it is a central concern of current research. For high-performance MPI applications, rollback-recovery fault tolerance methods such as uncoordinated checkpointing are available. With these methods, only some processes roll back when a failure occurs, while the remaining processes continue to run. In this article, we focus on the processes that continue execution and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies and, through simulation, analyze the behavior of an application under different configurations and failure times. The results show that it is feasible to improve energy efficiency in HPC systems in the presence of a failure.