RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations running on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. It is addressed via hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability obtained, respectively, through timing analysis and through modelling of the thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
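The link between prediction accuracy and checkpointing overhead can be illustrated with a minimal sketch. It does not reproduce the RECIPE policy or its models: it assumes the classical first-order approximation (Young's formula), in which the fraction of time lost with checkpoint interval τ, checkpoint cost C and mean-time-to-failure M is roughly C/τ + τ/(2M), minimized at τ = sqrt(2·C·M). The function names, the neglect of restart cost, and the numeric values are all hypothetical choices made for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mttf_s: float) -> float:
    """Checkpoint interval minimizing first-order waste (Young's approximation)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mttf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float, true_mttf_s: float) -> float:
    """First-order fraction of time lost: checkpoint writes plus expected rework.

    Assumption: restart cost is ignored and interval_s is much smaller than
    true_mttf_s, so on average half an interval is recomputed per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * true_mttf_s)


def overhead_ratio(checkpoint_cost_s: float, true_mttf_s: float, predicted_mttf_s: float) -> float:
    """Waste with the interval derived from the predicted MTTF, relative to
    the waste obtained when the true MTTF is known exactly."""
    chosen = young_interval(checkpoint_cost_s, predicted_mttf_s)
    best = young_interval(checkpoint_cost_s, true_mttf_s)
    return (waste_fraction(chosen, checkpoint_cost_s, true_mttf_s)
            / waste_fraction(best, checkpoint_cost_s, true_mttf_s))


if __name__ == "__main__":
    cost_s = 60.0          # hypothetical checkpoint cost: one minute
    true_mttf_s = 86400.0  # hypothetical true MTTF: one day
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        ratio = overhead_ratio(cost_s, true_mttf_s, factor * true_mttf_s)
        print(f"predicted/true MTTF = {factor:4.2f} -> relative overhead = {ratio:.3f}")
```

Under this simplified model the relative overhead depends only on the ratio f between predicted and true MTTF, as (sqrt(f) + 1/sqrt(f)) / 2, so overestimating or underestimating the MTTF by the same factor costs the same amount.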