Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern in many HPC applications. However, this pattern frequently causes I/O bottlenecks that degrade scalability and performance. As modern HPC infrastructures continue to evolve, the gap between compute capacity and I/O capabilities keeps growing. Furthermore, the storage hierarchy is becoming increasingly heterogeneous: in addition to parallel file systems, it comprises burst buffers, key-value stores, deep memory hierarchies at the node level, etc. In this context, the state of the art is insufficient to deal with the diversity of vendor APIs and of performance and persistency characteristics. This extended abstract presents an overview of VeloC (Very Low Overhead Checkpointing System), a checkpointing runtime specifically designed to address these challenges for next-generation exascale HPC applications and systems. VeloC offers a simple API at the user level while employing an advanced multi-level resilience strategy that transparently optimizes the performance and scalability of checkpointing by leveraging heterogeneous storage.
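To make the multi-level idea concrete, the following is a minimal sketch of a two-level checkpointer, not VeloC's actual API or implementation. It assumes that two ordinary directories stand in for a fast node-local tier and a slower persistent tier (such as a parallel file system). The class `MultiLevelCheckpointer` and all of its method names are hypothetical, chosen only for illustration. The application blocks only for the fast local write, while the flush to the slow tier proceeds in the background.

```python
import shutil
import threading
from pathlib import Path


class MultiLevelCheckpointer:
    """Illustrative two-level checkpointer (hypothetical, not the VeloC API).

    Level 1: blocking write to a fast tier (stand-in for node-local storage).
    Level 2: background copy to a slow tier (stand-in for a parallel file system).
    """

    def __init__(self, fast_dir, slow_dir):
        self.fast = Path(fast_dir)
        self.slow = Path(slow_dir)
        self.fast.mkdir(parents=True, exist_ok=True)
        self.slow.mkdir(parents=True, exist_ok=True)
        self._flushes = []  # background flush threads still in flight

    def _file_name(self, name, version):
        return f"{name}-{version:06d}.ckpt"

    def checkpoint(self, name, version, payload):
        # Level 1: the caller waits only for this fast local write.
        local = self.fast / self._file_name(name, version)
        local.write_bytes(payload)
        # Level 2: copy to the slow tier without blocking the caller.
        worker = threading.Thread(target=self._flush, args=(local,))
        worker.start()
        self._flushes.append(worker)

    def _flush(self, local):
        # Copy under a temporary name, then rename, so that a reader of the
        # slow tier never sees a half-written checkpoint file.
        tmp = self.slow / (local.name + ".tmp")
        shutil.copyfile(local, tmp)
        tmp.replace(self.slow / local.name)

    def wait(self):
        # Block until every pending background flush has finished.
        for worker in self._flushes:
            worker.join()
        self._flushes.clear()

    def latest_version(self, name):
        # Highest version found on either tier, or -1 if none exists.
        versions = [-1]
        for tier in (self.fast, self.slow):
            for path in tier.glob(f"{name}-*.ckpt"):
                versions.append(int(path.stem.rsplit("-", 1)[1]))
        return max(versions)

    def restart(self, name, version):
        # Prefer the fast tier; fall back to the slow tier if the local
        # copy was lost (for example, after a node failure).
        for tier in (self.fast, self.slow):
            path = tier / self._file_name(name, version)
            if path.exists():
                return path.read_bytes()
        raise FileNotFoundError(f"no checkpoint {name!r} version {version}")
```

In this sketch, an application would call `checkpoint` at regular intervals, call `wait` before exiting, and on startup use `latest_version` followed by `restart` to resume. A production runtime would additionally handle coordination across processes, more than two tiers, and redundancy schemes, none of which are modeled here.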