High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable, external storage (e.g., parallel file systems) in the background. Currently, VELOC adopts a one-file-per-process flush strategy, which results in a large number of files being written to external storage, thereby overwhelming metadata servers and making it difficult to transfer and access checkpoints as a whole. This paper discusses the viability and challenges of designing aggregation techniques for asynchronous multi-level checkpointing. To this end, we implement two aggregation strategies, study their limitations, and propose a new aggregation strategy designed specifically for asynchronous multi-level checkpointing.