Fault-tolerant distributed applications require mechanisms to recover data lost when a process fails. On modern cluster systems, it is typically impractical to request replacement resources after such a failure, so applications have to continue working with the remaining resources. This requires redistributing the workload and having the surviving processes reload the lost data. We present an algorithmic framework and its C++ library implementation, ReStore, for MPI programs that enables recovery of data after process failures. By storing all required data in memory via an appropriate data distribution and replication, recovery is substantially faster than with standard checkpointing schemes that rely on a parallel file system. Because the application developer can specify which data to load, we also support shrinking recovery in addition to recovery using spare compute nodes. We evaluate ReStore both in controlled, isolated environments and in real applications. Our experiments show loading times for lost input data in the range of milliseconds on up to 24 576 processors and a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
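The idea of keeping replicated data in memory so that survivors can reload it can be illustrated with a minimal sketch. It is not ReStore's API or its actual data distribution: the function names `replicaHolder` and `survivingHolder` and the shifted round-robin placement are assumptions chosen only for illustration, and ranks are simulated with a plain vector instead of MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ReStore): rank that stores the given copy
// of a data block. Copy 0 is placed round-robin; each further copy is shifted
// by numRanks / replicas so that copies of one block land on distinct ranks.
int replicaHolder(std::size_t block, int copy, int numRanks, int replicas) {
    assert(numRanks > 0 && replicas > 0 && replicas <= numRanks);
    assert(copy >= 0 && copy < replicas);
    const std::size_t n = static_cast<std::size_t>(numRanks);
    const std::size_t shift = n / static_cast<std::size_t>(replicas);  // >= 1
    const std::size_t primary = block % n;
    return static_cast<int>((primary + static_cast<std::size_t>(copy) * shift) % n);
}

// Hypothetical helper: first surviving rank that still holds a copy of the
// block, or -1 if every copy was on a failed rank (the block is lost).
int survivingHolder(std::size_t block, const std::vector<bool>& alive, int replicas) {
    const int numRanks = static_cast<int>(alive.size());
    for (int copy = 0; copy < replicas; ++copy) {
        const int rank = replicaHolder(block, copy, numRanks, replicas);
        if (alive[static_cast<std::size_t>(rank)]) {
            return rank;
        }
    }
    return -1;
}
```

In this sketch, a shrinking recovery would call `survivingHolder` for each block a surviving process needs after the workload is redistributed, then fetch the block from the returned rank instead of reading a checkpoint from the parallel file system.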