While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then finish on the intact nodes, with the lost tasks reassigned to them. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work-stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs, and we expect the new algorithm to achieve similar performance.
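As an illustration of the localized recovery idea for independent tasks, the following is a minimal simulation sketch, not the technique described in this abstract. It assumes a hypothetical model in which each task is an integer `n` with result `n * n`, finished results are written to a store that survives the failure, and exactly one of at least two workers fails. The function name `run_with_localized_recovery` and all of its parameters are invented for this example.

```python
from collections import deque


def run_with_localized_recovery(tasks, num_workers, failed_worker, fail_after):
    """Simulate independent tasks on several workers, one of which fails.

    Hypothetical model (an assumption, not taken from the source):
    a task is an integer n and its result is n * n. Results are
    checkpointed to a store that survives the failure. Requires
    num_workers >= 2 so that at least one worker survives.
    """
    # Round-robin initial distribution of tasks to per-worker queues.
    queues = {w: deque(tasks[w::num_workers]) for w in range(num_workers)}
    checkpoint = {}  # resilient store: task -> result
    alive = set(range(num_workers))
    steps = 0

    while any(queues[w] for w in alive):
        # Each surviving worker processes one task per step.
        for w in sorted(alive):
            if queues[w]:
                task = queues[w].popleft()
                checkpoint[task] = task * task  # compute and checkpoint

        steps += 1
        if steps == fail_after and failed_worker in alive:
            # Localized recovery: only the failed worker stops. Its
            # unfinished tasks are reassigned to the survivors, which
            # keep running without a global restart.
            alive.remove(failed_worker)
            lost = list(queues.pop(failed_worker))
            survivors = sorted(alive)
            for i, task in enumerate(lost):
                queues[survivors[i % len(survivors)]].append(task)

    return checkpoint


if __name__ == "__main__":
    results = run_with_localized_recovery(list(range(10)), 3, 1, 1)
    print(len(results), "tasks completed despite the failure")
```

The sketch deliberately omits what makes the nested fork-join case harder: tasks here have no parent-child dependencies and are never stolen, so reassigning a lost queue is enough to finish the computation.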