Disaggregated memory architectures offer applications benefits beyond those of traditional scale-out environments, such as independent scaling of compute and memory resources. They also provide an independent failure model: computations, or the compute nodes they run on, may fail independently of the disaggregated memory, so data resident in that memory is unaffected by a compute failure. Blind application of traditional resilience techniques (e.g., checkpoints or data replication) does not take advantage of these architectures. To demonstrate the potential benefit of these architectures for resilience, we develop Memory-Oriented Distributed Computing (MODC), a framework for programming disaggregated architectures that borrows and adapts ideas from task-based programming models, concurrent programming techniques, and lock-free data structures. The framework includes a task-based application programming model and a runtime system that provides scheduling, coordination, and fault tolerance mechanisms. We present highlights of our MODC prototype and experimental results demonstrating that MODC-style resilience outperforms a checkpoint-based approach in the face of failures.