A distributed system consisting of a huge number of computational entities is prone to failure, because faults in even a few nodes can cause the entire system to fail. Consequently, fault tolerance is a critical issue for distributed systems. Checkpoint-rollback recovery is a widely used, representative technique for fault tolerance: it periodically records the entire system state (configuration) to non-volatile storage, and after a failure the system restores itself from the recorded configuration. Recording a configuration of a distributed system requires a dedicated algorithm known as a snapshot algorithm. However, many snapshot algorithms require coordination among all nodes in the system, so executing them frequently incurs unacceptable communication cost, especially in large systems. To address this, partial snapshot algorithms have been introduced, which take a partial snapshot instead of a global snapshot. However, if two or more partial snapshot algorithms are executed concurrently and their snapshot domains overlap, they must coordinate so that the partial snapshots they take are consistent. In this paper, we propose a new, efficient partial snapshot algorithm that aims to reduce the communication needed for this coordination. In simulation, we show that the proposed algorithm drastically outperforms the existing partial snapshot algorithm in terms of message and time complexity.
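To make the consistency requirement concrete, the following is a minimal sketch of a consistency check for a partial snapshot. It is not the proposed algorithm. It assumes the standard notion of a consistent cut (no message is recorded as received unless its send is also recorded), which the abstract itself does not spell out. The names `Message` and `is_consistent`, the per-node event indices, and the choice to ignore messages that cross the domain boundary are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message between two nodes."""
    sender: str
    receiver: str
    sent_at: int      # local event index at the sender when the message was sent
    received_at: int  # local event index at the receiver when it was delivered


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Check a partial snapshot against the consistent-cut condition.

    `cut` maps each node in the snapshot domain to the index of the last
    local event included in its recorded state. The snapshot is treated as
    inconsistent if some message between two domain nodes is recorded as
    received while its send is not recorded.
    """
    for m in messages:
        # Simplifying assumption: messages to or from nodes outside the
        # snapshot domain are ignored in this toy model.
        if m.sender not in cut or m.receiver not in cut:
            continue
        received_in_cut = m.received_at <= cut[m.receiver]
        sent_in_cut = m.sent_at <= cut[m.sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True
```

For example, with a single message sent by node `a` at its event 3 and received by node `b` at its event 2, the cut `{"a": 2, "b": 2}` records the receipt but not the send and is rejected, whereas `{"a": 3, "b": 2}` is accepted. When snapshot domains overlap, each shared node contributes a recorded state to more than one partial snapshot, which is why concurrent executions need to coordinate.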