Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitoring and identifying the faulty components of a general system. Over the past decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to a wide variety of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction that enables consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the \textit{de facto} standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Results are presented for the number of tests/monitoring messages required, the latency of event detection, and completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors.