Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance

Large-scale decentralized systems of autonomous agents interacting via asynchronous communication often experience the following self-healing dilemma: fault detection inherits network uncertainties, making a faulty process indistinguishable from a slow one. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault-correction results in new faults that could have been prevented with fault-tolerance instead. Nevertheless, fault-tolerance alone, without eventually correcting persistent faults, makes systems underperform as well. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in several application domains of energy, transport and health. This paper introduces a novel and general-purpose modeling of fault scenarios. These scenarios can accurately measure and predict the inconsistencies generated by fault-correction and fault-tolerance when each node in a network can monitor the health status of another node, while both can defect. In contrast to related work, no information about the computational/application scenario, overlying algorithms or application data is required. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds, each with almost 9M measurements of inconsistencies in a prototyped decentralized network of 3000 nodes. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network aggregation using real-world data from a Smart Grid pilot project. Findings confirm the origin of inconsistencies at the design phase and provide new insights into how to tune self-healing at the design phase.
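The dilemma can be illustrated with a minimal sketch of a timeout-based fault detector. This is not the paper's model of fault scenarios: the `classify` and `evaluate` functions, the `Outcome` type and the observation values are all hypothetical and chosen only for illustration. The sketch assumes that a monitoring node sees nothing but how long its peer has been silent, and that it triggers fault-correction once the silence exceeds a fault detection threshold.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    false_corrections: int  # healthy-but-slow nodes that were wrongly corrected
    missed_faults: int      # faulty nodes that were tolerated instead of corrected


def classify(silence_s: float, threshold_s: float) -> bool:
    """Suspect a peer as faulty when its silence exceeds the threshold."""
    return silence_s > threshold_s


def evaluate(observations, threshold_s: float) -> Outcome:
    """Count both kinds of error for one threshold.

    observations: iterable of (silence_seconds, is_actually_faulty) pairs.
    The ground-truth flag is unknown to a real monitor; it is used here
    only to score the detector.
    """
    false_corrections = 0
    missed_faults = 0
    for silence_s, is_faulty in observations:
        suspected = classify(silence_s, threshold_s)
        if suspected and not is_faulty:
            false_corrections += 1
        elif not suspected and is_faulty:
            missed_faults += 1
    return Outcome(false_corrections, missed_faults)


# Hypothetical observations (illustrative values, not measured data).
OBSERVATIONS = [
    (0.2, False),   # healthy, responsive
    (0.9, False),   # healthy, responsive
    (2.5, False),   # healthy but slow
    (4.0, False),   # healthy but very slow
    (3.0, True),    # faulty, silent for a short time
    (10.0, True),   # faulty, silent for a long time
]

if __name__ == "__main__":
    for threshold in (1.0, 5.0, 12.0):
        print(threshold, evaluate(OBSERVATIONS, threshold))
```

Under these assumed values, a low threshold corrects every fault but also corrects slow, healthy nodes, while a high threshold avoids unnecessary corrections but leaves faults in place. No single threshold removes both kinds of error, because the two groups overlap in observed silence.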


