Large-scale, fault-tolerant, distributed systems are the backbone for many critical software services. Since they must execute correctly in a possibly adversarial environment with arbitrary communication delays and failures, the underlying algorithms are intricate. In particular, achieving consistency and data retention relies on intricate consensus (state machine replication) protocols. Ensuring the reliability of implementations of such protocols remains a significant challenge because of the enormous number of exceptional conditions that may arise in production. We propose a methodology and a tool called Netrix for testing such implementations that aims to exploit programmer's knowledge to improve coverage, enables robust bug reproduction, and can be used in regression testing across different versions of an implementation. As evaluation, we apply our tool to a popular proof of stake blockchain protocol, Tendermint, which relies on a Byzantine consensus algorithm, a benign consensus algorithm, Raft, and BFT-Smart. We were able to identify deviations of the Tendermint implementation from the protocol specification and verify corrections on an updated implementation. Additionally, we were able to reproduce previously known bugs in Raft.
翻译:大型的、容错的、分布式的系统是许多关键软件服务的支柱。由于这些系统必须在可能存在任意通信延误和故障的敌对环境中正确执行,因此基本的算法是复杂的。特别是,实现一致性和数据保留取决于复杂的协商一致(国家机器复制)协议。确保执行这些协议的可靠性仍然是一项重大挑战,因为生产过程中可能出现大量例外条件。我们提出了一个方法和工具,称为Netrix,用于测试这类实施,目的是利用程序员的知识来扩大覆盖范围,实现稳健的错误复制,并可用于在不同版本的实施中进行回归测试。作为评估,我们运用我们的工具来进行关于供应链协议的大众证明,即Tendermint,它依赖于拜占坦共识的协商一致算法、友好的协商一致算法、Raft和BFT-Smart。我们能够查明Tendermint执行偏离协议规格的情况,并核查更新执行过程中的纠正。此外,我们复制了Raft的先前已知的错误。</s>