All modern distributed systems list performance and scalability as their core strengths. Given that optimal performance requires carefully selecting configuration options, and typical cluster sizes can range anywhere from 2 to 300 nodes, it is rare for any two clusters to be exactly the same. Validating the behavior and performance of distributed systems in this large configuration space is challenging without automation that stretches across the software stack. In this paper we present Fallout, an open-source distributed systems testing service that automatically provisions and configures distributed systems and clients, supports running a variety of workloads and benchmarks, and generates performance reports based on collected metrics for visual analysis. We have been running the Fallout service internally at DataStax for over 5 years and have recently open sourced it to support our work with Apache Cassandra, Pulsar, and other open source projects. We describe the architecture of Fallout along with the evolution of its design and the lessons we learned operating this service in a dynamic environment where teams work on different products and favor different benchmarking tools.