Distributed stream processing engines are designed with a focus on scalability to process big data volumes in a continuous manner. We present the Theodolite method for benchmarking the scalability of distributed stream processing engines. Core of this method is the definition of use cases that microservices implementing stream processing have to fulfill. For each use case, our method identifies relevant workload dimensions that might affect the scalability of a use case. We propose to design one benchmark per use case and relevant workload dimension. We present a general benchmarking framework, which can be applied to execute the individual benchmarks for a given use case and workload dimension. Our framework executes an implementation of the use case's dataflow architecture for different workloads of the given dimension and various numbers of processing instances. This way, it identifies how resources demand evolves with increasing workloads. Within the scope of this paper, we present 4 identified use cases, derived from processing Industrial Internet of Things data, and 7 corresponding workload dimensions. We provide implementations of 4 benchmarks with Kafka Streams and Apache Flink as well as an implementation of our benchmarking framework to execute scalability benchmarks in cloud environments. We use both for evaluating the Theodolite method and for benchmarking Kafka Streams' and Flink's scalability for different deployment options.
翻译:分布式流处理引擎的设计重点是以连续方式处理大数据量的可扩展性。我们提出了一个通用基准框架,用于为分布式流处理引擎的可扩展性制定基准。我们介绍了用于对分布式流处理引擎的可扩展性进行基准衡量的Theodolite方法。这一方法的核心是确定执行流处理的微服务必须完成的使用案例的定义。对于每个使用案例,我们的方法确定了可能影响使用案例可扩展性的相关工作量层面。我们建议为每个使用案例和相关工作量层面设计一个基准。我们提出了一个一般基准框架,可用于执行特定使用案例和工作量层面的单个基准。我们的框架针对特定层面的不同工作量和各种处理案例的不同工作量执行使用案例的数据流结构。这样,它确定了资源需求如何随着工作量的增加而变化。在本文范围内,我们提出了4个来自处理工业互联网用户数据的案例,以及7个相应的工作量层面。我们为Kafka Streams和Apac Flink提供了4个基准框架的实施情况,以及我们的基准框架的实施是为了在云层环境中执行可扩展性基准基准基准基准。我们使用不同的卡夫卡的可选办法和卡卡利标准。