Studies have demonstrated that Apache Spark, Flink and related frameworks can perform stream processing at very high frequencies, whilst tending to focus on small messages with a computationally light `map' stage for each message; a common enterprise use case. We add to these benchmarks by broadening the domain to include loads with larger messages (leading to network-bound throughput), and that are computationally intensive (leading to CPU-bound throughput) in the map phase; in order to evaluate applicability of these frameworks to scientific computing applications. We present a performance benchmark comparison between Apache Spark Streaming (ASS) under both file and TCP streaming modes; and HarmonicIO, comparing maximum throughput over a broad domain of message sizes and CPU loads. We find that relative performance varies considerably across this domain, with the chosen means of stream source integration having a big impact. We offer recommendations for choosing and configuring the frameworks, and present a benchmarking toolset developed for this study.
翻译:研究表明,Apache Spark、Flink和相关框架可以在非常高的频率上进行流流处理,同时倾向于以每个电文的计算光“马图”阶段的小型电文为重点;一个共同的企业使用案例。我们将这些基准添加到这些基准中,扩大域以包括含有较大电文的负载(导致网络覆盖的吞吐量),并在地图阶段进行大量计算(导致CPU约束的吞吐量);以便评价这些框架对科学计算应用的适用性。我们介绍了档案和TCP流模式下的Apache Spark Streaming(ASS)之间的业绩基准比较;以及“Halonicio”对信息大小和CPU载量的广泛域的最大吞吐量进行比较。我们发现,这个域的相对性能差异很大,而所选择的流源集集集集集方式影响很大。我们提出了选择和配置框架的建议,并提出为这项研究制定的基准工具。