This paper proposes a learned cost estimation model for Distributed Stream Processing Systems (DSPS) that aims to provide accurate cost predictions for query execution. A major premise of this work is that the proposed learned model generalizes to the dynamics of streaming workloads out of the box. That is, once trained, the model can accurately predict performance metrics such as latency and throughput even if the characteristics of the data and workload, or the deployment of operators to hardware, change at runtime. The model can therefore be used for tasks such as optimizing the placement of operators to minimize the end-to-end latency of a streaming query or to maximize its throughput, even under varying conditions. Our evaluation on a well-known DSPS, Apache Storm, shows that the model makes accurate predictions for unseen workloads and queries and generalizes across real-world benchmarks.