Most online service providers deploy their own data stream processing systems in the cloud to conduct large-scale, real-time data analytics. However, such systems, e.g., Apache Heron, often adopt naive scheduling schemes to distribute data streams (in units of tuples) among processing instances, which may result in workload imbalance and system disruption. Hence, a mismatch persists between the temporal variations of data streams and such inflexible scheduling schemes. Moreover, the fundamental benefits of predictive scheduling for data stream processing systems remain unexplored. In this paper, we focus on the problem of tuple scheduling with predictive service in Apache Heron. With a careful choice of the granularity of system modeling and decision making, we formulate the problem as a stochastic network optimization problem and propose POTUS, an online predictive scheduling scheme that aims to minimize the response time of data stream processing by steering data streams in a distributed fashion. Theoretical analysis and simulation results show that POTUS achieves an ultra-low response time with a queue stability guarantee. Moreover, POTUS requires only a modest amount of future information to effectively reduce the response time, even in the presence of misprediction.