Distributed Stream Processing (DSP) systems process large, continuous streams of data and produce results in near real time. They are an essential part of many data-intensive applications and analytics platforms. The rate at which events arrive at a DSP system can vary considerably over time, for example because of trends and cyclic or seasonal patterns in the data streams. A priori knowledge of incoming workloads enables proactive approaches to resource management and optimization tasks such as dynamic scaling, live migration of resources, and the tuning of configuration parameters at runtime, potentially leading to better Quality of Service. In this paper, we conduct a comprehensive evaluation of load prediction techniques for DSP jobs. We identify three use cases and formulate requirements for load predictions specific to DSP jobs. We evaluate automatically optimized classical and Deep Learning methods on nine datasets from typical DSP domains, namely IoT, Web 2.0, and cluster monitoring. We compare model performance with respect to overall accuracy and training duration. Our results show that the Deep Learning methods provide the most accurate load predictions for the majority of the evaluated datasets.