Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing a suitable type and number of cluster resources for a distributed dataflow job is difficult, especially for users who do not have access to performance metrics from previous executions. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem with sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques for minimizing the training data size while keeping the associated performance models accurate. Our results indicate that training data reduction yields efficiency gains in data transfer, storage, and model training. Evaluating our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% data reduction increases prediction errors by only one percentage point.
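To make the idea concrete, the following is a minimal sketch, not the paper's implementation, of clustering-based training data reduction: shared runtime samples are grouped with k-means and only one representative per cluster is kept before a performance model is trained. The synthetic features, the 75% reduction ratio, and the choice of a gradient-boosting regressor are assumptions for illustration; the sketch uses scikit-learn.

```python
# Illustrative sketch of training data reduction via clustering.
# Features, reduction ratio, and model choice are assumptions, not the
# paper's actual pipeline.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Stand-in for shared runtime metrics of 930 jobs: three features
# (e.g., scale-out, memory per node, input size) and the runtime to predict.
X = rng.random((930, 3))
y = X @ np.array([120.0, 60.0, 300.0]) + rng.normal(0.0, 5.0, 930)

# Cluster the samples and keep only the member closest to each centroid,
# shrinking the training set to roughly 25% of its original size.
k = len(X) // 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

keep = []
for i, center in enumerate(km.cluster_centers_):
    members = np.flatnonzero(km.labels_ == i)
    keep.append(members[np.argmin(np.linalg.norm(X[members] - center, axis=1))])

# Train the performance model on the reduced dataset only.
model = GradientBoostingRegressor(random_state=0).fit(X[keep], y[keep])
print("predicted runtime for a new configuration:", model.predict(X[:1])[0])
```

Keeping the cluster member closest to each centroid (rather than the centroid itself) ensures every retained training point is a real, observed execution; other representative-selection strategies are possible within the same scheme.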