Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insight into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks of a collaborative approach to optimizing data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or on dedicated job profiling. To this end, we describe how the similarity of processing jobs and cluster infrastructures can be exploited to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.
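To make the idea of building performance models from shared runtime data concrete, the following is a minimal sketch, not the paper's actual method: it fits a simple parametric scaling model (runtime as a constant overhead plus a data-parallel term that shrinks with the node count) to runtime data points collected from previous executions of similar jobs. All names and data values are illustrative assumptions.

```python
def fit_runtime_model(samples):
    """Least-squares fit of the model: runtime = a + b / nodes.

    `samples` is a list of (nodes, runtime_seconds) pairs gathered from
    local and shared executions of similar jobs. The b / nodes term
    captures the parallelizable portion of the work; a captures
    overhead that does not shrink with more nodes.
    """
    xs = [1.0 / nodes for nodes, _ in samples]
    ys = [runtime for _, runtime in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression on x = 1 / nodes.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

def predict_runtime(model, nodes):
    """Predict the runtime for a candidate cluster size."""
    a, b = model
    return a + b / nodes

# Hypothetical runtime data points from prior executions of a similar
# job: (node count, observed runtime in seconds).
history = [(2, 310.0), (4, 165.0), (8, 95.0), (16, 60.0)]
model = fit_runtime_model(history)
estimate = predict_runtime(model, 32)
```

A model like this can then be evaluated for several candidate cluster sizes to pick the cheapest configuration that still meets a runtime target; richer models would add terms for input data size and resource characteristics.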