Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users such as data engineers. Poor choices can vastly increase costs without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. Knowing a job's memory requirement in advance greatly reduces the search space for an optimal resource configuration. We therefore present Ruya, a method for memory-aware optimization of data processing cluster configurations that iteratively explores a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and, within this reduced search space, iteratively search for the best cluster configuration using Bayesian optimization. The search stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset of 1031 Spark and Hadoop jobs, Ruya roughly halves the number of search iterations needed to find an optimal configuration compared to the baseline.
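To make the first two steps concrete, the following is a minimal sketch of memory extrapolation from profiling runs and memory-based prioritization of cluster configurations. It is not the Ruya implementation. The linear memory model, the ordering rule, the `ClusterConfig` type, all function names, and all numbers are assumptions chosen for illustration. The Bayesian optimization step is omitted; the ordered list produced here only stands for the narrowed search space that such an optimizer would explore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster configuration: node count and memory per node."""
    name: str
    nodes: int
    memory_per_node_gb: float

    @property
    def total_memory_gb(self) -> float:
        return self.nodes * self.memory_per_node_gb


def fit_linear_memory_model(samples):
    """Least-squares fit of peak_memory = slope * input_size + intercept.

    `samples` holds (input_size_gb, peak_memory_gb) pairs from profiling
    runs. Assuming a linear relationship is a simplification of this sketch.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two profiling runs")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("profiling runs must use different sample sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_memory_gb(model, full_input_size_gb):
    """Extrapolate the fitted model to the full dataset size."""
    slope, intercept = model
    return slope * full_input_size_gb + intercept


def prioritize_configs(configs, required_memory_gb):
    """Order candidates for the search (assumed rule, for illustration).

    Configurations with enough total memory come first, smallest surplus
    first; configurations with too little memory follow, smallest deficit
    first. Ties keep their input order because sorted() is stable.
    """
    def key(cfg):
        surplus = cfg.total_memory_gb - required_memory_gb
        return (surplus < 0, abs(surplus))

    return sorted(configs, key=key)


# Made-up profiling results: (sample size in GB, peak memory in GB).
profiling_runs = [(1.0, 2.5), (2.0, 4.5), (4.0, 8.5)]
model = fit_linear_memory_model(profiling_runs)
required = estimate_memory_gb(model, full_input_size_gb=100.0)

# Made-up candidate configurations.
candidates = [
    ClusterConfig("4 x 32 GB", 4, 32.0),
    ClusterConfig("8 x 32 GB", 8, 32.0),
    ClusterConfig("16 x 32 GB", 16, 32.0),
    ClusterConfig("4 x 64 GB", 4, 64.0),
]
ordered = prioritize_configs(candidates, required)

print(f"estimated memory requirement: {required:.1f} GB")
for cfg in ordered:
    print(cfg.name, cfg.total_memory_gb)
```

In this sketch, configurations with too little memory are ranked last rather than discarded, which matches the wording "prioritize" above. A job whose memory usage does not grow linearly with input size would need a different model.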