Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.
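The combination of offline job grouping and online reinforcement learning described above can be illustrated with a minimal sketch. This is not Hugo's implementation. The centroids, the `assign_group` helper, the `CoLocationLearner` class, the epsilon-greedy update rule, and the `reward` function are all assumptions introduced here for illustration. The abstract does not specify the grouping method, the learning algorithm, or the reward definition.

```python
import random
from collections import defaultdict

# Hypothetical group centroids over (cpu, disk, network) utilization in [0, 1].
# In practice these would come from an offline clustering of job profiles.
CENTROIDS = {
    "cpu_bound": (0.9, 0.2, 0.1),
    "io_bound": (0.2, 0.9, 0.2),
    "net_bound": (0.2, 0.2, 0.9),
}


def assign_group(profile):
    """Offline step (assumed): map a job's resource profile to the nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(CENTROIDS, key=lambda g: sq_dist(profile, CENTROIDS[g]))


class CoLocationLearner:
    """Online step (assumed): epsilon-greedy value estimates per group pair."""

    def __init__(self, epsilon=0.1, step_size=0.2, seed=0):
        self.epsilon = epsilon
        self.step_size = step_size
        # (running_group, candidate_group) -> estimated co-location reward
        self.values = defaultdict(float)
        self.rng = random.Random(seed)

    def select(self, running_group, queued_groups):
        """Return the index of the queued job to co-locate with the running job."""
        if self.rng.random() < self.epsilon:
            # Explore: try a random combination to keep adapting to new workloads.
            return self.rng.randrange(len(queued_groups))
        # Exploit: pick the group with the highest learned estimate.
        scores = [self.values[(running_group, g)] for g in queued_groups]
        return scores.index(max(scores))

    def update(self, running_group, candidate_group, reward_value):
        """Move the pair's estimate a fraction of the way toward the observed reward."""
        key = (running_group, candidate_group)
        self.values[key] += self.step_size * (reward_value - self.values[key])


def reward(utilization, interference):
    """Assumed reward: high utilization is rewarded, interference is penalized."""
    return utilization - interference
```

Because estimates are stored per group pair rather than per job pair, an observation for one monitored combination also informs decisions about unseen jobs that fall into the same groups, which is the kind of generalization the abstract refers to. A caller would classify each queued job with `assign_group`, pick one with `select`, and feed the measured outcome back through `update`.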