Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over large amounts of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize workflow performance, sufficient resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user resource estimates, which are known to be highly error-prone and can result in over- or underprovisioning. While resource overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks. In this paper, we propose two different reinforcement learning approaches, based on gradient bandits and Q-learning, respectively, to minimize resource wastage by selecting suitable CPU and memory allocations. We provide a prototypical implementation in the well-known scientific workflow management system Nextflow, evaluate our approaches on five workflows, and compare them against the default resource configurations and a state-of-the-art feedback loop baseline. The evaluation shows that our reinforcement learning approaches significantly reduce resource wastage compared to the default configuration. Furthermore, our approaches reduce the allocated CPU hours compared to the state-of-the-art feedback loop by 6.79% and 24.53%, respectively.
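To illustrate the gradient bandit idea mentioned above, the following minimal sketch selects among a few discrete memory allocations for a single task type and learns from a wastage-based reward. The candidate memory grid, reward shaping, hyperparameters, and class name are illustrative assumptions for exposition, not the paper's actual implementation or its integration with Nextflow.

```python
import numpy as np

# Hypothetical sketch: a gradient bandit choosing a memory allocation for one
# task type. All constants below are assumptions, not the paper's setup.

class GradientBandit:
    def __init__(self, n_actions, step_size=0.1):
        self.h = np.zeros(n_actions)       # action preferences
        self.baseline = 0.0                # running average reward (baseline)
        self.t = 0
        self.alpha = step_size

    def policy(self):
        e = np.exp(self.h - self.h.max())  # numerically stable softmax
        return e / e.sum()

    def select(self):
        return np.random.choice(len(self.h), p=self.policy())

    def update(self, action, reward):
        self.t += 1
        self.baseline += (reward - self.baseline) / self.t
        pi = self.policy()
        onehot = np.zeros_like(self.h)
        onehot[action] = 1.0
        # Gradient bandit preference update (Sutton & Barto style)
        self.h += self.alpha * (reward - self.baseline) * (onehot - pi)

# Assumed candidate memory allocations in GB and a toy task whose true peak
# usage is 6 GB; the reward penalizes wastage and underprovisioned (failed) runs.
memory_grid = [2, 4, 8, 16, 32]
bandit = GradientBandit(len(memory_grid))
true_peak_gb = 6.0

for _ in range(2000):
    a = bandit.select()
    allocated = memory_grid[a]
    if allocated < true_peak_gb:
        reward = -10.0                         # failure due to underprovisioning
    else:
        reward = -(allocated - true_peak_gb)   # negative wastage
    bandit.update(a, reward)

print("softmax policy:", np.round(bandit.policy(), 3))  # mass concentrates on 8 GB
```

In this toy setting the policy converges toward the smallest allocation that still covers the task's peak usage, which mirrors the abstract's goal of minimizing wastage without provoking task failures.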