Scheduling job flows efficiently and rapidly on distributed computing clusters is one of huge challenges for daily operation of data centers. In a practical scenario, a single job consists of numerous stages with complex dependency relation represented as a Directed Acyclic Graph (DAG) structure. Nowadays a data center usually equips with a cluster of heterogeneous computing servers which are different in the hardware/software configuration. From both the cost saving and environmental friendliness, the data centers could benefit a lot from optimizing the job scheduling problems in the heterogeneous environment. Thus the problem has attracted more and more attention from both the industry and academy. In this paper, we propose a task-duplication based learning algorithm, namely \lachesis \footnote{The second of the Three Fates in ancient Greek mythology, who determines destiny.}, aiming to optimize the problem. In the proposed approach, it first perceives the topological dependencies between jobs using a reinforcement learning framework and a specially designed graph neural network (GNN) to select the most promising task to be executed. Then the task is assigned to a specific executor with the consideration of duplicating all its precedent tasks according to an expert-designed rules. We have conducted extensive experiments over standard workloads to evaluate the proposed solution. The experimental results suggest that \lachesisquad can achieve at most 26.7\% reduction of makespan and 35.2\% improvement of speedup ratio over seven strong baseline algorithms, including the state-of-the-art heuristics methods and a variety of deep reinforcement learning based algorithms.
翻译:高效和快速地在分布式计算群群上安排工作流动是数据中心日常运行面临的巨大挑战之一。 在实际的假设中,一个单一的工作由许多阶段的复杂依赖关系组成,以直接的环形图(DAG)结构为代表。现在,一个数据中心通常配备硬件/软件配置不同的混合计算服务器组。从节省成本和环保的角度出发,数据中心可以从优化不同环境中的工作时间安排问题中获益匪浅。因此,这个问题吸引了产业和学院越来越多的关注。在本文中,我们提议了一个基于任务重复的学习算法,即\lacheis\foote{希腊古神话中三个胖子的第二个,目的是优化问题。在拟议办法中,数据中心首先通过强化学习框架和专门设计的图形神经网络(GNNN),可以选择最有希望的任务。然后,任务被指派给一个特定的执行者,在考虑复制其所有先例性算法的精细比率时, 也就是根据专家设计的规则, 进行一项大规模的实验性裁量。