Training deep neural networks (DNNs) is a major workload in today's datacenters, driving rapid growth in energy consumption. It is therefore important to reduce energy consumption while still completing DL training jobs quickly. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion Time (JCT) under an energy budget. We first present performance models for DL training jobs that predict throughput and energy consumption under different configurations. Based on these models, PowerFlow dynamically allocates GPUs and adjusts the GPU-level or job-level configurations of DL training jobs. PowerFlow applies network packing and buddy allocation to job placement, avoiding the extra energy consumed by cluster fragmentation. Evaluation results show that, under the same energy consumption, PowerFlow improves the average JCT by 1.57-3.39× compared to competitive baselines.
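To make the abstract's core idea concrete, here is a minimal sketch of the kind of decision loop it describes: predict throughput and energy for each candidate configuration, then pick the configuration that minimizes JCT within an energy budget. This is a hypothetical illustration, not PowerFlow's actual model; the Amdahl-style GPU scaling, the sublinear power-cap penalty, and all names (`JobProfile`, `predict_throughput`, `predict_energy`, `best_config`) are assumptions made for exposition.

```python
# Illustrative sketch only: a toy throughput/energy model and a brute-force
# configuration search in the spirit of an energy-budgeted JCT scheduler.
# Functional forms and constants below are assumptions, not PowerFlow's models.
from dataclasses import dataclass
from itertools import product


@dataclass
class JobProfile:
    base_throughput: float    # samples/s on 1 GPU at the full power cap (assumed)
    scaling_alpha: float      # fraction of work that parallelizes (Amdahl-style assumption)
    samples_remaining: float  # training samples left to process


def predict_throughput(job: JobProfile, num_gpus: int, power_frac: float) -> float:
    # Assumed Amdahl-style speedup from adding GPUs...
    speedup = 1.0 / ((1.0 - job.scaling_alpha) + job.scaling_alpha / num_gpus)
    # ...times an assumed sublinear slowdown from power capping, so that
    # capping power trades some JCT for energy savings.
    return job.base_throughput * speedup * power_frac ** 0.5


def predict_energy(job: JobProfile, num_gpus: int, power_frac: float,
                   gpu_tdp_w: float = 300.0) -> float:
    # Energy (joules) = aggregate capped power (watts) x predicted runtime (s).
    runtime_s = job.samples_remaining / predict_throughput(job, num_gpus, power_frac)
    return num_gpus * gpu_tdp_w * power_frac * runtime_s


def best_config(job: JobProfile, gpu_options, power_options, energy_budget_j):
    # Enumerate (GPU count, power cap) pairs; keep the fastest feasible one.
    best = None
    for n, p in product(gpu_options, power_options):
        if predict_energy(job, n, p) > energy_budget_j:
            continue  # violates the energy budget
        jct = job.samples_remaining / predict_throughput(job, n, p)
        if best is None or jct < best[0]:
            best = (jct, n, p)
    return best


job = JobProfile(base_throughput=500.0, scaling_alpha=0.9, samples_remaining=5e7)
# With this (made-up) budget, the search prefers 8 GPUs at a reduced power cap
# over 8 GPUs at full power, which would exceed the budget.
print(best_config(job, gpu_options=[1, 2, 4, 8],
                  power_options=[0.6, 0.8, 1.0], energy_budget_j=4e7))
```

In this toy model, lowering the power cap cuts energy faster than it lengthens runtime, which is why a capped configuration can be the best feasible choice; the actual trade-off curves would come from the paper's profiled performance models rather than these assumed formulas.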