Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, whose substantial energy consumption imposes significant cost burdens and environmental challenges. Scheduling training jobs across geographically distributed cloud data centers creates an opportunity to make better use of computing capacity powered by inexpensive, low-carbon energy and to mitigate workload imbalance. To tackle the challenge of multi-objective scheduling, i.e., maximizing GPU utilization while reducing operational costs, we propose an algorithm based on multi-agent reinforcement learning and actor-critic methods that learns the optimal collaborative scheduling strategy by interacting with a cloud system built from real-life workload patterns, energy prices, and carbon intensities. Compared with other algorithms, our proposed method improves system utility by up to 28.6%, owing to higher GPU utilization, lower energy cost, and lower carbon emissions.
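To make the multi-objective trade-off concrete, the sketch below scores a candidate placement with a weighted sum that rewards GPU utilization and penalizes energy cost and carbon emissions, then picks the best feasible data center greedily. This is an illustrative assumption, not the proposed algorithm: the abstract does not specify the form of the system utility, and the names `DataCenter`, `utility`, `pick_data_center`, the weights, and the units are all hypothetical. A greedy rule of this kind is better read as a simple baseline that a learned multi-agent policy would be compared against.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    free_gpus: int           # idle GPUs available right now
    total_gpus: int          # installed GPU capacity
    energy_price: float      # price per kWh (hypothetical units)
    carbon_intensity: float  # kg CO2 per kWh (hypothetical units)


def utility(dc, gpus_needed, energy_kwh, w_util=1.0, w_cost=1.0, w_carbon=1.0):
    """Weighted-sum utility of placing one job on `dc` (assumed form)."""
    if gpus_needed > dc.free_gpus:
        return float("-inf")  # infeasible: not enough idle GPUs
    used_after = dc.total_gpus - dc.free_gpus + gpus_needed
    utilization = used_after / dc.total_gpus
    cost = energy_kwh * dc.energy_price
    carbon = energy_kwh * dc.carbon_intensity
    return w_util * utilization - w_cost * cost - w_carbon * carbon


def pick_data_center(dcs, gpus_needed, energy_kwh):
    """Greedy baseline: choose the feasible site with the highest utility."""
    best = max(dcs, key=lambda dc: utility(dc, gpus_needed, energy_kwh))
    if utility(best, gpus_needed, energy_kwh) == float("-inf"):
        return None  # no site can host the job
    return best
```

With equal weights, a site offering cheap, low-carbon energy wins over an otherwise identical site, while a site with too few idle GPUs is skipped regardless of its energy profile.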