Training deep neural networks (DNNs) is becoming increasingly resource- and energy-intensive every year. Unfortunately, existing work primarily focuses on optimizing DNN training for faster completion, often without considering its impact on energy efficiency. In this paper, we observe that common practices for improving training performance often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose Zeus, an optimization framework that navigates this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements while adapting to data drift over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%-75.8% for diverse workloads.
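To make the abstract's "online exploration-exploitation" idea concrete, the following is a minimal, hypothetical sketch of how a recurring training job could navigate the energy-performance tradeoff by tuning a GPU power limit online. All names (`simulate_job`, `cost`, the `eta` knob, the candidate wattages) are illustrative assumptions, not the paper's actual API, and the profiling function is a stand-in for real GPU measurements.

```python
import random

random.seed(0)  # deterministic for reproducibility of this sketch

def simulate_job(power_limit_w):
    """Hypothetical stand-in for one recurrence of a training job.
    Returns (energy_joules, time_seconds); a real system would
    profile the GPU just-in-time instead."""
    time_s = 3600.0 * (300.0 / power_limit_w) ** 0.5  # lower power -> slower
    energy_j = power_limit_w * time_s * 0.9           # not all draw is useful work
    return energy_j, time_s

def cost(energy_j, time_s, eta=0.5, max_power_w=300.0):
    # Weighted objective trading energy against time:
    # eta=1 favors pure energy savings, eta=0 favors pure speed.
    return eta * energy_j + (1 - eta) * max_power_w * time_s

def choose_power_limit(candidates, history, epsilon=0.2):
    """Epsilon-greedy exploration-exploitation over power limits:
    try every candidate once, then mostly exploit the cheapest one."""
    untried = [p for p in candidates if p not in history]
    if untried:
        return untried[0]
    if random.random() < epsilon:
        return random.choice(candidates)
    return min(candidates, key=lambda p: history[p])

candidates = [150, 200, 250, 300]  # candidate GPU power limits in watts
history = {}
for _ in range(20):  # one iteration per recurring job submission
    p = choose_power_limit(candidates, history)
    e, t = simulate_job(p)
    c = cost(e, t)
    # Online update: exponential moving average of observed cost,
    # so the estimate can track data drift over time.
    history[p] = c if p not in history else 0.7 * history[p] + 0.3 * c

best = min(history, key=history.get)
```

The exponential moving average is one simple way to let the cost estimate adapt when workload characteristics drift between recurrences, rather than trusting a single offline measurement.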