Training deep neural networks (DNNs) is becoming more resource- and energy-intensive every year. Unfortunately, existing work primarily focuses on optimizing DNN training for faster completion, often without considering the impact on energy efficiency. In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically finding optimal job- and GPU-level configurations for recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements while adapting to data drift over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 15.3%–75.8% for diverse workloads.
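To make the idea of online exploration-exploitation over recurring jobs concrete, here is a minimal sketch. It is an illustration under stated assumptions, not Zeus's actual algorithm, which this abstract does not specify. Epsilon-greedy stands in for whatever policy Zeus uses. The names `explore_exploit`, `measure_cost`, and `toy_cost` are hypothetical, and so are the choice of (batch size, GPU power limit) pairs as the configuration and the made-up cost numbers.

```python
import random
from collections import defaultdict


def explore_exploit(configs, measure_cost, num_jobs, epsilon=0.1, seed=0):
    """Choose one configuration per recurring job with epsilon-greedy.

    measure_cost(config) is a hypothetical callback that runs (or profiles)
    one job under `config` and returns an observed cost, e.g. energy used.
    """
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    rng = random.Random(seed)
    total = defaultdict(float)  # summed observed cost per config
    count = defaultdict(int)    # number of times each config was run
    history = []

    for _ in range(num_jobs):
        untried = [c for c in configs if count[c] == 0]
        if untried:
            choice = untried[0]            # try every config once first
        elif rng.random() < epsilon:
            choice = rng.choice(configs)   # explore: random config
        else:
            # exploit: config with the lowest mean observed cost so far
            choice = min(configs, key=lambda c: total[c] / count[c])
        cost = measure_cost(choice)        # observed online, no offline sweep
        total[choice] += cost
        count[choice] += 1
        history.append((choice, cost))

    tried = [c for c in configs if count[c] > 0]
    best = min(tried, key=lambda c: total[c] / count[c])
    return best, history


def toy_cost(config):
    """Made-up cost surface (not from the paper), lowest at (64, 200)."""
    batch_size, power_limit_w = config
    return abs(batch_size - 64) / 64 + abs(power_limit_w - 200) / 100


if __name__ == "__main__":
    # Hypothetical search space: batch sizes x GPU power limits in watts.
    configs = [(b, p) for b in (32, 64, 128) for p in (150, 200, 250)]
    best, history = explore_exploit(configs, toy_cost, num_jobs=30)
    print("selected configuration:", best)
```

Because exploration continues for the lifetime of the recurring job, the policy keeps sampling configurations that currently look worse, which gives it some chance to notice when costs shift. A mean over a recent window rather than all history would be one way to react faster to drift.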