GPUs are widely used to accelerate the training of machine learning workloads. As modern machine learning models grow increasingly large, they take longer to train, leading to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect training iteration changes and collects performance counter data only when an iteration shift is detected. GPOEO employs multi-objective models based on gradient boosting and a local search algorithm to find a trade-off between execution time and energy consumption. We evaluate GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites running on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with a modest average execution time increase of 5.1%.