With the rise of AI in recent years and the increasing complexity of its models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started to question the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the hardware used. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and to the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that (1) deriving energy consumption directly from runtime is not accurate; instead, the consumption of the compute node must be considered with respect to its composition; (2) neglecting accelerator hardware on mixed nodes results in disproportionate energy inefficiency; (3) the energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads alongside a rise in user awareness of energy consumption.