Growing demand for computation and storage has significantly increased the scale of the systems that power applications and services, raising concerns about sustainability and operational costs. In this paper, we explore power-saving techniques in high-performance computing (HPC) and datacenter networks, and their relationship to performance degradation. On this basis, we propose leveraging Energy Efficient Ethernet (EEE), with the flexibility to extend to conventional Ethernet or to the upcoming Ethernet-derived versions of the BXI and Omnipath interconnects. We analyze the PerfBound proposal, identify possible improvements, and model it in a simulation framework. Through a series of experiments, we examine its impact on performance and determine the most appropriate interconnect. We also study the traffic patterns generated by selected HPC and machine learning applications to evaluate the behavior of power-saving techniques. From these experiments, we analyze how applications affect system and network energy consumption. Based on this analysis, we expose the weaknesses of dynamic power-down mechanisms and propose an approach that improves energy savings with minimal or no performance penalty. To our knowledge, this is the first power management proposal tailored to future Ethernet-based HPC architectures, and the results are promising.