The energy efficiency of neural processing units (NPUs) plays a critical role in developing sustainable data centers. Our study of different generations of NPU chips reveals that 30%-72% of their energy consumption comes from static power dissipation, owing to the lack of power management support in modern NPU chips. In this paper, we present ReGate, which enables fine-grained power gating of each hardware component in NPU chips through hardware/software co-design. Unlike conventional power-gating techniques for generic processors, enabling power gating in NPUs faces unique challenges due to fundamental differences in hardware architecture and program execution model. To address these challenges, we carefully investigate the power-gating opportunities in each component of NPU chips and determine the best-fit power management scheme for each (i.e., hardware-managed vs. software-managed power gating). Specifically, for systolic arrays (SAs), which have deterministic execution patterns, ReGate enables cycle-level power gating at the granularity of processing elements (PEs), following the inherent dataflow execution in SAs. For the inter-chip interconnect (ICI) and HBM controllers, which have long idle intervals, ReGate employs a lightweight hardware-based idle-detection mechanism. For vector units and SRAM, whose idle periods vary significantly with workload patterns, ReGate extends the NPU ISA and lets software such as compilers manage the power gating. With an implementation on a production-level NPU simulator, we show that ReGate can reduce the energy consumption of NPU chips by up to 32.8% (15.5% on average), with negligible impact on AI workload performance. The hardware implementation of the power-gating logic introduces less than 3.3% overhead in NPU chips.
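To make the idea of hardware idle detection concrete, the following is a minimal sketch of a generic timeout-based power-gating policy evaluated on a busy/idle trace. It is not ReGate's actual mechanism, which the abstract does not specify. The names `GatingParams` and `simulate`, and every numeric value (idle threshold, wake-up latency, residual leakage, transition energy), are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GatingParams:
    # All values below are hypothetical, in arbitrary units.
    idle_threshold: int = 4         # idle cycles observed before gating
    wakeup_cycles: int = 2          # stall cycles needed to power back on
    static_per_cycle: float = 1.0   # static energy per powered-on cycle
    gated_fraction: float = 0.1     # residual leakage while gated
    transition_energy: float = 3.0  # energy cost of one off/on round trip


def simulate(trace, p):
    """Evaluate a timeout-based idle-detection policy.

    trace: iterable of bools, True means the component is busy that cycle.
    Returns (baseline_energy, gated_energy, stall_cycles), where the
    baseline keeps the component powered on for every cycle of the trace.
    """
    baseline = 0.0
    gated = 0.0
    stalls = 0
    idle_run = 0
    is_gated = False
    for busy in trace:
        baseline += p.static_per_cycle
        if busy:
            if is_gated:
                # Wake up: pay the stall cycles and the transition energy.
                stalls += p.wakeup_cycles
                gated += p.wakeup_cycles * p.static_per_cycle
                gated += p.transition_energy
                is_gated = False
            idle_run = 0
            gated += p.static_per_cycle
        else:
            idle_run += 1
            if is_gated:
                gated += p.static_per_cycle * p.gated_fraction
            else:
                gated += p.static_per_cycle
                # Gate once the idle counter reaches the threshold.
                if idle_run >= p.idle_threshold:
                    is_gated = True
    return baseline, gated, stalls


if __name__ == "__main__":
    long_idle = [True] * 2 + [False] * 100 + [True] * 2
    short_idle = [True, False, False, True] * 10
    print(simulate(long_idle, GatingParams()))
    print(simulate(short_idle, GatingParams()))
```

Under these assumed parameters, a long idle interval amortizes the wake-up cost and saves static energy, while idle runs shorter than the threshold never trigger gating and so incur no stalls. This is the usual reason a timeout-based detector suits components with long idle intervals.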