GROMACS Unplugged: How Power Capping and Frequency Shapes Performance on GPUs

Molecular dynamics simulations are essential tools in computational biophysics, but their performance depends heavily on hardware choices and configuration. In this work, we present a comprehensive performance analysis of four NVIDIA GPU accelerators (A40, A100, L4, and L40) using six representative GROMACS biomolecular workloads alongside two synthetic benchmarks: Pi Solver (compute bound) and STREAM Triad (memory bound). We investigate how performance scales with GPU graphics clock frequency and how workloads respond to power capping. The two synthetic benchmarks define the extremes of frequency scaling: Pi Solver shows ideal compute scalability, while STREAM Triad reveals memory bandwidth limits, which places GROMACS's performance in context. Our results reveal distinct frequency scaling behaviors: smaller GROMACS systems exhibit strong frequency sensitivity, while larger systems saturate quickly, becoming increasingly memory bound. Under power capping, performance remains stable until architecture- and workload-specific thresholds are reached, with high-end GPUs like the A100 maintaining near-maximum performance even under reduced power budgets. Our findings provide practical guidance for selecting GPU hardware and optimizing GROMACS performance for large-scale MD workflows under power constraints.
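The two extremes of frequency scaling described in the abstract can be illustrated with a toy model. The sketch below is not the authors' methodology. It assumes that the time per step splits into a compute part that scales inversely with the graphics clock and a memory part that does not depend on it. The function name `relative_performance`, the `memory_fraction` parameter, and the clock values are hypothetical and chosen only for illustration.

```python
def relative_performance(freq_mhz, max_freq_mhz, memory_fraction):
    """Toy estimate of performance at freq_mhz relative to max_freq_mhz.

    Assumption (not from the paper): at the maximum clock, a share
    `memory_fraction` of the runtime is memory bound and does not change
    with the graphics clock, while the remaining share scales as 1/f.
    """
    if freq_mhz <= 0 or max_freq_mhz <= 0:
        raise ValueError("frequencies must be positive")
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must lie in [0, 1]")
    # Runtime relative to the runtime at the maximum clock.
    compute_part = (1.0 - memory_fraction) * (max_freq_mhz / freq_mhz)
    relative_runtime = compute_part + memory_fraction
    return 1.0 / relative_runtime


if __name__ == "__main__":
    max_clock = 1400  # illustrative value in MHz, not a measured setting
    # memory_fraction = 0.0 mimics a purely compute-bound kernel (Pi Solver-like),
    # memory_fraction = 1.0 mimics a purely memory-bound one (STREAM Triad-like).
    for label, frac in [("compute bound", 0.0), ("mixed", 0.5), ("memory bound", 1.0)]:
        row = [relative_performance(f, max_clock, frac) for f in (700, 1050, 1400)]
        print(label, [round(r, 3) for r in row])
```

In this model a compute-bound workload loses performance in proportion to the clock reduction, a memory-bound workload is unaffected, and real workloads fall somewhere in between. The model ignores any effect the graphics clock may have on achieved memory bandwidth, so it is only a rough way to reason about where a given GROMACS system sits between the two synthetic benchmarks.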