Classical simulation of quantum algorithms is notoriously resource intensive, a difficulty intrinsic to the fundamental promise of quantum computing. Nonetheless, simulation is critical to the success of the field: it is a requirement for algorithm development and validation as well as for hardware design. GPU acceleration has become standard practice for simulation, and because classical methods scale exponentially with system size, multi-GPU simulation can be required to reach representative problem sizes. In that regime, inter-GPU communication can bottleneck performance. In this work, we introduce MPI support into the QED-C Application-Oriented Benchmarks to facilitate benchmarking on HPC systems. We review advances in interconnect technology and the APIs available for multi-GPU communication. We benchmark across a variety of interconnect paths, including the recent NVIDIA Grace Blackwell NVL72 architecture, the first product to extend high-bandwidth GPU-specialized interconnects across multiple nodes. We show that while improvements in GPU architecture have delivered speedups of over 4.5X across the last few generations of GPUs, advances in interconnect performance have had a larger impact, improving time to solution for multi-GPU simulations by over 16X.
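To make the communication pattern concrete, the sketch below is an illustration of why interconnect bandwidth dominates distributed statevector simulation; it is not code from the benchmarks, and the partitioning scheme, `mpi4py`/NumPy usage, and the helper `apply_x_global` are our assumptions for illustration. When the statevector is split evenly across ranks, the top log2(nranks) qubits become "global": a gate on a global qubit forces each rank to exchange its entire local partition with a partner, so time to solution is bounded by interconnect throughput rather than compute.

```python
# Minimal sketch (illustrative, not the benchmarks' code): a statevector of
# n_qubits is split evenly across MPI ranks (power-of-two rank count assumed),
# so the top log2(nranks) qubits are "global". Applying an X gate to a global
# qubit swaps whole local partitions between paired ranks -- exactly the bulk
# traffic that stresses the GPU interconnect in a real simulator.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

n_qubits = 20                                   # total qubits (assumed)
local_len = (2 ** n_qubits) // nranks           # amplitudes held by this rank
state = np.zeros(local_len, dtype=np.complex128)
if rank == 0:
    state[0] = 1.0                              # initialize |0...0>

def apply_x_global(state, qubit_from_top):
    """X on a global qubit = exchange local slices with a partner rank.

    qubit_from_top=0 is the most-significant (highest-order) global qubit.
    """
    stride = nranks >> (qubit_from_top + 1)     # rank distance to partner
    partner = rank ^ stride                     # flip that bit of the rank id
    recv = np.empty_like(state)
    # The entire local partition crosses the interconnect in one exchange.
    comm.Sendrecv(state, dest=partner, recvbuf=recv, source=partner)
    return recv

state = apply_x_global(state, qubit_from_top=0)
```

Run with, e.g., `mpiexec -n 4 python sketch.py`. Per global-qubit gate, each rank moves its full partition (2^n_qubits / nranks amplitudes) across the interconnect, which is why the NVLink-class fabrics benchmarked in this work, rather than GPU compute throughput, set the scaling limit for multi-GPU simulation.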