Low-Level and NUMA-Aware Optimization for High-Performance Quantum Simulation

Scalable classical simulation of quantum circuits is crucial for advancing quantum algorithm development and validating emerging hardware. This work improves performance through targeted low-level and NUMA-aware tuning on a single-node system. It raises the efficiency of classical quantum simulation and lays a foundation for scalable, heterogeneous implementations that bridge toward noiseless quantum computing. Although a few prior studies have reported similar hardware-level optimizations, their implementations have not been released as open-source software, which limits independent validation and further development. We introduce an open-source, high-performance extension to the QuEST state-vector simulator that integrates state-of-the-art low-level and NUMA-aware optimizations for modern processors. Our approach emphasizes locality-aware computation and incorporates hardware-specific techniques, including NUMA-aware memory allocation, thread pinning, AVX-512 vectorization, aggressive loop unrolling, and explicit memory prefetching. Experiments demonstrate substantial speedups: 5.5-6.5x for single-qubit gate operations, 4.5x for two-qubit gates, 4x for Random Quantum Circuits (RQC), and 1.8x for the Quantum Fourier Transform (QFT). Algorithmic workloads achieve a further 4.3-4.6x acceleration for Grover and 2.5x for Shor-like circuits. These results show that systematic, architecture-aware tuning can significantly extend the practical simulation capacity of classical quantum simulators on current hardware.
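To make the locality argument concrete, the following is a minimal sketch of a scalar single-qubit gate kernel on a state vector. It is an illustration under stated assumptions, not the code of the extension described above: the function name `apply_single_qubit_gate`, its signature, and the row-major layout of the gate matrix are hypothetical and are not part of the QuEST API. The sketch shows only the memory-access pattern (pairs of amplitudes separated by a stride of 2^target) that techniques such as vectorization, unrolling, and prefetching would typically act on.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper (not a QuEST function): applies the 2x2 gate
 * u = {u00, u01, u10, u11} (row-major, an assumed layout) to qubit
 * `target` of a state vector holding 2^num_qubits amplitudes. */
void apply_single_qubit_gate(double complex *state, unsigned num_qubits,
                             unsigned target, const double complex u[4])
{
    assert(target < num_qubits);

    const size_t dim = (size_t)1 << num_qubits;  /* number of amplitudes */
    const size_t stride = (size_t)1 << target;   /* distance within a pair */

    /* The vector splits into blocks of 2*stride amplitudes. Within each
     * block, amplitude i (target bit 0) pairs with i + stride (target bit 1). */
    for (size_t block = 0; block < dim; block += 2 * stride) {
        /* Both access streams in this inner loop are contiguous, so it is
         * the natural candidate for SIMD and unrolling. A small target
         * gives a short inner loop; a large target gives two streams that
         * are far apart in memory. */
        for (size_t i = block; i < block + stride; ++i) {
            const double complex a0 = state[i];
            const double complex a1 = state[i + stride];
            state[i]          = u[0] * a0 + u[1] * a1;
            state[i + stride] = u[2] * a0 + u[3] * a1;
        }
    }
}
```

For example, applying a Pauli-X matrix `{0, 1, 1, 0}` to qubit 1 of the two-qubit state |00> moves the unit amplitude from index 0 to index 2. In a tuned version, the outer loop is the one that would plausibly be divided among pinned threads, with each thread working on memory allocated on its own NUMA node.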
