DIAMOND: Systolic Array Acceleration of Sparse Matrix Multiplication for Quantum Simulation

Hamiltonian simulation is a key workload in quantum computing: it enables the study of complex quantum systems and serves as a critical tool for classical verification of quantum devices. It is computationally challenging, however, because the Hilbert space dimension grows exponentially with the number of qubits. This growth makes matrix exponentiation, the key kernel in Hamiltonian simulation, increasingly expensive. Matrix exponentiation is typically approximated by a truncated Taylor series, whose evaluation reduces to a sequence of matrix multiplications. Because the Hermitian operators involved are often sparse, sparse matrix multiplication accelerators are essential for improving the scalability of classical Hamiltonian simulation. Existing accelerators, however, are designed primarily for machine learning workloads and tuned to their characteristic sparsity patterns, which differ fundamentally from the patterns in Hamiltonian simulation, where structured diagonals often dominate. In this work, we present DIAMOND, the first diagonal-optimized quantum simulation accelerator. It exploits the diagonal structure commonly found in problem-Hamiltonian (Hermitian) matrices and uses a restructured systolic array dataflow to turn diagonally sparse matrix multiplication into dense computation, enabling high utilization and performance. In detailed cycle-level simulation of diverse benchmarks from HamLib, DIAMOND achieves average speedups of $10.26\times$, $33.58\times$, and $53.15\times$ over SIGMA, Outer Product, and Gustavson's algorithm, respectively, with peak speedups of up to $127.03\times$. Compared to SIGMA, it also reduces energy consumption by $471.55\times$ on average and by up to $4630.58\times$.
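
To make the kernel concrete, the following is a minimal software sketch of a truncated Taylor series for $e^{-iHt}$ evaluated on matrices stored by diagonal. It is not the DIAMOND dataflow, which the abstract does not specify. The storage layout (a dictionary mapping a diagonal offset to a dense 1-D array) and the names `diag_matmul`, `to_dense`, and `taylor_expm` are assumptions made for illustration. The sketch relies on one fact: the product of a diagonal at offset $d_a$ with a diagonal at offset $d_b$ lands entirely on offset $d_a + d_b$, so each pair of diagonals reduces to a dense elementwise multiply of two aligned vector segments.

```python
import numpy as np

def diag_matmul(A, B, n):
    """Multiply two n x n matrices stored as {offset: 1-D diagonal array}.

    A diagonal at offset d has length n - |d|; its k-th entry sits at
    row k + max(0, -d), column row + d. (Illustrative layout, an assumption.)
    """
    C = {}
    for da, a in A.items():
        for db, b in B.items():
            d = da + db  # every product of these two diagonals lands here
            # Rows i for which A[i, i+da] and B[i+da, i+d] both exist.
            lo = max(0, -da, -d)
            hi = min(n, n - da, n - d)
            if lo >= hi:
                continue  # the two diagonals do not overlap
            a_seg = a[lo - max(0, -da): hi - max(0, -da)]
            b_seg = b[lo + da - max(0, -db): hi + da - max(0, -db)]
            c = C.setdefault(d, np.zeros(n - abs(d), dtype=complex))
            # Dense elementwise multiply-accumulate, no index lookups.
            c[lo - max(0, -d): hi - max(0, -d)] += a_seg * b_seg
    return C

def to_dense(D, n):
    """Expand the diagonal dictionary into a dense n x n array (for checking)."""
    M = np.zeros((n, n), dtype=complex)
    for d, v in D.items():
        M += np.diag(v, k=d)
    return M

def taylor_expm(H, n, t, order=20):
    """Approximate exp(-i H t) by sum_{k=0}^{order} (-i H t)^k / k!."""
    X = {d: (-1j * t) * v for d, v in H.items()}
    term = {0: np.ones(n, dtype=complex)}    # k = 0 term: identity
    result = {0: np.ones(n, dtype=complex)}
    for k in range(1, order + 1):
        # term_k = term_{k-1} @ X / k, one matrix multiplication per order
        term = {d: v / k for d, v in diag_matmul(term, X, n).items()}
        for d, v in term.items():
            result[d] = result.get(d, 0) + v
    return result
```

For a tridiagonal Hermitian `H` given as `{0: main, 1: upper, -1: upper.conj()}`, `taylor_expm(H, n, t=0.1)` returns the approximate propagator in the same diagonal layout. Each multiplication can add diagonals to the running term, so the number of stored diagonals grows with the Taylor order until it is capped by the matrix dimension.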
