Because it performs poorly on many devices, sparse matrix-vector multiplication (SpMV) normally requires special care in how the matrix is stored and how the kernel is tuned for a given device. However, SpMV is one of the most important kernels in high-performance computing (HPC), so a storage format and tuning method are needed that allow efficient SpMV operations with low memory and tuning overheads across heterogeneous devices. Additionally, the primary users of SpMV operations in HPC are normally application scientists who already depend on numerous other libraries that use some standard sparse matrix storage format. As such, the ideal heterogeneous format would also be easy to understand and would require no major changes to be translated into a standard sparse matrix format, such as compressed sparse row (CSR). This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly, requires minimal memory overhead, and on average outperforms NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels, while being on par with Intel MKL on our test suite. Additionally, CSR-k does not need any conversion to be used by standard library calls that require CSR-format input. In particular, CSR-k achieves this by grouping rows into a hierarchical structure of super-rows and super-super-rows that are represented by just a few extra arrays of pointers (i.e., <2.5% memory overhead to keep arrays for both GPU and CPU execution). Because the format is simple, a model can be tuned for a device, and this model can then be used to select super-row and super-super-row sizes in constant time. We observe in this paper that CSR-k achieves about a 17.3% improvement on an NVIDIA V100 and about an 18.9% improvement on an NVIDIA A100 over NVIDIA's cuSPARSE, while still performing on par with Intel MKL on an Intel Xeon Platinum 8380 and an AMD Epyc 7742.
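The hierarchical grouping described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that super-rows are runs of consecutive rows and super-super-rows are runs of consecutive super-rows, each of a fixed size chosen by hand; the names `group_ptr`, `sr_ptr`, `ssr_ptr`, and `spmv_csr3` are hypothetical. The point it shows is that the standard CSR arrays are left untouched and only two small pointer arrays are added on top of them.

```python
from typing import List, Sequence


def group_ptr(n_items: int, group_size: int) -> List[int]:
    """Pointer array that splits n_items consecutive items into groups
    of at most group_size (fixed-size grouping is an assumption)."""
    ptr = list(range(0, n_items, group_size))
    ptr.append(n_items)
    return ptr


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             vals: Sequence[float], x: Sequence[float]) -> List[float]:
    """Reference SpMV over plain CSR arrays."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def spmv_csr3(row_ptr: Sequence[int], col_idx: Sequence[int],
              vals: Sequence[float], sr_ptr: Sequence[int],
              ssr_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """SpMV that walks super-super-rows, then super-rows, then rows.
    The two outer loops are where coarse-grained work (for example
    thread blocks or CPU threads) could be assigned."""
    y = [0.0] * (len(row_ptr) - 1)
    for ssr in range(len(ssr_ptr) - 1):                   # super-super-rows
        for sr in range(ssr_ptr[ssr], ssr_ptr[ssr + 1]):  # super-rows
            for i in range(sr_ptr[sr], sr_ptr[sr + 1]):   # ordinary CSR rows
                acc = 0.0
                for k in range(row_ptr[i], row_ptr[i + 1]):
                    acc += vals[k] * x[col_idx[k]]
                y[i] = acc
    return y


# A small 5x5 example matrix in standard CSR form.
row_ptr = [0, 2, 3, 5, 6, 8]
col_idx = [0, 1, 1, 0, 2, 3, 2, 4]
vals = [2.0, 1.0, 3.0, 1.0, 4.0, 5.0, 1.0, 6.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Hypothetical group sizes: 2 rows per super-row, 2 super-rows per super-super-row.
sr_ptr = group_ptr(len(row_ptr) - 1, 2)   # [0, 2, 4, 5]
ssr_ptr = group_ptr(len(sr_ptr) - 1, 2)   # [0, 2, 3]

y = spmv_csr3(row_ptr, col_idx, vals, sr_ptr, ssr_ptr, x)
```

In this sketch `row_ptr`, `col_idx`, and `vals` can be passed unchanged to any routine that expects CSR input, and the extra storage is only `sr_ptr` and `ssr_ptr`, whose lengths shrink with the chosen group sizes.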