Although matrix multiplication plays a vital role in computational linear algebra, few efficient solutions exist for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fills the performance gap left by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM implementations fail to exploit the performance potential of GPUs. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. We propose several performance optimizations, including a redesign of the algorithm to adapt to GPU thread parallelism, blocking strategies for memory access optimization, and acceleration with Tensor Cores. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load-balancing scheme. We evaluate cuSpAMM on both synthesized and real-world datasets on multiple GPUs. The experimental results show that cuSpAMM achieves significant speedup over the vendor-optimized cuBLAS and cuSPARSE libraries.
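To make the SpAMM idea concrete, the following is a minimal sketch of the classic recursive formulation: the product is computed block-recursively, and a quadrant product is skipped whenever the product of its operands' Frobenius norms falls below a tolerance. The function name, leaf size, and power-of-two shape assumption are illustrative only; this is not the paper's cuSpAMM implementation.

```python
import numpy as np

def spamm(A, B, tol, leaf=64):
    """Minimal recursive SpAMM sketch (illustrative, not cuSpAMM).
    Assumes square matrices whose size is leaf * 2**k."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # dense multiply at the leaf level
    h = n // 2
    C = np.zeros((n, n))
    for i in range(2):
        for j in range(2):
            for k in range(2):
                Aik = A[i*h:(i+1)*h, k*h:(k+1)*h]
                Bkj = B[k*h:(k+1)*h, j*h:(j+1)*h]
                # SpAMM pruning criterion: skip block products whose
                # norm product cannot contribute more than tol.
                if np.linalg.norm(Aik) * np.linalg.norm(Bkj) >= tol:
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += spamm(Aik, Bkj, tol, leaf)
    return C
```

For near-sparse inputs, most block norms are small, so the criterion prunes the bulk of the recursive calls while bounding the error contributed by each skipped product.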