Sparse general matrix multiplication (SpGEMM) is an important and expensive computation primitive in many real-world applications. Due to SpGEMM's inherent irregularity and the vast diversity of its input matrices, developing high-performance SpGEMM implementation on modern processors such as GPUs is challenging. The state-of-the-art SpGEMM libraries (i.e., $nsparse$ and $spECK$) adopt several algorithms to tackle the challenges of global load balance, local load balance, and allocation of the result matrix. While these libraries focus on the high-level algorithm design for SpGEMM, they neglect several low-level architecture-specific optimizations, which causes inefficient implementations in their libraries. In this paper, we classify their inefficient implementations into seven categories. Based on our observations, we propose a highly optimized SpGEMM library called $OpSparse$. The optimizations in $OpSparse$ include 1) optimizing the binning method by improving the utilization of the shared memory, 2) optimizing the hashing method by reducing the access to the hash table, 3) improving the trade-off between hash collision rate and hardware utilization in the hashing method by setting appropriate binning ranges, 4) reducing the overheads of global memory utilization by minimizing the global memory usage of the metadata, and 5) improving the execution parallelism by overlapping global memory allocation with kernel execution. Performance evaluations with 26 commonly used matrices on an Nvidia Tesla V100 GPU show that $OpSparse$ achieves up to $27.8\times$, $1.81\times$, and $2.04\times$ performance speedup over three state-of-the-art libraries: $cuSPARSE$, $nsparse$, and $spECK$, respectively.
翻译:SpGEMM (SpGEMM) 是许多现实世界应用程序中一个重要而昂贵的计算方法。 由于 SpGEMM 的内在不规则性及其投入矩阵的丰富多样性, 使 SpGEMM 在诸如 GPUs 等现代处理器上执行高绩效 SpGEMM 具有挑战性。 SpGEMM 图书馆( 即 美元 和 美元 和 美元 SPECK ) 采用多种算法来应对全球负载平衡、 本地负载平衡和结果矩阵分配等挑战。 虽然这些图书馆侧重于 SpGEMM 的高级算法设计, 但却忽略了多个低层次的架构优化, 导致其图书馆执行效率低下。 在本文中,我们根据我们的意见, 高优化 SpGEMM 图书馆( 即 美元 和 美元 美元 美元 ), 优化 OpSpare, 包括1 改进对共享记忆的利用, 实现优化 方法, 减少 SGGSOal- missionalalalal lexalalalalalal livesessal 的利用, ral- dreal 和 4 使用全球内部平流法 。