Sparse General Matrix Multiply (SpGEMM) is a key kernel for many High-Performance Computing (HPC) applications, such as genomics and graph analytics. Through the semiring abstraction, many algorithms can be formulated as SpGEMM, which allows the addition and multiplication operators, as well as the numeric types, to be redefined. Today, large input matrices require distributed-memory parallelism to avoid disk I/O, and modern HPC machines with GPUs can greatly accelerate linear algebra computation. In this paper, we implement a GPU-based distributed-memory SpGEMM routine on top of the CombBLAS library. Our implementation achieves a speedup of over 2x compared to the CPU-only CombBLAS implementation and up to 3x compared to PETSc for large input matrices. Furthermore, we observe that inter-process communication can travel either a host-to-host or a device-to-device path, and which path is faster depends on the message size. To exploit this, we introduce a hybrid communication scheme that dynamically switches the data path based on the message size, improving runtimes in communication-bound scenarios.
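To illustrate the semiring abstraction mentioned above, here is a minimal sketch (not the CombBLAS API; all names are invented for illustration) of a Gustavson-style row-by-row SpGEMM parameterized by a semiring. Swapping the classical (+, *) semiring for the tropical (min, +) semiring turns the same kernel into a shortest-path relaxation step, without touching the multiply routine itself:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Classical (+, *) semiring over doubles.
struct PlusTimes {
    static double zero() { return 0.0; }
    static double add(double a, double b) { return a + b; }
    static double mul(double a, double b) { return a * b; }
};

// Tropical (min, +) semiring: the same SpGEMM kernel then performs a
// shortest-path relaxation step instead of numerical multiplication.
struct MinPlus {
    static double zero() { return std::numeric_limits<double>::infinity(); }
    static double add(double a, double b) { return std::min(a, b); }
    static double mul(double a, double b) { return a + b; }
};

// Toy CSR matrix (illustrative, not a CombBLAS type).
struct Csr {
    int rows = 0, cols = 0;
    std::vector<int> rowptr{0}, colidx;
    std::vector<double> vals;
};

// Gustavson-style row-by-row SpGEMM over an arbitrary semiring SR.
template <class SR>
Csr spgemm(const Csr& A, const Csr& B) {
    Csr C;
    C.rows = A.rows;
    C.cols = B.cols;
    std::vector<double> acc(B.cols);      // dense accumulator for one row of C
    std::vector<char> seen(B.cols, 0);    // marks columns touched in this row
    std::vector<int> touched;
    for (int i = 0; i < A.rows; ++i) {
        touched.clear();
        for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p) {
            int k = A.colidx[p];
            for (int q = B.rowptr[k]; q < B.rowptr[k + 1]; ++q) {
                int j = B.colidx[q];
                double prod = SR::mul(A.vals[p], B.vals[q]);
                if (!seen[j]) { seen[j] = 1; acc[j] = prod; touched.push_back(j); }
                else          { acc[j] = SR::add(acc[j], prod); }
            }
        }
        std::sort(touched.begin(), touched.end());
        for (int j : touched) {           // flush the accumulated row into C
            C.colidx.push_back(j);
            C.vals.push_back(acc[j]);
            seen[j] = 0;
        }
        C.rowptr.push_back(static_cast<int>(C.colidx.size()));
    }
    return C;
}
```

The hybrid communication scheme can be sketched in the same spirit. The following is an assumption-laden illustration, not the paper's code: it presumes a CUDA-aware MPI build, and the threshold value, the function name `hybrid_send`, and the direction of the switch (which path wins at which size is machine-dependent) are all hypothetical tuning choices:

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical crossover point; the optimal value is machine-dependent
// and would be tuned per system, not taken from the paper.
static const std::size_t MSG_THRESHOLD = 1 << 20;  // 1 MiB

// Send `count` doubles that live in GPU memory, choosing the data path
// by message size.
void hybrid_send(const double* d_buf, std::size_t count, int dest, MPI_Comm comm) {
    std::size_t bytes = count * sizeof(double);
    if (bytes >= MSG_THRESHOLD) {
        // Large message: hand the device pointer straight to CUDA-aware MPI
        // so the transfer can stay device-to-device.
        MPI_Send(d_buf, static_cast<int>(count), MPI_DOUBLE, dest, 0, comm);
    } else {
        // Small message: stage through host memory and send host-to-host,
        // which can be cheaper for latency-bound transfers.
        std::vector<double> h_buf(count);
        cudaMemcpy(h_buf.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
        MPI_Send(h_buf.data(), static_cast<int>(count), MPI_DOUBLE, dest, 0, comm);
    }
}
```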

