Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in numerous high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM is the substantial communication overhead, which limits both performance and scalability. In this paper, we identify and analyze two levels of inefficient communication in existing distributed SpMM implementations and address them by proposing: (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that integrates the sparsity-aware strategy with the two-tier network architecture common in GPU-accelerated systems, reducing redundant communication across slow network links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that our framework scales strongly up to 128 GPUs, achieving geometric-mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at that scale.
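To make contribution (1) concrete, the sketch below illustrates the sparsity-aware idea on a single rank: inspect the nonzero column indices of the local sparse partition and request only the corresponding rows of the dense operand, rather than a full row block. This is a minimal sketch, assuming C = A B with A sparse (CSR) and B dense, both row-partitioned across ranks; the helper names (`rows_needed`, `plan_requests`, `owner_of_row`) and the use of SciPy are illustrative assumptions, not the paper's implementation, and the actual GPU communication is omitted.

```python
# Minimal sketch (not the paper's implementation) of sparsity-aware communication:
# each rank fetches only the rows of the dense matrix B that its local sparse
# block A actually references. `owner_of_row` is a hypothetical partition map.
import numpy as np
import scipy.sparse as sp

def rows_needed(local_A: sp.csr_matrix) -> np.ndarray:
    """Columns of the local sparse block holding at least one nonzero.

    In C = A @ B, these column indices are exactly the rows of B this rank
    must receive; every other row of B would only be multiplied by zeros.
    """
    return np.unique(local_A.indices)

def plan_requests(local_A: sp.csr_matrix, owner_of_row):
    """Group the needed rows of B by the rank that owns them."""
    plan = {}
    for j in rows_needed(local_A).tolist():
        plan.setdefault(owner_of_row(j), []).append(j)
    return plan

# Toy example: a 4x6 local block whose nonzeros touch only columns {0, 3, 5},
# so only 3 of B's 6 rows need to be communicated to this rank.
A = sp.csr_matrix(([1.0, 2.0, 3.0], ([0, 2, 3], [3, 0, 5])), shape=(4, 6))
print(rows_needed(A))                      # [0 3 5]
print(plan_requests(A, lambda j: j // 3))  # {0: [0], 1: [3, 5]}
```

In a two-tier setting, one natural extension of such a plan is to merge the per-GPU requests within a node before issuing them, in the spirit of contribution (2), so that requested rows traverse the slower inter-node links fewer times.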

