With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication has become a critical bottleneck in large-scale distributed and parallel processing. Large message sizes in MPI collectives are a particular concern because they can significantly degrade overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors within MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called C-Coll, which leverages error-bounded lossy compression to significantly reduce message sizes, resulting in a substantial reduction in communication cost. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for the two types of MPI collectives (collective data movement and collective computation), designed around their particular characteristics. Our frameworks not only reduce communication cost but also preserve data accuracy. (2) We customize an optimized version of SZx, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate C-Coll into multiple collectives, such as MPI_Allreduce, MPI_Scatter, and MPI_Bcast, and perform a comprehensive evaluation on real-world scientific datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines and related efforts by 3.5-9.7X.