As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.1x in Charm++, 11.7x in AMPI, and 17.4x in Charm4py. We also observe increases in bandwidth of up to 10.1x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.