Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can deliver several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We place particular focus on improving the intra-node locality of workloads. In comparison to a theoretical performance model, our implementation exhibits strong scaling from one to $64$ devices at $50\%$--$87\%$ efficiency in sixth-order stencil computations when the problem domain consists of $256^3$--$1024^3$ cells.
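To make the communication scheme concrete, the following is a minimal sketch of a halo exchange with CUDA-aware MPI, in which device pointers are passed directly to MPI calls so that no explicit staging through host buffers is needed. The one-dimensional decomposition along $z$, the periodic neighbors, the double-precision buffers, and all identifiers (NX, NY, NZ, NGHOST, send_up, ...) are illustrative assumptions for this sketch, not the implementation evaluated in this work; only the three-cell ghost zone follows from the sixth-order stencil mentioned above.

\begin{verbatim}
/* Minimal sketch of a CUDA-aware MPI halo exchange between neighboring
   ranks. All names and sizes are hypothetical. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 256
#define NY 256
#define NGHOST 3              /* a sixth-order stencil needs a 3-cell halo */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1D decomposition along z with periodic wrap-around. */
    int up   = (rank + 1) % nprocs;
    int down = (rank - 1 + nprocs) % nprocs;

    /* Halo slabs live in device memory; CUDA-aware MPI accepts the
       device pointers directly in the send/receive calls. */
    size_t count = (size_t)NX * NY * NGHOST;
    double *send_up, *send_down, *recv_up, *recv_down;
    cudaMalloc((void **)&send_up,   count * sizeof(double));
    cudaMalloc((void **)&send_down, count * sizeof(double));
    cudaMalloc((void **)&recv_up,   count * sizeof(double));
    cudaMalloc((void **)&recv_down, count * sizeof(double));

    /* ... pack the boundary slabs of the simulation field into
       send_up/send_down with a CUDA kernel (omitted) ... */

    MPI_Request reqs[4];
    MPI_Irecv(recv_down, (int)count, MPI_DOUBLE, down, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_up,   (int)count, MPI_DOUBLE, up,   1,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_up,   (int)count, MPI_DOUBLE, up,   0,
              MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send_down, (int)count, MPI_DOUBLE, down, 1,
              MPI_COMM_WORLD, &reqs[3]);

    /* Interior stencil points can be computed here, overlapping
       computation with communication, before the halos are needed. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... unpack recv_up/recv_down into the ghost zones (omitted) ... */

    cudaFree(send_up); cudaFree(send_down);
    cudaFree(recv_up); cudaFree(recv_down);
    MPI_Finalize();
    return 0;
}
\end{verbatim}

Passing device buffers straight to MPI_Isend/MPI_Irecv, as sketched here, is the defining feature of CUDA-aware MPI; overlapping the interior stencil computation with the outstanding requests is one common way to hide the communication cost on communication-heavy problem sizes.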