Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide multiple times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving the intra-node locality of workloads. Our GPU implementation scales strongly from one to $64$ devices at $50\%$--$87\%$ of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits a $20$--$60\times$ speedup and $9$--$12\times$ improved energy efficiency in compute-bound benchmarks on $16$ nodes.
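To make the CUDA-aware MPI idea concrete, the following is a minimal, illustrative sketch of a one-dimensional halo exchange in which device pointers are handed directly to MPI, the mechanism such a communication scheme relies on. It is not the paper's implementation; the names exchange_halos, d_field, interior, and halo are hypothetical, and the actual solver exchanges multi-dimensional ghost zones for high-order stencils.

// Sketch: halo exchange with CUDA-aware MPI (device buffers passed to MPI directly).
// Assumes d_field was allocated with cudaMalloc and laid out as
// [left ghost | interior | right ghost], each ghost zone holding `halo` cells.
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halos(double *d_field, int interior, int halo,
                    int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    // Receive ghost zones directly into device memory.
    MPI_Irecv(d_field,                   halo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(d_field + halo + interior, halo, MPI_DOUBLE, right, 1, comm, &reqs[1]);

    // Send the outermost interior cells from device memory; a CUDA-aware MPI
    // library moves the GPU buffers (e.g. via GPUDirect RDMA or internal staging)
    // without an explicit cudaMemcpy to the host.
    MPI_Isend(d_field + interior, halo, MPI_DOUBLE, right, 0, comm, &reqs[2]);
    MPI_Isend(d_field + halo,     halo, MPI_DOUBLE, left,  1, comm, &reqs[3]);

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}

Passing MPI_PROC_NULL as a neighbor rank turns the corresponding sends and receives into no-ops, which handles non-periodic domain boundaries without extra branching.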