Asynchronous tasks, when created with overdecomposition, enable automatic computation-communication overlap, which can substantially improve performance and scalability. This applies not only to traditional CPU-based systems but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks, in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using Jacobi3D, a proxy application that performs the Jacobi iterative method on GPUs. In addition to optimizations for minimizing host-device synchronization and increasing the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to combat fine-grained overheads at scale.