Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks, which makes it easy to distribute work and to tightly overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple GPU work aggregation strategies within Octo-Tiger, add one new strategy, and evaluate their node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.
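To illustrate the core idea behind work aggregation, the sketch below shows how several small, task-sized work items can be batched into a single Kokkos kernel launch instead of being launched individually. This is a minimal conceptual example and not Octo-Tiger's actual aggregation code; the names `WorkItem` and `aggregate_and_launch`, as well as the placeholder computation, are hypothetical and chosen purely for illustration.

```cpp
// Conceptual sketch of GPU work aggregation (not Octo-Tiger's implementation).
// WorkItem and aggregate_and_launch are hypothetical names for illustration.
#include <Kokkos_Core.hpp>
#include <vector>

// One fine-grained task's worth of work: on its own, too small to
// saturate a GPU when launched as a separate kernel.
struct WorkItem {
  int offset;  // start index into the shared data buffer
  int count;   // number of elements this task touches
};

// Instead of one kernel launch per WorkItem, collect a batch and launch
// a single Kokkos kernel over the combined index range.
void aggregate_and_launch(const std::vector<WorkItem>& batch,
                          Kokkos::View<double*> data) {
  // Flatten the batch into device-visible offset/count arrays.
  const int n = static_cast<int>(batch.size());
  Kokkos::View<int*> offsets("offsets", n);
  Kokkos::View<int*> counts("counts", n);
  auto h_off = Kokkos::create_mirror_view(offsets);
  auto h_cnt = Kokkos::create_mirror_view(counts);
  for (int i = 0; i < n; ++i) {
    h_off(i) = batch[i].offset;
    h_cnt(i) = batch[i].count;
  }
  Kokkos::deep_copy(offsets, h_off);
  Kokkos::deep_copy(counts, h_cnt);

  // One kernel over all aggregated work items: each team handles one
  // item, so the device sees enough parallelism to stay busy.
  Kokkos::parallel_for("aggregated_kernel",
      Kokkos::TeamPolicy<>(n, Kokkos::AUTO),
      KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
        const int item = team.league_rank();
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, counts(item)),
            [&](const int j) {
              data(offsets(item) + j) *= 2.0;  // placeholder computation
            });
      });
}
```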