Current supercomputers contain a heterogeneous mix of accelerator cards, ranging from NVIDIA GPUs to AMD and Intel GPUs, which makes portability a key concern for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels, and we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows a fine-grained interleaving of portable CPU/GPU computation and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, wasting CPU time that could be spent on other work. With an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (such as the CUDA one), but not for others (such as the SYCL one). In this work, we begin making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn lets us extend HPX-Kokkos to fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels on both an NVIDIA A100 and an AMD MI100. We find modest yet noticeable speedups from enabling this integration, even in simple single-node Octo-Tiger scenarios where communication and CPU utilization are not yet an issue.
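The difference between fencing on a kernel and treating its completion event as a task can be illustrated with a small sketch. It uses only the C++ standard library and is not the HPX, Kokkos, or SYCL API: `FakeEvent`, `launch_fake_kernel`, `run_with_fence`, and `run_with_polling` are hypothetical names, and a `std::future` merely stands in for a device event. Event polling is assumed here purely for illustration; the abstract does not say which two implementation strategies the authors compare.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for a device event: a std::future plays that role.
using FakeEvent = std::future<int>;

// Launches a fake asynchronous "kernel" and returns immediately.
FakeEvent launch_fake_kernel(int input) {
    return std::async(std::launch::async, [input] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        return input * 2;
    });
}

// Fence-style use: the calling thread blocks until the kernel is done.
int run_with_fence(int input) {
    FakeEvent event = launch_fake_kernel(input);
    return event.get();  // nothing else runs on this thread meanwhile
}

// A pending kernel paired with the work that depends on its result.
struct PendingKernel {
    FakeEvent event;
    std::function<void(int)> continuation;
};

struct PollingResult {
    int kernel_sum;       // sum of all kernel results
    long cpu_tasks_done;  // units of other work done while kernels ran
};

// Task-style use: a scheduler loop polls the events without blocking and
// runs each continuation once its event is ready, doing other work between polls.
PollingResult run_with_polling(const std::vector<int>& inputs) {
    std::vector<PendingKernel> pending;
    int sum = 0;
    for (int x : inputs) {
        pending.push_back({launch_fake_kernel(x), [&sum](int r) { sum += r; }});
    }

    long cpu_tasks = 0;
    while (!pending.empty()) {
        for (std::size_t i = 0; i < pending.size();) {
            assert(pending[i].event.valid());
            // Zero-timeout wait: a non-blocking "is the event complete?" check.
            if (pending[i].event.wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready) {
                pending[i].continuation(pending[i].event.get());
                pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(i));
            } else {
                ++i;
            }
        }
        ++cpu_tasks;  // placeholder for a real CPU task scheduled between polls
    }
    return {sum, cpu_tasks};
}
```

In this sketch, `run_with_fence` leaves the calling thread idle for the whole kernel duration, whereas `run_with_polling` keeps it available for other work and only reacts once an event reports completion. That is the motivation the abstract gives for mapping SYCL events onto HPX tasks.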