Octo-Tiger, a large-scale 3D AMR code for simulating the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance portability across a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces that caused problems with SIMD vectorization and thereby hindered performance. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extension) SIMD backend. Additionally, we add the SIMD implementations that were missing from the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Ice Lake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We observe a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.
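To illustrate what writing a kernel against an explicit SIMD type can look like, the following is a minimal sketch using std::experimental::simd as shipped with libstdc++ (GCC 11 or newer, C++17). It is not taken from Octo-Tiger: the function `axpy` and the aliases `native_vec` and `scalar_vec` are hypothetical names chosen for this example, and neither Kokkos nor the SVE backend is shown. The kernel is templated on the SIMD type so that the same source can be instantiated with the native vector width or with a one-lane scalar type, which is one plausible way to compare vectorized and non-vectorized runs.

```cpp
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical aliases for this sketch (not Octo-Tiger names).
using native_vec = stdx::native_simd<double>;                    // widest ABI for the target
using scalar_vec = stdx::simd<double, stdx::simd_abi::scalar>;   // exactly one lane

// Hypothetical kernel: y[i] = a * x[i] + y[i] for i in [0, n).
// simd_t is assumed to be a std::experimental::simd specialization over double.
template <typename simd_t>
void axpy(double a, const double* x, double* y, std::size_t n) {
    constexpr std::size_t width = simd_t::size();
    const simd_t av(a);  // broadcast the scalar to all lanes
    std::size_t i = 0;
    // Main loop: process `width` elements per iteration.
    for (; i + width <= n; i += width) {
        simd_t xv(x + i, stdx::element_aligned);  // load without alignment assumption
        simd_t yv(y + i, stdx::element_aligned);
        yv = av * xv + yv;
        yv.copy_to(y + i, stdx::element_aligned);  // store the result
    }
    // Remainder loop: handle elements that do not fill a whole vector.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `axpy<native_vec>(a, x, y, n)` and `axpy<scalar_vec>(a, x, y, n)` on the same input should give matching results for exactly representable values; the ratio of their run times would be one simple, assumed way to estimate a SIMD speedup for such a kernel.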