We explore the performance and portability of the high-level programming models Julia and Python/Numba, both LLVM-based, and Kokkos on high-performance computing (HPC) nodes: AMD EPYC CPUs and MI250X graphics processing units (GPUs) on Crusher, the test bed system for Frontier, and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm against vendor-compiled C/OpenMP implementations on CPUs, and against CUDA and HIP on each GPU. Rather than focusing on kernel optimization per se, we select this naive approach because it resembles exploratory work in science and provides a lower bound on performance that isolates the effect of each programming model. Julia and Kokkos perform comparably with C/OpenMP on CPUs, while Julia implementations are competitive with CUDA and HIP on GPUs. Performance gaps are identified on NVIDIA A100 GPUs for Julia in single precision and for Kokkos, and for Python/Numba in all scenarios. We also comment on half-precision support, productivity, performance portability metrics, and platform readiness. We expect this work to contribute to the understanding and direction of high-level, high-productivity languages in HPC as the first-generation exascale systems are deployed.
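For concreteness, the following is a minimal sketch of the kind of hand-rolled, unoptimized dense matrix multiplication kernel used as the CPU baseline; the function name, square matrix size n, row-major layout, and OpenMP scheduling shown here are illustrative assumptions, not the exact benchmark code.

// Minimal sketch (assumed, for illustration): naive dense matrix
// multiplication C = A * B with a triple loop, parallelized over rows
// with OpenMP. Square matrices of size n in row-major layout are assumed.
#include <omp.h>

void matmul_naive(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];  // inner dot product
            C[i * n + j] = sum;
        }
    }
}

Equivalent naive kernels in Julia, Python/Numba, Kokkos, CUDA, and HIP follow the same triple-loop structure, which is what allows the comparison to attribute performance differences to the programming model rather than to algorithmic tuning.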