Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, using them requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has offered offloading capabilities between the host (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs, the Intel Ponte Vecchio Max 1100 and the NVIDIA A100, were released to the market, with oneAPI and GNU LLVM-backed compilation for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware on a representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and v5.0 are supported by the latest oneAPI and GNU compilers; however, support for v5.1 and v5.2 is still lacking. From the performance perspective, we found that the Ponte Vecchio (PVC) outperforms the A100 by up to 37% on the LULESH benchmark, with better performance in both computation and data movement.
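For readers unfamiliar with the host-to-accelerator offloading model referred to above, the following is a minimal sketch in C using only directives available since OpenMP v4.0. It is not taken from the paper, the OMPVV suite, or LULESH; the function name `saxpy_offload` and its parameters are hypothetical and chosen purely for illustration. Whether the loop actually executes on a GPU depends on the compiler and the offloading flags used; without them, the pragma is ignored or the region falls back to the host, and the result is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not from the paper):
   computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* target: run the region on the default device (host fallback if none).
       map(to:) copies x to the device before the region;
       map(tofrom:) copies y to the device and back to the host afterwards. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The `map` clauses make the data movement between host and device explicit, which is the part of the offloading model whose cost is measured alongside compute time in benchmarks of this kind.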