JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization

From arXiv. Extended version of a paper to appear in: Proceedings of the 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), December 17-18, 2021.

Rapid developments in computing technology have given directive-based programming models a principal role in maintaining the software portability of performance-critical applications. Such models keep the engineering cost of enabling acceleration on multiple architectures low, because programmers only need to add meta information to sequential code. Optimizing for the best possible efficiency, however, is often challenging. Directives inserted by the programmer can have side effects that limit the optimizations available to the compiler, which can degrade performance. The problem is exacerbated on multi-GPU systems, because pragmas do not adapt to such systems automatically and instead require expensive, time-consuming code adjustments by programmers. This paper introduces JACC, an OpenACC runtime framework that enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization, by which manually optimized applications can be distributed automatically while keeping their original code structure and parallelism. In some cases we show nearly linear scaling of the kernel-execution portion on NVIDIA V100 GPUs. When multiple GPUs are used adaptively, the resulting performance improvements amortize the latency of GPU-to-GPU communication.
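To make the multi-GPU cost concrete, the following is a minimal sketch of the kind of manual adjustment the abstract describes: a single-loop kernel is split by hand into one chunk per device. It is not JACC's API or its translation method. The names `chunk_t`, `chunk_for_device`, and `saxpy_multi_device` are hypothetical and introduced only for illustration. The sketch assumes an OpenACC compiler for the device-selection call; without one, the pragma is ignored and the loop runs sequentially on the host.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>   /* only needed for acc_set_device_num */
#endif

/* Hypothetical helper type: half-open index range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n iterations into num_devices nearly equal contiguous chunks.
 * The first (n % num_devices) devices receive one extra iteration. */
chunk_t chunk_for_device(size_t n, int dev, int num_devices)
{
    assert(num_devices > 0 && dev >= 0 && dev < num_devices);
    size_t base = n / (size_t)num_devices;
    size_t rem  = n % (size_t)num_devices;
    size_t d    = (size_t)dev;
    chunk_t c;
    c.begin = d * base + (d < rem ? d : rem);
    c.end   = c.begin + base + (d < rem ? 1 : 0);
    return c;
}

/* y = a * x + y, with the iteration space divided by hand across devices.
 * Assumption: when built with OpenACC, num_devices does not exceed the
 * number of attached accelerators. */
void saxpy_multi_device(size_t n, float a, const float *x, float *y,
                        int num_devices)
{
    for (int dev = 0; dev < num_devices; ++dev) {
        chunk_t c = chunk_for_device(n, dev, num_devices);
        size_t len = c.end - c.begin;
        const float *xs = x + c.begin;  /* this device's slice of x */
        float *ys = y + c.begin;        /* this device's slice of y */

#ifdef _OPENACC
        /* Route the following compute region to device `dev`. */
        acc_set_device_num(dev, acc_device_not_host);
#endif
        /* Each device copies in and out only its own slice. */
        #pragma acc parallel loop copyin(xs[0:len]) copy(ys[0:len])
        for (size_t i = 0; i < len; ++i) {
            ys[i] = a * xs[i] + ys[i];
        }
    }
}
```

As written, the chunks are launched one after another. Overlapping them would additionally need `async` clauses and a final wait per device, which is part of the hand-written bookkeeping that a runtime layer would aim to take over.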
