Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, which achieve up to a 300% improvement in a shared-memory benchmark and up to a 10x improvement in distributed device communication. Preliminary results indicate that our software incurs low overhead and achieves a 40% improvement in a distributed Jacobi proxy application while hiding the idiosyncrasies of the hardware.