We describe the performance optimizations in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark, which targets accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation uses the node's high-throughput GPU accelerators through highly optimized linear algebra libraries, and uses the entire CPU socket to perform the latency-sensitive factorization phases. We detail novel performance improvements, including a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, and several optimizations that hide MPI communication. We present performance results for this implementation on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as its scaling to multiple nodes.