Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured hardware to support this evolving application paradigm. These systems contain hundreds to tens of thousands of accelerators, enabling peta- and exa-scale levels of compute for scientific workloads. Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU (stock keeping unit). This variation occurs due to manufacturing variability and the chip's PM. However, while modern HPC systems widely employ accelerators such as GPUs, it is unclear how much this variability affects applications. Accordingly, we seek to characterize the extent of variation due to GPU PM in modern HPC and supercomputing systems. We study a variety of applications that stress different GPU components on five large-scale computing centers with modern GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Frontera and Longhorn, and Livermore's Corona. These clusters use a variety of cooling methods and GPU vendors. In total, we collect over 18,800 hours of data across more than 90% of the GPUs in these clusters. Regardless of the application, cluster, GPU vendor, and cooling method, our results show significant variation: 8% (max 22%) average performance variation even though the GPU architecture and vendor SKU are identical within each cluster, with outliers up to 1.5X slower than the median GPU. These results highlight the difficulty in efficiently using existing GPU clusters for modern HPC and scientific workloads, and the need to embrace variability in future accelerator-based systems.
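The variation metrics quoted above (average spread across identical-SKU GPUs, and outlier slowdown relative to the median GPU) can be sketched with a small helper. This is an illustrative computation on hypothetical runtimes, not the paper's actual analysis pipeline:

```python
import statistics

def variation_stats(runtimes):
    """Summarize per-GPU runtime variation for one application.

    runtimes: per-GPU runtimes (seconds) for the same workload on
    identical-SKU GPUs. Returns (spread, worst_slowdown), where
    spread is (max - min) / min and worst_slowdown is the slowest
    GPU's runtime relative to the median GPU.
    """
    med = statistics.median(runtimes)
    spread = (max(runtimes) - min(runtimes)) / min(runtimes)
    worst_slowdown = max(runtimes) / med
    return spread, worst_slowdown

# Hypothetical runtimes (seconds) on four identical-SKU GPUs:
spread, worst = variation_stats([100.0, 104.0, 108.0, 150.0])
```

With these made-up numbers, the slowest GPU runs about 1.4X slower than the median, illustrating the kind of outlier the study reports.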