The Aurora supercomputer, deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes, each of which contains six Intel Data Center Max Series GPUs, Intel's first data center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel's first Xeon processor to include HBM. To achieve Exascale performance, the system uses the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is the largest deployment of the Slingshot fabric to date, with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel-powered nodes and the Slingshot network enabled Aurora to become the second-fastest system on the Top500 list in June 2024 and the fastest system on the HPL-MxP benchmark. The system is one of the most powerful in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design, with a particular focus on the network fabric and the approach taken to validating it. The performance of the system is demonstrated through the results of MPI benchmarks as well as performance benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally, results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM, demonstrating that Aurora provides the throughput, latency, and bandwidth across the system needed to allow applications to perform and scale to large node counts, providing new levels of capability and enabling breakthrough science.