The interconnect is one of the most critical components of large-scale computing systems, and its impact on application performance will grow with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, as well as highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which makes it interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks, on several applications from the datacenter and AI worlds, and on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks.