Cloud computing offers an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premises HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that ultimately limits application scaling. This work analyzes the network performance, scalability, and cost of running HPC workloads on cloud systems. First, we examine latency, bandwidth, and collective communication patterns in detailed small-scale measurements; we then simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premises HPC systems, showing that network (and OS) noise can significantly impact performance and cost at both small and large scale.